nene / sql-parser-cst

Parses SQL into Concrete Syntax Tree (CST)
GNU General Public License v2.0
115 stars 7 forks source link

Performance issues with many logical conditions #52

Closed imathews closed 4 months ago

imathews commented 4 months ago

Hi, and thanks for all the great work on this library! We've recently run into some pretty significant performance issues with complex logical conditions (first observed in 0.17.1, and still present on the latest 0.21.2 release).

I'm including the reproducible snippet below, where parsing a single CASE statement takes ~900ms (latest Chrome on an M1 MBP). Converting the statement to a simple logical expression improves performance a bit (~250ms), but is still pretty slow. In the real-world example that this code is derived from, the CASE statement contains ~36 conditions, which leads to a parse time of around 45 seconds.

I've tested this with different dialects, and all show the same results, except for postgres, which continued to spin for well over a minute before I stopped it. Please let me know if there's any additional information that I can provide that would be helpful.

-- query1
  SELECT
    CASE
      WHEN ((t2.dgns_prcdr=1) AND (((t26.dgns_prcdr=1) AND (t2.complication=t26.complication)) OR ((t27.dgns_prcdr=1) AND (t2.complication=t27.complication)) OR ((t28.dgns_prcdr=1) AND (t2.complication=t28.complication)) OR ((t29.dgns_prcdr=1) AND (t2.complication=t29.complication)) OR ((t30.dgns_prcdr=1) AND (t2.complication=t30.complication)) OR ((t31.dgns_prcdr=1) AND (t2.complication=t31.complication)) OR ((t32.dgns_prcdr=1) AND (t2.complication=t32.complication)) OR ((t33.dgns_prcdr=1) AND (t2.complication=t33.complication)) OR ((t34.dgns_prcdr=1) AND (t2.complication=t34.complication)) OR ((t35.dgns_prcdr=1) AND (t2.complication=t35.complication)) OR ((t36.dgns_prcdr=1) AND (t2.complication=t36.complication)) OR ((t37.dgns_prcdr=1) AND (t2.complication=t37.complication)) OR ((t38.dgns_prcdr=1) AND (t2.complication=t38.complication)) OR ((t39.dgns_prcdr=1) AND (t2.complication=t39.complication)) OR ((t40.dgns_prcdr=1) AND (t2.complication=t40.complication)) OR ((t41.dgns_prcdr=1) AND (t2.complication=t41.complication)) OR ((t42.dgns_prcdr=1) AND (t2.complication=t42.complication)) OR ((t43.dgns_prcdr=1) AND (t2.complication=t43.complication)) OR ((t44.dgns_prcdr=1) AND (t2.complication=t44.complication)) OR ((t45.dgns_prcdr=1) AND (t2.complication=t45.complication)) OR ((t46.dgns_prcdr=1) AND (t2.complication=t46.complication)) OR ((t47.dgns_prcdr=1) AND (t2.complication=t47.complication)) OR ((t48.dgns_prcdr=1) AND (t2.complication=t48.complication)) OR ((t49.dgns_prcdr=1) AND (t2.complication=t49.complication)))) THEN 1
     ELSE 0
    END AS dgns_prcdr
  FROM t0
 -- ... bunch of left joins for t1, t2, ...t46, not relevant to reproduce
-- query2: A bit faster if we remove the CASE statement, but still not great
  SELECT
   ((t2.dgns_prcdr=1) AND (((t26.dgns_prcdr=1) AND (t2.complication=t26.complication)) OR ((t27.dgns_prcdr=1) AND (t2.complication=t27.complication)) OR ((t28.dgns_prcdr=1) AND (t2.complication=t28.complication)) OR ((t29.dgns_prcdr=1) AND (t2.complication=t29.complication)) OR ((t30.dgns_prcdr=1) AND (t2.complication=t30.complication)) OR ((t31.dgns_prcdr=1) AND (t2.complication=t31.complication)) OR ((t32.dgns_prcdr=1) AND (t2.complication=t32.complication)) OR ((t33.dgns_prcdr=1) AND (t2.complication=t33.complication)) OR ((t34.dgns_prcdr=1) AND (t2.complication=t34.complication)) OR ((t35.dgns_prcdr=1) AND (t2.complication=t35.complication)) OR ((t36.dgns_prcdr=1) AND (t2.complication=t36.complication)) OR ((t37.dgns_prcdr=1) AND (t2.complication=t37.complication)) OR ((t38.dgns_prcdr=1) AND (t2.complication=t38.complication)) OR ((t39.dgns_prcdr=1) AND (t2.complication=t39.complication)) OR ((t40.dgns_prcdr=1) AND (t2.complication=t40.complication)) OR ((t41.dgns_prcdr=1) AND (t2.complication=t41.complication)) OR ((t42.dgns_prcdr=1) AND (t2.complication=t42.complication)) OR ((t43.dgns_prcdr=1) AND (t2.complication=t43.complication)) OR ((t44.dgns_prcdr=1) AND (t2.complication=t44.complication)) OR ((t45.dgns_prcdr=1) AND (t2.complication=t45.complication)) OR ((t46.dgns_prcdr=1) AND (t2.complication=t46.complication)) OR ((t47.dgns_prcdr=1) AND (t2.complication=t47.complication)) OR ((t48.dgns_prcdr=1) AND (t2.complication=t48.complication)) OR ((t49.dgns_prcdr=1) AND (t2.complication=t49.complication)))) AS dgns_prcdr
  FROM t0
 -- ... bunch of left joins for t1, t2, ...t46, not relevant to reproduce
parser.parse(query1, { dialect: 'bigquery' })
// -> 900 ms

parser.parse(query2, { dialect: 'bigquery' })
// -> 250 ms
nene commented 4 months ago

Thanks for reporting. This is definitely unacceptably slow. I'll see what I can do.

nene commented 4 months ago

This should now be fixed in 0.22.0 release, where according to my testing the above SQL parses in a few milliseconds:

sqlite x 300 ops/sec ±7.58% (74 runs sampled)
    Mean: 0.0033 seconds
mysql x 236 ops/sec ±9.31% (69 runs sampled)
    Mean: 0.0042 seconds
bigquery x 345 ops/sec ±1.65% (89 runs sampled)
    Mean: 0.0029 seconds
postgresql x 308 ops/sec ±1.60% (86 runs sampled)
    Mean: 0.0032 seconds

Feel free to report if you notice any slowness elsewhere. I have mainly been concentrating on feature-completeness, so I suspect there are more areas with lousy performance.

imathews commented 4 months ago

Amazing! Thank you