probcomp / bayeslite

BayesDB on SQLite. A Bayesian database table for querying the probable implications of data as easily as SQL databases query the data itself.
http://probcomp.csail.mit.edu/software/bayesdb
Apache License 2.0
918 stars 64 forks source link

SIMULATE...GIVEN syntax is unlike SELECT...WHERE and may cause unintended errors #605

Open versar opened 6 years ago

versar commented 6 years ago

The problem is that using the word AND to express multiple constraints in a SIMULATE query is incorrect, however that syntax is correct in a SELECT query so a user may use it by accident. When this happens, no error is returned. The query runs normally. However, the wrong results are produced.

AND works in SELECT queries. In specific, the BQL or SQL expression 1: %bql SELECT var1, var2 FROM t WHERE "var1"=val1 AND "var2"=val2 is interpreted as: 2: %bql SELECT var1, var2 FROM t WHERE ("var1"=val1) AND ("var2"=val2)

However AND does not work normally in SIMULATE queries. In specific, the BQL expression 3: %bql SIMULATE var1, var2 FROM p WHERE "var1"=val1 AND "var2"=val2 LIMIT 10 is interpreted as: 4: %bql SIMULATE var1, var2 FROM p WHERE "var1" = (val1 AND "var2" = val2) LIMIT 10

Therefore, if a user invokes query 3 above, they will not get the answer they expect. Currently, the correct way to write multiple constraints in a SIMULATE expression is for them to be separated by commas, not by AND. Therefore the user will only get the right answer by using the syntax below: 5: %bql SIMULATE var1, var2 FROM p WHERE "var1"=val1, "var2"=val2 LIMIT 10

This is a problem because a user (like me!) who believes naturally that SELECT and SIMULATE have parallel syntax is susceptible to accidentally use AND with SIMULATE instead of commas. The bug is especially insidious because the result will silently be wrong, without an error. And, the results will look almost reasonable -- if the constraints are ignored, the variable will usually just be simulated at values near its overall mean. This is a trap for the unwary, especially when the incorrect results propagate through the rest of an analysis.

Possible resolutions:

  1. Return a syntax error when the compiler sees a SIMULATE query with multiple ANDs instead of commas.
  2. Compile an SIMULATE query with multiple ANDs to interpret the query as if it has commas.