verdict-project / verdict

Interactive-Speed Analytics: 200x Faster, 200x Fewer Cluster Resources, Approximate Query Processing
http://verdictdb.org
Apache License 2.0
248 stars 66 forks source link

Fix nullptr exception when aggregate column contains null value #322

Closed Beastjoe closed 5 years ago

Beastjoe commented 5 years ago

Related issue #321.

The problem is in the execution of selectAsyncAggNode, where H2 jdbc driver will return NULL value if no rows are selected for aggregating. I think adding the check inside QueryResultAccuracyEstimatorFromDifference is enough, since DbmsQueryResult will interpret NULL as 0 automatically.

codecov-io commented 5 years ago

Codecov Report

Merging #322 into master will decrease coverage by 0.08%. The diff coverage is 61.91%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #322      +/-   ##
==========================================
- Coverage   70.66%   70.59%   -0.07%     
==========================================
  Files         167      167              
  Lines       11178    11198      +20     
  Branches     1832     1839       +7     
==========================================
+ Hits         7898     7904       +6     
- Misses       2827     2835       +8     
- Partials      453      459       +6
Impacted Files Coverage Δ
...in/java/org/verdictdb/core/sqlobject/ColumnOp.java 63.2% <100%> (+0.6%) :arrow_up:
...or/QueryResultAccuracyEstimatorFromDifference.java 84.16% <57.9%> (-8.61%) :arrow_down:
...verdictdb/core/querying/ola/InMemoryAggregate.java 84.79% <0%> (-2.89%) :arrow_down:
...ain/java/org/verdictdb/commons/DBTablePrinter.java 53.8% <0%> (-2.75%) :arrow_down:
...ictdb/core/querying/QueryNodeWithPlaceHolders.java 88.28% <0%> (-0.68%) :arrow_down:
...a/org/verdictdb/jdbc41/VerdictStreamResultSet.java 8.56% <0%> (+0.27%) :arrow_up:
.../verdictdb/core/execplan/ExecutableNodeRunner.java 81.12% <0%> (+0.86%) :arrow_up:
...org/verdictdb/sqlreader/ScrambleTableReplacer.java 87.26% <0%> (+1.97%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update f6e366a...50a45d9. Read the comment docs.

pyongjoo commented 5 years ago

Could we somehow keep NULL values for SUM? According to http://www.mysqltutorial.org/mysql-sum/, the standard behavior seems to be that SUM must return NULL when no rows are chosen. But, COUNT must return 0 when no rows are chosen.

Beastjoe commented 5 years ago

I think currently, the value inside DbmsQueryResult is NULL if sum select no rows. However, when checking converge in AccuracyEstimator, NULL will be treated as 0. Is this implementation correct?

pyongjoo commented 5 years ago

I think the behavior is correct. Can you add unit tests for confirming it?

Beastjoe commented 5 years ago

Yes. One more question: when user try to get null value using methods like getDouble(), will methods return 0 or null? The current implementation will return 0 if the value is null.

pyongjoo commented 5 years ago

So, how getValue() must be implemented and getDouble() must be implemented are two separate issues.

getValue() must return true values (e.g., NULL for the summation of an empty set). However, if the user anyway wants an answer using getDouble(), we must cast it to some double value. For this conversion, I was following the JDBC standard, but you can double-check.

Beastjoe commented 5 years ago

Since we sum up the aggregate values in selectAsyncAggExecutionNode, count() will also return NULL if no rows are selected. Does it matter?

pyongjoo commented 5 years ago

Can you explain more details why the sum of counts will be NULL?

The individual counts could be zero when no rows are chosen; thus, the sum of them could also be zero.

Beastjoe commented 5 years ago

Let me double check this.

Beastjoe commented 5 years ago

The reason is when the aggregation query contains group-by and no row is selected, it will return empty set. Therefore, DbmsQueryResult returned from SelectAggExecutionNode contains nothing. Then, SelectAggAsyncAggExecutionNode tries to sum up the result, which results in NULL value for count().

pyongjoo commented 5 years ago

Hmm. This is interesting and tricky. Let's examine the original select query, and in case a select item includes count but not other aggregates, replace NULL -> 0. Otherwise, I don't see any harm in keeping null values. FYI, the original select query is already there in the accuracy estimator. I included it for determining aggregate columns.

When it comes to the convergence test, the test must fail for both cases (either 0 or NULL). I think zero values naturally make the test fail, so probably need to add a special treatment for NULL.

Beastjoe commented 5 years ago

I have made the changes. Please have a check.