[query] Handle SQL NULL for aggregates over 0 rows. (#684)

mozilla / mentat

UNMAINTAINED A persistent, relational store inspired by Datomic and DataScript.

https://mozilla.github.io/mentat/

Apache License 2.0

1.65k stars 115 forks source link

[query] Handle SQL NULL for aggregates over 0 rows. (#684) #688

Closed ncalexan closed 6 years ago

rnewman commented 6 years ago

This seems like a perfectly fine 'code' solution to the problem.

I believe another solution is available in SQL.

[:find (max ?x) (count ?x) :where [_ :db/txInstant ?x]]

turns to

SELECT max(`?x`) AS `(max ?x)`, count(`?x`) AS `(count ?x)` 
FROM (
  SELECT DISTINCT `datoms00`.v AS `?x`
  FROM `datoms` AS `datoms00`
  WHERE `datoms00`.a = 3
)

and SQLite says:

Aggregate max() returns NULL if and only if there are no non-NULL values in the group.

so we should handle that case of no ?xs being bound.

If we just tag on the end there…

WHERE `(max ?x)` IS NOT NULL

then we don't need to make any changes to the results handling code.

We'd do this in re_project, collecting the relevant (might-be-null) columns just as we collect group_by_cols in CombinedProjection.

What do you think, @ncalexan?

ncalexan commented 6 years ago

What do you think, @ncalexan?

Yeah, this is much better. Wish I'd waited before doing the most obvious thing :/

ncalexan commented 6 years ago

WHERE (max ?x) IS NOT NULL

Sadly, this doesn't seem valid: both

WHERE `(max ?x)` IS NOT NULL

and

WHERE `max(?x)` IS NOT NULL

yield "misuse of aggregate: sum()" from SQLite.

rnewman commented 6 years ago

In both of those the backticks enclose the identifier. Try

WHERE max(`?x`) IS NOT NULL

rnewman commented 6 years ago

Oh, never mind; that's not valid. The original formulation I suggested should work:

sqlite> create table foo ( x integer );
sqlite> select max(x) from foo;
max(x)
----------

sqlite> select max(x) from foo where max(x) is not null;
Error: misuse of aggregate: max()
sqlite> select `max(?x)` from (select max(x) as `max(?x)` from foo) where `max(?x)` is not null;
sqlite>

We already use that inner query technique for GROUP BY.

ncalexan commented 6 years ago

We already use that inner query technique for GROUP BY.

I thought of this, but wasn't sure the impact of nesting yet another SELECT. Do you think it's worthwhile?

rnewman commented 6 years ago

We always group for aggregates, so use that sub select?

ncalexan commented 6 years ago

We always group for aggregates, so use that sub select?

This is not true, for reasons I'm still trying to understand. This isn't quite as straight-forward as it looks, because we need to use HAVING instead of WHERE; in the course of experimenting, I discovered that we don't always GROUP BY, and that HAVING requires GROUP BY. I'm trying to figure out what no GROUP BY means when we have two aggregates right now.

rnewman commented 6 years ago

We group when there are non-aggregate bindings returned. The tests for aggregates in the top-level query tests should explain the differences between aggregate-only, aggregate with var, aggregate with :with, and aggregate with the.

My point was that we have the mechanism in place to generate that subquery in order to handle the need for grouping; we can extend that to grouping and/or null-dropping.

ncalexan commented 6 years ago

My point was that we have the mechanism in place to generate that subquery in order to handle the need for grouping; we can extend that to grouping and/or null-dropping.

In some way, yes -- but as I'm discovering, this is not nearly as simple as we initially thought. Newest challenge is to figure out how to use HAVING where there's no GROUP BY and the aggregates are independent. I think it's easier to just use another SELECT in this case, like

SELECT * FROM (...) WHERE `(max ?a)` IS NOT NULL

ncalexan commented 6 years ago

@rnewman this isn't elegant but it's simple. What do you think?