Queries with where() and group()

mshepanski / quince

QUeries IN C++ Expressions

Boost Software License 1.0

31 stars 18 forks source link

Queries with where() and group() #6

Closed oranja closed 8 years ago

oranja commented 8 years ago

Hello again, Before the rant begins, I'd like to thank you for quince. I am moving code to use quince and it made much of it so much more readable and straightforward. IMHO, you've done an incredible job so far.

Now for the issue at hand. Following the manual page on Compositionality, I understand that it's not necessary to distinct SQL's WHERE from HAVING when writing a quince query. As further approval, there is no mention of a having() method in the manuals. So I wrote the following query, which is a complex as it gets in the current project:

const quince::predicate is_empty_family = quince::count(schema->plates_families->plate_id) == int64_t(0);

query<serial> empty_families =
    schema->plates_families
    .where(is_empty_family)
    .group(schema->plates_families->family_id)
    .select(schema->plates_families->family_id);

schema->families
.where(in(schema->families->id, empty_families))
.remove();

(plates_families is an auxiliary table facilitating a many-to-many link between plates and families. after a removal of any of these links, I want to remove any family that no longer participates in any relation with any of the plates).

Running some test code, I get this error:

terminate called after throwing an instance of 'quince::dbms_exception'
  what():  dbms-detected error: SQL logic error or missing database
      (most recent SQL command was `

                DELETE FROM "families"
                 WHERE "families"."id" IN (
                     SELECT "plates_families"."family_id" AS r$22
                     FROM "plates_families"
                     WHERE ("count"("plates_families"."plate_id")) = (?1)
                     GROUP BY r$22 )

' )

And the problem is clear: The inner WHERE should have been replaced with a HAVING statement and put after the GROUP BY statement. I tried to help quince by putting the .where() call after the .group() call in the query, but compilation fails with an error, stating that a grouping object doesn't have a where() method.

Did I miss anything?

For the record: I'm using the dev branch with gcc 5.1.1. Is there any more information I should supply to help diagnose the problem?

oranja commented 8 years ago

Additionally, trying to workaround this problem, I tried to nest the grouping subquery and therefore skip the need for a HAVING clause. As shown here:

const quince::predicate is_empty_family = quince::count(schema->plates_families->plate_id) == int64_t(0);

auto family_counts =
    schema->plates_families
    .group(schema->plates_families->family_id)
    .select(schema->plates_families->family_id,
        quince::count(schema->plates_families->plate_id));

query<serial> empty_families =
    family_counts
    .where(is_empty_family)
    .select(schema->plates_families->family_id);

However the resulting SQL query was exactly the same as the previous, and so the error was exactly the same error as before.

Thank you for your time.

mshepanski commented 8 years ago

Hello again. I'm glad you're finding quince useful, and please: spread the word if you can. The more users, the more bug reports like this, the better the product will be.

I'm going to look into this bug today, but meanwhile I think you can work around it by changing this:

.group(schema->plates_families->family_id)
.select(schema->plates_families->family_id)

to this:

.select(schema->plates_families->family_id)
.distinct()

(In fact that's a workaround that you might prefer to keep, even when the bug is gone.)

mshepanski commented 8 years ago

Looking at your code again, I think I see some problems.

Let's start with your first version. Bearing in mind the compositionality principle (aka "pipeline thinking"), what this code does is build a query schema->plates_families.where(is_empty_family), and then build a more complex query from that. But I think there's a problem right here in the simple query. If we unpack the shorthand is_empty_family, this is what we have:

schema->plates_families
.where(quince::count(schema->plates_families->plate_id) == int64_t(0))

and I think you'll agree that that is not a valid use of an aggregate function.

Your remarks about HAVING indicate to me that you want the record-filtering to be applied to the results of the group()-select() operation. So you should say that: build a query that does group()-select(), and then apply where() to it.

Here I think that your second version looks better, but it has a different and more subtle problem. Consider just family_counts.where(is_empty_family), or in other words:

family_counts
.where(quince::count(schema->plates_families->plate_id) == int64_t(0))

The problem is that schema->plates_families->plate_id is not visible in this context. See the Visibility page, especially Example 4.

There are two ways to correct this. Either you could construct the object quince::count(schema->plates_families->plate_id) just once (it's an exprn_mapper<int64_t>), and use the very same object in both family_counts and the where() clause; or you could leave the definition of family_counts alone, and write .where(family_counts->get<1>() == int64_t(0)) (as described here).

It you fix your second version in either of those ways, then I think it should work. If not, then please try to strip the example down to the simplest form that fails (e.g. do you need to put the query in the context of a remove()? -- and so on), and add the necessary declarations so that I can run it, and I'll try to track the bug down, whether it's in your code or mine.

One more thought, not related to bugs. Your removal of duplicates (what you did with group(), and I suggested that you could do with distinct()) is only for optimization, right? Then I wonder whether, if you didn't do that, and instead handed the DBMS a simpler query to optimize by itself, it would run faster or slower or just the same.

oranja commented 8 years ago

So I think I got what you were saying about visibility (used the same exprn_mapper for all occurrences) and piping (putting where after group()-select()). I changed the inner query code to:

const quince::exprn_mapper<serial> plate_id_exprn = schema->plates_families->plate_id;
const quince::predicate is_empty_family = quince::count(plate_id_exprn) == int64_t(0);
query<serial> empty_families =
    schema->plates_families
   .group(plate_id_exprn)
   .select(plate_id_exprn)
   .where(is_empty_family);

But the problem is exactly the same. It also doesn't matter if where() comes first or last in the query. You're the boss, but I consider this a bug. I believe the syntax should not be more complicated than as above. I wouldn't mind having to call having() explicitly instead of relying on quince to know when a select() should actually emit a HAVING.

This didn't work either:

const quince::exprn_mapper<serial> plate_id_exprn = schema->plates_families->plate_id;
const quince::exprn_mapper<int64_t> count_exprn = quince::count(plate_id_exprn);
const quince::predicate is_empty_family = count_exprn == int64_t(0);

auto family_counts =
    schema->plates_families
    .group(plate_id_exprn)
    .select(schema->plates_families->family_id,
            quince::count(plate_id_exprn));

query<serial> empty_families =
    family_counts
    .where(is_empty_family)
    .select(schema->plates_families->family_id);

And unfortunately, making the last query use get<>() also doesn't alter the final query.

query<serial> empty_families =
        family_counts
        .where(family_counts->get<1>() == int64_t(0))
        .select(family_counts->get<0>());

The choice between group() and distinct() has nothing to do with optimization for me. It is about intentions. Whenever I only want to get distinct values for some column(s), I will use distinct(). Whenever I want the RDBMS to create groups, so that I may apply aggregate/analytic functions and filter groups with HAVING, I will use group(). If anything, I would expect the distinct query to return faster than the group-by query. Actually many RDBMS engines convert group-by queries to a distinct queries if no aggregate function is used.

Just for the record, this is how the final SQL query should look like:

DELETE FROM "families"
WHERE "families"."id" IN (
    SELECT "plates_families"."family_id" AS r$22, count(*)
    FROM "plates_families"
    GROUP BY r$22
    HAVING ("count"("plates_families"."plate_id")) = 0
    )

How would you get this from quince if you had to write the quince::query from scratch. I will post a minimal example in a few minutes.

btw: https:// access to quince-lib.com results in a security warning from both Firefox and Chrome. The certificate is both self-signed and expired :(

oranja commented 8 years ago

Following is a minimal example you can build & run to recreate the problem easily: Makefile

CXX=g++
CXXFLAGS=-std=c++11
LDLIBS=-ldl -lquince -lquince-sqlite -lboost_chrono -lboost_date_time -lboost_filesystem -lboost_system -lboost_thread

all: quince_having movies.db

%.db:
    touch $@

clean:
    rm -f quince_having *.o *.db

.PHONY: all clean

quince_having.cpp

#include <string>
#include <quince/quince.h>
#include <quince_sqlite/database.h>

using namespace quince;
using namespace std;

struct movie_t {
    string name;
    string director;
};
QUINCE_MAP_CLASS(movie_t, (name)(director))

int main() {
    const quince_sqlite::database db("movies.db");
    table<movie_t> movies(db, "movies");
    movies.specify_key(movies->name);
    movies.open();

    // Insert test data
    movies.insert({ "The Phantom Menace", "George Lucas" });
    movies.insert({ "Attack of the Clones", "George Lucas" });
    movies.insert({ "Revenge of the Sith", "George Lucas" });
    movies.insert({ "The Fellowship of the Ring", "Peter Jackson" });
    movies.insert({ "The Two Towers", "Peter Jackson" });
    movies.insert({ "The Return of the King", "Peter Jackson" });
    movies.insert({ "Goodfellas", "Martin Scorsese" });
    movies.insert({ "The Departed", "Martin Scorsese" });

    // The problematic query.
    // Should produce a GROUP BY...HAVING sql query. Instead it produces a WHERE...GROUP BY sql query.
    const exprn_mapper<string> name_exprn = movies->name;
    const exprn_mapper<string> director_exprn = movies->director;
    const exprn_mapper<int64_t> count_exprn = quince::count(name_exprn);
    const auto trilogy_directors =
        movies
        .where(count_exprn == int64_t(3))
        .group(director_exprn)
        .select(director_exprn);

    // Should output "Peter Jackson" and "George Lucas"
    for (string director : trilogy_directors) {
        cout << director << endl;
    }

    return 0;
}

Then ...

$ make
g++ -std=c++11    quince_having.cpp  -ldl -lquince -lquince-sqlite -lboost_chrono -lboost_date_time -lboost_filesystem -lboost_system -lboost_thread -o quince_having
touch movies.db

$ ./quince_having
terminate called after throwing an instance of 'quince::dbms_exception'
  what():  dbms-detected error: SQL logic error or missing database (most recent SQL command was `SELECT "movies"."director" AS r$8 FROM "movies" WHERE ("count"("movies"."name")) = (?1) GROUP BY r$8')

The expected query should look something like this:

SELECT director FROM movies GROUP BY director HAVING count(movies.name) = 3

Which actually returns:

# | director
1 | Peter Jackson
2 | George Lucas

But instead, quince composes a query of this form:

SELECT director FROM movies WHERE count(name) = 3 GROUP BY director

Which is malformed.

Error while executing SQL query on database 'movies': misuse of aggregate: count()

Also, you suggested to use distinct(), but I can't see how this should solve the issue. This returns the same error as before (misuse of count()):

SELECT DISTINCT director FROM movies WHERE count(name) = 3

And more generally, DISTINCT is not exactly what we want here.

SELECT DISTINCT director, count(name) FROM movies

Returns:

# | director | count(name)
1 | Peter Jackson | 8

Whereas this returns a more logical result:

SELECT director, count(name) FROM movies GROUP BY director

# | director | count(name)
1 | George Lucas | 3
2 | Martin Scorsese | 2
3 | Peter Jackson | 3

oranja commented 8 years ago

Major progress!

Previously I didn't notice that quince produced a different SQL query if I put the where() after select(). Now I see that:

    const auto trilogy_directors =
        movies
        .group(director_exprn)
        .select(director_exprn)
        .where(count_exprn == int64_t(3));

Outputs:

terminate called after throwing an instance of 'quince::dbms_exception'
  what():  dbms-detected error: SQL logic error or missing database (most recent SQL command was `SELECT r$8 FROM (SELECT "movies"."director" AS r$8 FROM "movies" GROUP BY r$8) AS q$1 WHERE ("count"("movies"."name")) = (?1)')

Eventually I got to the bottom of what you meant when you referred to Visibility... Putting into work all the recent insights, I am happy to discover that the following works as intended:

    const auto trilogy_directors =
        movies
        .group(director_exprn)
        .select(director_exprn, count_exprn)
        .where(count_exprn == int64_t(3))
        .select(director_exprn);

:gift:

Only that, still with highest respect for your excellent work, I would call this a workaround and not a solution. HAVING is as standard SQL as they get, AFAIK.

EDIT: ~~Now I'm trying to figure out why the workaround from above doesn't work in the actual code. So far I found some embarrassing mistakes in my code, but couldn't solve the bigger issue, nor have I managed to reproduce the same problem in the minimal example (which is not so minimal anymore). Bottom line: there might be something more in play, I will report back later.~~ Caught the culprit. It was me putting the same code in two separate places and forgetting about the second. DRY DRY DRY.

mshepanski commented 8 years ago

First, let me clear up a couple of side issues.

You can disregard my remark about distinct(). While it's true that, in general, q.group(f).select(f) is equivalent to q.select(f).distinct(), I have since realised that you need to call select() with two arguments, so that point doesn't apply.
You can disregard my point about optimization. I was looking at the first version of your code, where the query inside in() was of the form q.group(f).select(f), but that is not the version you'll be ending up with.

Onto more important things. Yes, it really does matter whether you put the where() before or after the group()-select(). When you say T.where(pred).group(g).select(e), it's exactly the same as writing this:

const auto simple_query = T.where(pred);
const auto complex_query = simple_query.group(g).select(e);

-- which I now see is not what you want. You want to filter out records from the result of a group()-select(), and quince lets you say exactly that, by constructing a query with group()-select() and then applying where() to it. I realise that this is different from SQL's approach, which distinguishes the two ideas by using the two words WHERE and HAVING. I've taken the approach I have because it allows me to make clear statements about what q.where(pred) does for any q, but I fully recognise that this is a departure from the SQL way, and can surprise people who expect quince syntax to be closer to SQL's.

Now a more technical issue, which shouldn't affect you as a quince user. Quince never actually generates HAVING. It could, but for various reasons (my own laziness, plus not seeing much downside) I decided to do it all with WHERE. If quince needs to apply WHERE to the result of SELECT-GROUP BY, then it sinks the SELECT-GROUP BY into a subquery and applies WHERE to that. I believe this achieves the same result as using HAVING and not sinking SELECT-GROUP BY into a subquery.

Sorry about the certificate problem. I'm not sure what's going on there. I'll look into it tomorrow.

And finally, just to confirm, is this issue (apart from the certificate) fully solved now?

oranja commented 8 years ago

With your statement that Quince won't generate HAVING clauses, by design rather than by oversight, I consider this resolved. Hopefully this conversation will serve more in the future to overcome the difference between Quince and SQL, or you'll find some middle ground that makes Quince more intuitive to newcomers from the SQL paradigm without compromising any of your design principles.

Thank you for everything.

mshepanski commented 8 years ago

Just following up on the certificate issue ...

There are no secrets on quince-lib.com, and it doesn't ask for any of your secrets either, so I haven't installed an SSL certificate. It's an http site, not https.

If you try to access it with https then it defaults to the hosting service's self-signed certificate, which is what you're seeing.