jtablesaw / tablesaw

Java dataframe and visualization library
https://jtablesaw.github.io/tablesaw/
Apache License 2.0
3.55k stars 645 forks source link

Implement a more SQL select like query capability #576

Open lwhite1 opened 5 years ago

lwhite1 commented 5 years ago

preliminary notes for creating a way to query with a SQL select-ish approach

SQL-like API extensions

SQL is a familiar language with a powerful select statement for queries. Building on the changes above, we could provide a syntax that is "SQL-like", that provides some additional capability. It works like this:

Select method

Table would implement a static method called select(), with several overrides:

Projection select();
Projection select(String ... columnNames);
Projection select(Column ... columns);
Projection select(List<Column> columns, Column ... extraColumns);

The Projection class represents a set of columns derived in some fashion from the source table. This extends Tablesaw to enable queries that return tables containing a subset of columns as well as a subset of rows.

The no-arg select() version returns a Projection containing all the columns of the source table.

The last version select(List<Column> columns, Column ... extraColumns); returns a list of columns (often this will be all the columns in the source table) plus an array of 0 or more additional columns. This version provides two benefits:

  1. It lets you specify columns as a list, rather than as an array
  2. It lets you supplement the list by adding columns derived by applying map functions to one or more of the source table's original columns, just as SQL provides a means to create columns as functions of other columns. For example:
select(product, 
       qty, 
       price, 
       qty.multiply(price).setName("extension"),
       num("extension").multiply(.08).setName("tax"));  

Note: we should provide as(String name) as a synonym for setName() since it's both shorter and more SQL like.

From method

The from() method is implemented on Projection and specifies a table which serves as the source table for the query. Note that we allow only one source table, but it can be provided via a function like join().

The return value of from() is a ReifiedProjection which is a Projection that has a table and so is ready to be queried.

// definition as implemented on Projection
ReifiedProjection from(Table t);

// usage examples
ReifiedProjection rp;
rp = select().from(myTable);
rp = select().from(myTable).where(num("Age").isLessThan(10)));

Where method

The where() method implementation is the same as the Function<Table, Selection> version recently implemented in Table, and reimplemented on ReifiedProjection.

// definition as implemented on ReifiedProjection
Table where(Function<Table, Selection> deferredSelection);

// usage example
Table t = select().from(myTable)
  .where(num("age").isLessThan(10));

Aggregation, subgroups, and the having method

Aggregation may be performed using the existing Summarizer functionality that underlies the summarize() methods in Table.

// definition as implemented on ReifiedProjection
Table where(Table t);

// usage example
Summarizer summarizer = select().from(myTable).summarize("age", "height", mean);

// or converting directly to a table
Table summary = select().from(myTable).summarize("age", "height", mean).by("state");

In the above, by() returns a table by calling creating separate summarizers for each state and calling apply() on each summarizer to get the state-specific summaries, and aggregating them into a single table. We extend the functionality (and make it more SQL-like) by providing "groupBy()" as a NEAR synonym. The difference being that groupBy() does not execute the query directly. You can call apply() to make it work identically to by(). The following return exactly the same results:

Table summary = select().from(myTable).summarize("age", "height", mean).by("state");
Table summary = 
  select().from(myTable).summarize("age", "height", mean).groupBy("state").apply();

Which allows us to add a having() method to summarizer that lets us filter the results before returning the final table. This could potentially save a considerable amount of memory.

Table summary = 
  select().from(myTable)
    .summarize("age", "height", mean, median)
      .groupBy("state")
        .having(num("mean height").isGreaterThan(40));

Having calls apply() behind the scenes to execute the query.

It seems to make sense that select() when used with a summarizer only selects the columns that make sense in the aggregate. That is, any columns that are summarized, plus any columns used in the groupBy() clause. This would be basically the same as SQL I believe.

It could be made more SQL like by overriding select to take the parameters passed into the summarize clause above. The fact that there are functions passed along in the argument list would be enough to show that it's a summary that is desired. This would look like:

Table summary = 
  select("age", "height", mean, median)
      .from(myTable)
          .groupBy("state")
              .having(num("mean height").isGreaterThan(40));
ryancerf commented 5 years ago

This is great. I really like the direction you are moving in.

It looks like we have two different ways to get GroupBy like functionality. splitOn and summarize. Do they have separate purposes? Should one of them be deprecated in favor of the other? Curious to hear your thoughts.

lwhite1 commented 5 years ago

Summarizer is built on split-on. It does all the work of combining the results of the grouped operations, as well as making it easier to formulate the inputs to split-on.

Split-on provides the low-level support, and can be used for basically anything that requires subtables, but for calculating basic stats summarize is much easier.

Neither should be deprecated.

lwhite1 commented 5 years ago

To expand a little on my last comment. It's not clear which would be more useful for these extensions, should they get implemented.

lwhite1 commented 5 years ago

@ryancerf I made a few edits to the quasi-spec above. The main thing is that we could make a simple change that makes it really sql like. This is the basic idea:

select() when used with a summarizer only selects the columns that make sense in the aggregate: any columns that are summarized, plus any columns used in the groupBy() clause. This would be basically the same as SQL I believe.

It could actually be made more SQL like by overriding select to take the parameters passed into the summarize clause. The fact that there are functions passed along in the argument list would signify that a summary is desired. This:

Table summary = 
  select().from(myTable)
    .summarize("age", "height", mean, median)
      .groupBy("state")
        .having(num("mean height").isGreaterThan(40));

then becomes:

Table summary = 
  select("age", "height", mean, median)
      .from(myTable)
          .groupBy("state")
              .having(num("mean height").isGreaterThan(40));

The resulting table has five columns: state, mean and median age, mean and median height.

benmccann commented 5 years ago

This reminds me quite a bit of Jooq. You may be interested to check out its API for ideas

lwhite1 commented 5 years ago

Good idea. I used it quite a bit on a project a few years ago

On Thu, Aug 8, 2019 at 5:56 PM Ben McCann notifications@github.com wrote:

This reminds me quite a bit of Jooq. You may be interested to check out its API for ideas

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/jtablesaw/tablesaw/issues/576?email_source=notifications&email_token=AA2FPAQJBC4LUNXSBGA2X2TQDSI7DA5CNFSM4IKCUTD2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD35AMEI#issuecomment-519702033, or mute the thread https://github.com/notifications/unsubscribe-auth/AA2FPAT6BAC2XYBZYV7NTV3QDSI7DANCNFSM4IKCUTDQ .

ryancerf commented 5 years ago

Your second example is more SQL like, but this is one area where I personally do not like SQL's syntax. I have taught a number of people basic SQL and they often get hung by the fact that the Group By clause comes after the Select clause, but logically happens first. If I could rewrite SQL I would put the clauses in the following order.

For that reason I like the existing splitOn syntax.

table.splitOn("state", "gender").aggregate(...)

Off the top of my head my preferred SQL like API is.

table.query()
  .leftJoin(genderTable).using("id")
  .where(intColumn("age").isGreaterThan.(18))
  .groupBy("state", "gender")
  .aggregate("age" mean, "height", median) // Alternating is more sql like.
  .having(num("mean age").isGreaterThan(40))
  .orderBy("mean age", ASC, "median height" DESC) // Alternating is more sql like
  .execute();

Couple of other ideas.

In the long run I think this kind of an API has the possibility to make it clear to users where to go for basic SQL like data transformations and really improve the library. It might even turn out to be tablesaw's the killer feature. I am excited to help, no matter what API we decide on.

lwhite1 commented 5 years ago

Wow. You have given me a lot to think about.

Some comments and questions below.

If I could rewrite SQL I would put the clauses in the following order.

This is more logical, and the syntax is clearer, but whenever I think about a better query syntax, I can't help thinking of QUEL. It was an improvement on SQL but never took off. But since we're not building a database here, we have more flexibility.

table.query()   

Does query have the same column-specification capabilities as select() in my example? On execute it returns a column subset of table?

  .leftJoin(genderTable).using("id")  

Also returns a projection (now containing the join object), presumably, so we can skip it if it's not needed. We don't have the ability specify columns on join statements currently, afaicr. So we could add that or live without it.

  .where(intColumn("age").isGreaterThan.(18)) 

Normally this would return a table, but we could defer that. It would return the same object as above so it can be skipped. Maybe that object should be called Query instead of projection.

If this were really SQL, where would have to support sub-queries, and operations like EXISTS. I've been thinking about that a little, but no ideas yet

  .groupBy("state", "gender")  

Returns a TableSliceGroup on execution? And a Query object before that? This is problematic when in this order, since it would mean returning something other than a table on execution. It could be reversed with aggregate() so it also returns a table on execute(). Or we would throw an exception if execute were called here.

If we're building a query object here, does the order matter, or is something more flexible like the traditional Java builder work also? I'm starting to think this is a Query builder

  .aggregate("age" mean, "height", median) // Alternating is more sql like

As in SQL, this could conflict with the results of the projection and join steps. Once you aggregate by groups you're limited in what columns can be returned. We could filter the columns the requested columns here, or throw an exception if something is explicitly included above that can no longer be returned.

I'm not sure if this is easily implemented over the current aggregation code.

Java being Java, we'd have to either combine the pairs into objects or accept an Object[], or accept a big string like SQL and parse it. (yuck)

It's more verbose if you wanted 3 aggregates on one column or vice-versa. On the other hand it's more granular than what we have. FWIW, I don't love our current syntax on summarize() either. Can we do better than either approach?

  .having(num("mean age").isGreaterThan(40))

An exception if there's no group by specified, maybe. Although it wouldn't break as long as there is an aggregate clause. Probably better to allow that for generality

  .orderBy("mean age", ASC, "median height" DESC) // Alternating is more sql like

Our current sort options are better, I think, especially with the ability to do: "+mean age", "-median height", or to create a Sort object. This has the same syntax issues as the SQL-like aggregate above. Maybe we should consider creating an Aggregation object that's like the Sort object - with it's own builder.

  .execute();

It would also be great if we could defer the execution of the entire query until we call execute.

Yeah that makes complete sense for something like this. It really is starting to feel like the builder approach used in jsplot.

I am experimenting with building a fluent window function (aka analytic function) query builder

Now I have to go read the link on analytic function :)

A fluent interface could effectively guide users from clause to clause (method to method) making it very easy to use. Give me a couple of days and I will commit the experimental fluent analytic query builder to my fork of tablesaw. You and Ben can then clone it and try it out.

Sounds good. Is it an alternative to this, or the same thing?

ryancerf commented 5 years ago

Yes, this is builder approach.

Yeah that makes complete sense for something like this. It really is starting to feel like the builder approach used in jsplot.

It is totally the builder approach. I should have stated that explicitly. Each step would be building up a query object.

Something like this would return a POJO with data explaining the transformation.

table.query()
  .where(intColumn("age").isGreaterThan.(18))
  .orderBy("mean age", ASC, "median height" DESC)
  .build();

QUEL

This is more logical, and the syntax is clearer, but whenever I think about a better query syntax, I can't help thinking of QUEL. It was an improvement on SQL but never took off. But since we're not building a database here, we have more flexibility.

Absolutely. I have never even hard of QUEL! I am talking about moving SELECT father down. It is a pretty small change and even LINQ does something similar.

I think there is a super important lesson here. This is my design approach.

Our goal in adding SQL like functionality to tablesaw is NOT to create a better version of SQL it is to create a familiar syntax for basic data transformations that users already understand.

When faced with a tradeoff between "better" syntax and syntax that is more SQL-like be weary of picking the "better" syntax, because most of the benefit of the the SQL-like syntax is that it is standard and familiar.

Deferred Execution

Does query have the same column-specification capabilities as select() in my example? On execute it returns a column subset of table?

What the final query returns depends on the whole query. That method would return a query builder object.

Also returns a projection (now containing the join object), presumably, so we can skip it if it's not needed. We don't have the ability specify columns on join statements currently, afaicr. So we could add that or live without it.

Same as above. Would return a query builder.

.aggregate("age" mean, "height", median) // Alternating is more sql like

There are a bunch of options. Here is one that is similar to what I am doing in my analytic function experiment.

.query()
  .groupBy("region")
  .sum("age").as("meanAge")
  .count("age").as("ageCount")
  .build()

Fluent Interface

Sounds good. Is it an alternative to this, or the same thing?

The fluent interface is a design pattern for how to effectively implement the proposed syntax. I recommend reading up on the design pattern.

.having(num("mean age").isGreaterThan(40))

An exception if there's no group by specified, maybe. Although it wouldn't break as long as there is an aggregate clause. Probably better to allow that for generality

If we are using a fluent interface then you could never get to having unless you specified a group by.

Big picture.

There is a ton to think about and we need to be careful that we do not end up implementing a full blown query analyzer etc.

As a next step I think we should work on #582. It is a much smaller easier to handle (although non-trivial). We can work out a lot of the design decisions there and see how it goes before tackling this much more complicated problem.

shashvat-kedia commented 5 years ago

@lwhite1 Can I work on this?

ryancerf commented 5 years ago

Interaction between SQL like query capability and DeferredColumn

I was giving a little bit of thought to how the DeferredColumn features would interact with the SQL like query capability and we may want to make some changes to the deferred column implementation before we release it. #584

In the future I would like something like this to be possible:

table.query().where(str("co1").isEqualTo("USA")).build().toSqlString();

to return:

SELECT * FROM myTable WHERE col1 = "USA";

With the current implementation of DeferredColumn this is not possible. I have not thought this all the way through yet, but I think this would be done if methods like.isEqualTo on DeferredColumn as well as and on QuerySupport returned a custom class. Maybe BooleanValueExpression.

The BooleanValueExpression class could contain a Function<Table, Selection> as well as the information needed to recreate the col1 = "USA". Either way having the DeferredColumn methods return a Custom class will give us more flexibility in the future.

How should users access the SQL like functionality?

I think the API would be better if there was only one or two entry points to access the sql query capabilities rather than having the methods directly on table. Something like table.query() that returned a query builder object would make a good entry point.

That means we may want to wait to add the where(Function<Table, Selection> selection) method to table and make users access it via table.query()

@lwhite1 What do you think?