Open jeromesimeon opened 8 years ago
I have a possible approach to doing this incrementally and want to try it out. Have discussed the idea with @jeromesimeon. The basic idea is to exploit the fact that SQL++ is a superset of SQL and start by replacing our presto dependency with a dependency of the SQL++ parser in AsterixDB. That is an open source project in Apache, provided under the Apache License. Initially, the SQL++ visitor will only parse as much of SQL++ as our presto visitor does (some subset of the SQL subset of SQL++), and will produce the identical S-expression form. The sanity check at this point is does it actually support at least as much SQL as before? Maybe (possibly?) more? Then, we add parsing support for SQL++ extensions. Initially, the qcert phases starting from the ingestion of the s-expression will choke, but then we can gradually extend support through the compiler phases, linking up with existing richer expression support in the intermediate languages.
I will add a task list to the description based on this idea. I've already done a first step, which is to acquire the open source code and organize it to satisfy our likely dependency on it. Code that is actually dependent on it will be initially placed in a package that our 'make' step does not try to compile and that eclipse metadata will explicitly ignore, so others in the group will not have to duplicate my tentative setup until it is more mature. This crude approach will be replaced by an actual git branch as soon as things become complicated enough to require it.
I have added some rough instructions for reproducing my setup in case my plane gets sucked into the Bermuda triangle next week. I don't expect anyone to actually try it but it's there in case you want to.
The new SQL front end based on SQL++ now supports the easy queries in samples/sql but supporting the TPC-H and TPC-DS queries is going to be a tougher proposition.
Currently, only four TPC-H queries are parsed by the AsterixDB code, at least as I am now using it. These are queries 2, 17, 18, and 21. It will probably be easy to get them to convert to S-expressions. Of the ones that fail to get through the parser, many fail because of the date literal syntax, e.g.
date '1998-12-01'
which asterixDB SQL++ does not accept. There are 8 such (1,3,4,5,6,10,14,20). It seems like this kind of literal is NOT used in TPC-DS, which may use a different convention for date literals. If that's true, there might be a lexical fixup that would work. The remaining 10 failures have various causes, which will have to be tracked down.
There are 33 TPC-DS queries that get through the asterixDB parser (as opposed to 94 that got through Presto), but these are without using any of the lexical fixups that I used to get to 94 with presto. My guess is that asterixDB will require a different set of lexical fixups but perhaps some will be the same. AsterixDB uses JavaCC rather than ANTLR, so teasing apart the lexer and parser, as I did before, will have to be done differently.
The four TPC-H queries that get through AsterixDB unscathed now get converted to S-expressions identical to what the Presto-based version does. This is in addition to the small examples in samples/sql.
Hah. In $A/asterixdb/asterix-benchmark/src/main/resources/benchmarks/tpch/queries
are edited versions of the TPC-H queries which apparently work (haven't looked into it in detail). I don't know how heavily edited the queries are or whether the changes invalidate them as benchmarks.
When I resume work on 5/15 I will use these versions along with the "vanilla" originals to guide me. Meanwhile, I'm off for a week with no computer.
It appears that what AsterixDB has in that TPC-H queries directory are queries semantically equivalent to the ones we generate from the templates but different in a number of details. In some cases, they just take advantage of some SQL++ features. For dates and date intervals, it appears that they edit more heavily. Unfortunately, in the area of dates and intervals the actual SQL standard(s) are too weak (in addition to being too numerous) to be able to count on anything. But, AsterixDB does have support for dates and date intervals, leading me to believe that some judicious lexical fixups will result in our being able to support a larger number of TPC-H queries out of the box.
Whether or not I should be working hard to support the TPC-H and TPC-DS dialects of SQL is another matter. It appears that SQL++ is (1) a semantic superset of SQL, (2) a syntactic superset in most respects, but (3) perhaps not an exact syntactic superset. To the extent that it deviates from exact syntactic incorporation, I suspect the changes are good things, not to be fought against. For example, the expression is a top level production in SQL++, and the language generally appears to be more compositional than SQL.
Notwithstanding the previous, I tend to be stubborn when I get some beast by the tail, so I'm going to persist a bit longer in the exercise of seeing how much of what we supported via Presto can also be supported via AsterixDB/SQL++.
Current scorecard for TPC-H:
Queries 1, 2, 3, 4, 5, 6, 10, 14, 17., 18, 20, and 21 (12 queries) now successfully go through the new support producing the identical S-expression to what we got with the old support.
All remaining queries fail to parse under the AsterixDB SQL++ parser for one reason or another. That is, the new visitor handles all the cases that actually parse.
Queries 7, 8, and 9 fail on expressions to extract a component from a date. This should have been covered by the lexical fixups but I missed it ... this should be remedied shortly.
Queries 12, and 19 currently fail on something to do with the 'in' clause. I will investigate.
Query 15 fails because SQL++ appears to lack a "create view" statement in its own right. As I said in the previous comment, SQL++ appears to be a semantic superset of SQL and "close to" a syntactic superset, but the creation of "views" (or collections bound to variables) is done differently. I will study the matter.
Query 16 fails over something involving count(distinct ...), which probably has an alternate form.
The remaining queries (11, 13, 22) fail for reasons too mysterious at this point to characterize.
We now support TPCH queries 7-9 in addition to the 12 previously supported. So, we are now at 15 supported, 7 not. The problem cases remain 11, 12, 13, 15, 16, 19, and 22.
We now support TPC-H queries 12 and 19 in addition to the 15 previously supported. So the current score is 17 yes, 5 no. The latest lexical fixup demonstrates how SQL++ improves on SQL without necessarily being strict syntactic superset. In standard SQL, a parenthesized list of expressions following the 'in' keyword is special and denotes a list semantically, but such lists are not general first-class expressions in standard SQL. SQL++ has a first-class list constructor, but it uses square brackets (not surprisingly, to avoid ambiguities and for "JSON-friendliness") and so 'in' is just an ordinary infix binary operator between two expressions.
I have fixups that enable queries 11 and 22, and one in progress that should enable 16. Once that is pushed we will have 20 TPC-H queries working out of the 22, leaving us with the two hardest cases.
Query 15 has a 'create view' in it which needs to be replaced with either a 'with' or a 'let'. This might require a fair amount of thinking to get right at the lexical level.
Query 13 performs an aliasing of a relation that assigns names not only to the relation but to its columns in a single 'as' clause. SQL++ does not appear to support this and I'm not sure what the equivalent should be.
But, even without these, the last few lexical fixups that I've added have exposed the potential silliness of carrying this exercise all the way to its conclusion. The fixups work with TPC-H and would tend to work "usually" but are vulnerable to outlying cases because they are operating at the level of the lexical grammar, with (probably incomplete) local approximations of the syntactic grammar in a "flattened" form (that is, with no attempt to construct an AST). In fact, SQL++ is not a true syntactic superset of SQL so expecting it to parse all possible SQL queries is counterproductive.
After the next push, I'd be inclined to declare victory on the current experiment and revise the task list with the understanding that SQL and SQL++ are different languages with a highly overlapping core and focus on a "complete" (enough for our purposes) encoding of SQL++ in S-expression form which is NOT expected to cover as much of SQL as the Presto-based encoder. How this distinction is handled in the driver and internally to the Coq/OCaml code will require some discussion.
I've updated the task list and I've checked off some items. The plan is now to treat SQL++ as a distinct language, but, until I've figured out how to do that (or gotten some guidance on the subject) I'll continue to blur the distinction for a while. The key point is that an AsterixDB/SQL++ based encoding and a Presto-based standard SQL encoding will live side-by-side, and I'll start structuring the code with that in mind.
The restructuring of the previous comment is largely complete in the Java code (there is, as yet, no separation between SQL and SQL++ as source languages at the level of the qcert command). I won't declare victory until the languages at least "appear" to be first-class and separate, although they will probably continue to merge somewhere down the toolchain until I have more complete parsing of SQL++.
Duplicating my results is now easier than before (because I've created the semi-familiar getDependencies.xml
script) but still requires eclipse. I'll add a Makefile soon and also improve the top-level test main program so you can get a sense of where we stand without reading all the code.
There is now a Makefile to build the code but not yet one to run tests.
I do run tests (under Eclilpse; Makefile coming soon), and I've started using examples that come with the SQL++ distro. One such is
$A/asterixdb/asterix-app/src/test/resources/parserts/queries_sqlpp
where there are 50 miscellaneous unit tests related to the parser. Three of these I discarded because they exercise the syntax to create and delete databases and types rather than query syntax so I consider them out of scope. This leaves 47, all of which get through the parser (of course) but I am using them to drive "completeness" of the S-expression encoder. I am now covering 27 of the 47. Of course, even when I get to 47 out of 47, new clauses will be needed in AstsToSExp.ml in order to properly consume the S-expressions and even then there will be no translation to NRA until those are filled in also.
I think the point at which I start adding to AstsToSexp.ml would be the point at which the two languages need to fork internally with common subfunctions for the overlapping subsets. I will need some advice once I get to that point.
All 47 unit tests from the set now "encode" but I have no idea whether it is correct (even whether parentheses balance). Not sure whether the next step is to push what I have so far through AstsToSExp or whether to continue to assess and widen coverage.
All 47 tests now get through the S-expression grammar in qcert but fail (expectedly) when the SQL translation from sexp to AST is attempted. That failure can be explained as due to the presence of new constructs that are in SQL++ but not SQL.
The ability of those other than me to actually run these tests is almost there but might still require some handholding. I will finish the automation on the next push.
I am also inclined to spend a little more time on coverage of SQL++ by the encoder so that I can have a good overview of where the two languages overlap and diverge before attempting to split them within qcert. I don't think it would be a good idea to start hacking AstsToSExp.ml until they are properly split because that is likely to regress our support for SQL.
At this point, coverage by the encoder is probably adequate. I'm now using 69 unit tests (the 47 from before plus the 22 SQL++ translations of the TPC-H suite. Note that we previously supported 20 out 22 of these in their original form (through lexical fixups) and these produced the identical S-expressions as Presto. But, the SQL++ versions differ just enough that only q15 gets all the way through the compiler.
The number of "not implemented" cases had gotten small enough that I went through manually and found that most of them could be labelled "not supported" (that is, we probably won't ever support them) because they are nothing to do with queries. They are 'create', 'drop' and mutating operations on the database. Of the remaining 'not implemented' cases it appears that most of them won't ever occur (because they involve nodes that the parser does not currently create and that are not documented). I fixed an omission that caused 'with' and 'let' clauses to be ignored.
Currently all 69 produce S-expression forms that qcert can 'parse' (they conform to the general s-expression rules) but that the functions in AstToSExp.ml can't consume (yet).
I've create a branch (jsa-sqlpp). It may not appear until my first push on it but I don't plan to make many changes in the master for a while, since the next steps are potentially disruptive.
The plan is to clone the language SQL (first in Coq, then in OCaml) as a new language SQLPP which will start out with "the same" AST (simply renamed) and either no translations at all or a clone of the SQL translation to NRA. This will allow me to separate two languages all the way out to the test cases while initially SQL++ is no further forward than it was.
Next, I'll elaborate the Coq AST for SQL++ to be complete and fill in the translation from S-expression form to that AST. This will simply move the first point of failure to the translation steps. At that point we'll be ready to consider the best translation from SQL++ to our ILs. Yes, that's when it starts to get hard and when I'll start to need help.
On the branch jsa-sqlpp, the qcert command supports -source sql++
and the tests driven from the Makefile in sqlppParser now exercise that path. The behavior is (at the moment) almost the same whether you specify -source sql
or -source sql++
. However, the languages are now distinct internally. Also, I decided not to clone the sql->nraenv translator for SQL++ and instead provided a translator for sql++->nraenv that just returns a "non-implemented" error. Notwithstanding the previous, the compiler driver believes that the translation exists and all its proofs seem to work on that assumption.
As noted previously, the AST for SQL++ at the Coq level is just a renamed clone of the SQL one. My intent is to start augmenting it so that it can mirror the Java level AST at least up to what is exercised by the existing tests.
It could be argued that this is not "the right" approach because the derivation of SQL++ from SQL did not follow the path of adding syntactic constructs to SQL (see https://arxiv.org/abs/1405.3631). So, what we'll have by accreting constructs on top of a SQL grammar is something a lot messier than we would get a from-scratch SQL++ grammar, which would distinguish an expression-oriented core from a sugaring layer that brings the syntax closer to SQL 92. The semantics of the core are clearer and the translation to NRA should be more obvious.
But I defend the approach taken as follows:
Consequently, I think it is a virtue that the Coq AST will end up mirroring the surface syntax as parsed by the Java code, although it might turn out to be important to translate this into pure core SQL++ before going from there to NRAenv.
It's been a while since I've pushed changes, but I'm actually still working on this. I ended up completely disagreeing with the analysis of the previous comment once I really tried to do it. SInce the SQL AST is really not a good match to the SQL++ semantics, if you try augmenting it to match the SQL++ syntax, you end up with a mess which is not going to be workable for any real translations.
So I started over defining a SQL++ AST in Coq and the necessary veneers to build in in AstsToSExp. This AST does not clearly distinguish sugaring constructs from core constructs but it follows the expository grammar of SQL++ as contained in its documentation. The AsterixDB Java AST is a good structural match to this, but I will need to largely re-invent the S-expression encoding and rewrite the visitor. I think that will end up being closer to what we really want, though.
The key point is that "expression" and not "query" is the top non-terminal. A "query" is just one kind of expression (though it still has clauses, some of them optional, which makes it a pretty complex expression).
There is now a tentative Coq AST for SQL++ that is complete for the currently documented language grammar (at least as implemented by AsterixDB).
It does compile but might be grossly bad in other ways, since I am such a Coq novice.
Since I am now working "up" from the Coq AST instead of "down" from the front end, the Sexp translation is disabled and the current Java visitor is obsolete. However, the QLib layer is at least tentatively filled in. Some details, like size and depth functions, are incompletely implemented or stubbed out for the moment.
I think a new milestone would be to get a round trip coded from SQL++ to AsterixDB AST to S-expression to Coq AST and back to S-expression, also getting the size and depth stats . That's the best test I can do until actual translation to NRAenv is attempted, which I have not really thought carefully about yet.
The encoding visitor now uses a complete S-expression encoding of those parts of the AsterixDB AST that we intend to translate into Coq. The comments indicate the corresponding SQL++ (documented) grammar, the Coq AST clause, and the S-expression encoding rule for each visit method. In some cases, where the AsterixDB AST doesn't strictly follow the documented grammar this is noted as well.
Before starting the new SEXP->Coq AST translation, I paused to sanity check the current visitor.
There is a small "regression" relative to what worked before: we are now rejecting function declarations instead of encoding them, since I really don't know whether to attempt inlining them in the Java code, inlining them in OCaml, or inlining them them in Coq. I guess in an interpretive Coq environment they could be translated naively as functions and applied lazily, instead of inlining them at all. But I don't believe that will work if all the Coq code has to be statically extracted.
I am going to change the front end so that instead of passing through statements that we can't handle in the visitor it just elides them (so that the visitor is left just visiting query statements). This will allow more tests to be used as is. If the elided statements are function declarations the result will be to shift the problem to one of an unrecognized function name. That seems like the right balance.
There is now a s-expression translation from the S-expression encoding produced by the new visitor to the Coq AST for SQL++. All the tests now "work" (hah) but this is because I have
Definition sqlpp_to_nraenv (q:sqlpp) : nraenv := NRAEnvConst dunit.
as a vacuous placeholder in the compiler driver. Not surprisingly, all the generated Javascript looks like
function query(constants) {
return null;
}
One might argue that building a real translation to nraenv would be the next step but that seems to me to involve a careful exploration of SQL++ semantics which I might not do on my own. The next step for me, rather, is to write a sqlpp_to_sexp function so that the completeness of the Coq AST can be evaluated via round-tripping.
As promised, I am now testing the completeness of the Coq AST and its equivalence to the AsterixDB one by doing round-trips via s-expressions. All tests are passing. So, I think I can declare victory on the idea of at least having complete and accurate syntactic support for SQL++. There is still no NRAenv translation, so the support is a dead-end until one is added.
Meanwhile, the fact that size and depth functions are stubbed out is now a more visible omission (since everything else but the NRAenv translation is now there). So, I may try writing those before I disappear for a week at Bennington. When I come back from Bennington I will look at either getting help on the NRAenv translation or figuring it out on my own.
I didn't quite finish size and depth before Bennington but I just pushed them now. So, all syntactic support for SQL++ is in place, including the size and depth auxiliaries. The task now is simply that of coding a translation to NRAenv. Not that that is a small task.
SQL++ a relatively recent effort at providing a super-set of SQL which includes support for nested queries, and JSON data. It seems to be a an active project, with a relatively well documented semantics. Some aspects probably go beyond what we can currently support (the 'semistructured' part which ties into how flexible the input data can be), but possibly a good alternative syntax (compared to OQL) as a front end for our compiler.