SQL++ to NRAEnv Translation

joshuaauerbachwatson commented 6 years ago

This is a sub-issue of issue #10.

I am using it to record semantic issues that come up in writing the sqlpp_to_nraenv_top function and its subfunctions.

joshuaauerbachwatson commented 6 years ago

We current define sqlpp as list sqlpp_expr but this is actually counterproductive. It is modeled on what we did in SQL, where there are additional statement types besides queries ("create view" and "drop view"). Note that in SQLtoNRAenv there is the expectation that there can be multiple statements, but (generally) only one query statement. If there is more than one, only the last affects the result. The sexp translation for SQL also converts 'with' clauses into separate 'create view' statements.

Note that there is no create view statement in SQL++ but there is a much richer repertoire of 'with' and 'let' clauses that can bind variables to collections (or to anything, actually) with varied scopes.

joshuaauerbachwatson commented 6 years ago

I've simplified by redefining sqlpp as sqlpp_expr rather than a list. This has been drilled up through the front-end logic. Still no actual contents in the NRAenv translator.

joshuaauerbachwatson commented 6 years ago

Some of the more obvious clauses in the SQLPPtoNRAEnv are now populated. The biggest issue (select) is still pending. Also, even in the case of simpler expressions there are many open issues, indicated as TODOs in the code.

The following is not prioritized, just listing the issues.

SQL++ has two distinct values for unknown data: null (generally occurring when a JSON object has an explicit null value) and unknown (generally occurring when a field of a JSON object is not present but is referenced). In our data model, we can represent null but we don't have a distinct value for unknown. Also, we don't contemplate that dunit can take on any type, the way null and unknown can in SQL++. For object references in the past we have used optional types to model things that can be null. Nothing precludes using sum types more generally to model things that can have two different kinds of absence as well as being present with a value. However, this will complicate translation in a fairly pervasive way.
SQL++ has both bags and (ordered) lists but our data model has only bags. We actually use Coq lists to implement bags and I think in some places we cheat and take advantage of order. But, to be correct, we really need two distinct first-class collection kinds.
SQL++ has a first-class exponentiation operator. We have no equivalent base operator but we do have exponentiation in our floating point model part. This is probably not a serious issue but we do have to assume that floating point is present in order to translate SQL++ and we have to do whatever is needed to coerce to and from floating point in order to implement this operator.
The like operator in SQL++ is a binary operator between expressions; that is, the pattern does not have to be a literal but can be computed. Our internal like operator is unary and takes the pattern as a (string) parameter. We probably need a binary like operator either in the base or as a foreign operator.
SQL++ has a (JSON) object constructor in which both field names and field values are expressions. Our internal record type (the obvious translation target) requires static field names.

More generally, we probably won't fix all of these mismatches right away or even ever and so we need a clear statement of what features of SQL++ we don't support. Ideally, these usages will be filtered out with diagnostics in the front-end so the user doesn't get too confused.

joshuaauerbachwatson commented 6 years ago

We now have translations for many expressions. Still not implemented but planned:

quantified expressions (deferred until after query is done, since a lot of the mechanism of 'select where' can be reused)
query (the biggest part, about to be tackled)
exponent operator
indexing a list via a position
'index any'

I distinguish normal indexing from 'index any' because normal indexing will require solving the lists vs bags issue but 'index any' could be implemented without doing that. The semantics are a random selection from a collection (so order doesn't actually matter). On the other hand, 'index any' requires getting some form of randomness into Coq. I think it would be adequate to have a (not very good) pseudo-random function, which could be implemented in Coq, with the seed being a single global input to the computation that is then passed around and updated. A pain threading it through but otherwise straightforward.

Also, among things that are already translated, the following restrictions:

the second operator of 'like' and all field names in object constructions must be string literals. This is screened in the front end and the restriction is reflected in the Coq AST for SQL++
bag constructors and array constructors are translated identically, producing what are bags in our model (related to the absence of indexing)
equal and fuzzy equal are both translated as equal (I have not found a precise semantic definition of what fuzzy equal even means)
null, missing, and unknown are all translated as null and we do not use sum types to make comparisons type-correct, so (e.g.) a number might be compared with null directly
we are assuming variable references can simply be dot operators on ID, which means that all the 'let' and 'with' expressions in queries must concat bindings to ID and we are just assuming the scoping will all work (and are not using the environment as part of translation)

I would say the two largest issues requiring contributions from others are

distinguishing bags from lists
distinguishing null from missing and deciding how pervasively to use sum types in the translation

If the bag vs list distinction remains unaddressed for a while, I might consider ruling out normal indexing in the front end; I don't think we can rule out the 'array' constructor because it is probably used a lot rather than the bag constructor and often user's don't really care about the order. I think we just need to document what semantics we support.

If we are unable to address null vs missing and use of sum types for a while, we might also impose a restriction that rules out all forms of missing data (including null) but that will be perceived as a huge restriction eliminating some of the advantages of SQL++ for semi-structured JSON data.

joshuaauerbachwatson commented 6 years ago

Previous enumerations of progress and issues have left out the interesting case of function calls. The design I'm assuming is the following:

Function names are actually a fixed set governed by an "opt-in" list. Note that all user defined functions are illegal.
Legal function names and also correct number of arguments are known to the Java frontend and checked there so illegal function calls do not reach the Coq AST
In the Coq AST, this set is an enumeration (so function names are no longer strings). Pattern matching should be tighter and we are also guaranteed that there are the correct number of arguments (more subtle errors, such as type errors in the arguments, can still cause late failures).

I am partly along implementing this design. Functions that initially populate this set are

those specified directly in the SQL++ spec (the unrestricted aggregation functions of SQL++ such as coll_avg and array_avg and the SQL-92 compatible restricted aggregation functions such as avg which are provided as sugaring macros in SQL++).
those need to implement times and dates such as get_year
others used in test cases that seem like we ought to support them like substring

jeromesimeon commented 6 years ago

It would be nice to have a well-written mechanized semantics for SQL++. This would go hand-in-hand with a proof of semantic preservation for this translation (See #74).

jeromesimeon commented 6 years ago

I've just noticed the following Coq warning:

File "./coq/Translation/SQLPPtoNRAEnv.v", line 40, characters 0-6785:
Warning: Not a fully mutually defined
fixpoint
(sqlpp_to_nraenv depends on sqlpp_select_statement_to_nraenv but not
conversely).
Well-foundedness check may fail unexpectedly.
 [non-full-mutual,fixpoints]

It indicates that sqlpp_select_statement_to_nraenv could be a separate function. This may be a temporary thing, due to the current state of the translation, but I thought I would record it.

joshuaauerbachwatson commented 6 years ago

Commenting on the previous comment from @jeromesimeon: the sqlpp_select_statement_to_nraenv function is currently a highly incomplete stub but it is intended ultimately to be mutually recursive with sqlpp_to_nraenv (it will need to be). So, yes, it is a temporary thing. If the warning message is judged to be annoying, I could comment out the function entirely and stub out the call to it.

querycert / qcert

SQL++ to NRAEnv Translation #56