Revisit case sensitivity of identifiers

vgapeyev commented 1 year ago

In short: What is currently implemented for identifiers in partiql-lang-kotlin:

Is not what is specified in SQL;
Not specified or described anywhere;
The implementation is hard to follow and understand, as it is not encapsulated, but rather spread around and duplicated at use sites places.

Ideally, we implement what SQL prescribes, while taking care of properly abstracting it. This will surely be backward-incompatible, but everyone's sanity can be worth it.

SQL

In SQL, there are two syntactic variants of identifiers: regular identifiers like foo and delimited/quoted identifiers like "fOO".

A regular identifier is considered case-insensitive and has the fully-capitalized variant as the canonical form.
A delimited identifier is case-sensitive.
The regular vs delimited distinction is only lexical; after parsing and canonicalization, all identifiers are essentially strings. This is to say that foo, Foo, fOo, FOO and, notably, "FOO" are all the same identifier, canonically being FOO. This identifier is distinct from "foo", "Foo", "fOo", all of which are also distinct from each other.
The equivalence of identifiers is the equality of these canonicalized strings, in case-sensitive way.
Either lexical kind of an identifier (regular or delimited) can be used either for binding a variable (as in FROM or GROUP BY) or as a reference variable to something bound elsewhere.
Importantly, it is ok to name an object "FOO" and then refer to it as FOO or foo and, conversely, name an object gOO and then refer to it as "GOO".

partiql-lang-kotlin

In contrast, in the current implementation:

While there are lexical regular and delimited identifiers of the same lexical appearance as in SQL, there are three kinds of identifiers internally. The first distinction is based on whether it is an identifier used as a binder vs used as a reference.
For a binder, the case of characters in the lexical appearance is preserved semantically, while the fact whether the identifier was regular or delimited is discarded. The binder identifiers are used as keys for things like environments.
For a reference identifier, a delimited identifier becomes "an identifier with case-sensitive lookup", while a regular identifier becomes "an identifier with case-insensitive lookup". For the latter, the canonical representation is lower case.
The lookup of a reference identifier in an environment depends on whether it is case-sensitive or case-insensitive.
The case-sensitive lookup is straightforward, using case-sensitive string equality.
The case-insensitive lookup uses case-insensitive string equality. It can give an ambiguous result (reference Foo matching binders FOO, Foo, foo) -- apparently, identifiers that are equivalent in case-insensitive sense can bind to different entities, without shadowing.

[Disclaimer: the above is an attempt at a rational reconstruction of the design in place; no claim is made about its fidelity to the intent or to all intricacies of the implementation.]

Known history

Lower case was chosen as the canonical case in order to conform with PostgreSQL: https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-IDENTIFIERS

Conclusion

While the SQL's design is not an exhibit of simplicity, the approach in p-l-k is even more complicated. While the two seem to coincide in common usage (say, only non-delimited identifiers and no malicious introduction of ambiguous binders), the differences are significant and break SQL conformance claims.

johnedquinn commented 1 year ago

Adding some background from SQL-1999, section 20.1:

The representation of an \ in the base tables and views of the Information Schema is by a character string corresponding to its \ (in the case of a \) or its \ (in the case of a \). Within this character string, any lower-case letter appearing in a \ is replaced by the equivalent upper-case letter, and any \ appearing in a \ is replaced by a \. Where an \ has multiple forms that are equal according to the rules of Subclause 8.2, ‘‘\’’, the form stored is that encountered at definition time.

The specification opts for using uppercase

RCHowell commented 1 year ago

Related, we will need to reify the case-sensitivity of binders in the AST so that when we pretty-print we will know to quote or not to quote (in the case of keywords). However, the quotes should not matter for binders as we use that literal value regardless of quotes.

Today's behavior is what one would expect, that is SELECT x as a FROM T to output << { 'a': .. } ... >> rather than << { 'A': ... } ... >>

dlurton commented 1 year ago

TBH, I'm not sure where this is coming from. The SQL-92 specification AFAICT is clear about it in section 5.2:

14) Two <delimited identifier>s are equivalent if their <delimitedidentifier body>s (with all occurrences of <quote> replaced by and all occurrences of <doublequote symbol> replaced by <doublequote>), considered as the repetition of acharacter string literal> that specifies a <character set specification> of SQL_TEXT and an implementation-defined collation that is sensitive to case, compare equally according to the comparison rules in Subclause 8.2, "<comparison predicate>

RCHowell commented 2 months ago

@yliuuuu could you please provide an update here as to why #1519 was closed?

partiql / partiql-lang-kotlin