trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
10.27k stars 2.96k forks source link

Add support for case sensitive identifiers #17

Open martint opened 5 years ago

martint commented 5 years ago
<delimited identifier> ::=
  <double quote> <delimited identifier body> <double quote>

<delimited identifier body> ::=  <delimited identifier part>...
<delimited identifier part> ::=
    <nondoublequote character>
  | <doublequote symbol>

<Unicode delimited identifier> ::=
  U <ampersand> <double quote> <Unicode delimiter body> <double quote>
      <Unicode escape specifier>
<Unicode escape specifier> ::=
  [ UESCAPE <quote> <Unicode escape character> <quote> ]
<Unicode delimiter body> ::=
  <Unicode identifier part>...
<Unicode identifier part> ::=
    <delimited identifier part>
  | <Unicode escape value>
24) For every <identifier body> IB there is exactly one corresponding case-normal form CNF. CNF is an <identifier body> derived from IB as follows:
Let n be the number of characters in IB. For i ranging from 1 (one) to n, the i-th character Mi of IB is transliterated into the corresponding character 
or characters of CNF as follows:
Case:
   a) If Mi is a lower case character or a title case character for which an equivalent upper case sequence U is de ned by Unicode, then let j be th
       e number of characters in U; the next j characters of CNF are U.
   b) Otherwise, the next character of CNF is Mi.
25) The case-normal form of the <identifier body> of a <regular identifier> is used for purposes such as and including determination of identifier 
      equivalence, representation in the Definition and Information Schemas, and representation in diagnostics areas.

...

27) Two <regular identifier>s are equivalent if the case-normal forms of their <identifier body>s, considered as the repetition of a <character string literal> 
that specifies a <character set specification> of SQL_IDENTIFIER and an implementation-defined collation IDC that is sensitive to case, compare equally 
according to the comparison rules in Subclause 8.2, “<comparison predicate>”.

28) A <regular identifier> and a <delimited identifier> are equivalent if the case-normal form of the <identifier body> of the <regular identifier> and the 
<delimited identifier body> of the <delimited identifier> (with all occurrences of <quote> replaced by <quote symbol> and all occurrences of 
<doublequote symbol> replaced by <double quote>), considered as the repetition of a <character string literal> that specifies a <character set specification>
 of SQL_IDENTIFIER and IDC, compare equally according to the comparison rules in Subclause 8.2, “<comparison predicate>”.

29) Two<delimited identifier>s are equivalent if their <delimited identifierbody>s,considered as the repetition of a <character string literal> that specifies
 a <character set specification> of SQL_IDENTIFIER and an implementation-defined collation that is sensitive to case, compare equally according to the
 comparison rules in Subclause 8.2, “<comparison predicate>”.

30) Two <Unicode delimited identifier>s are equivalent if their <Unicode delimiter body>s, considered as the repetition of a <character string literal> that
 specifies a <character set specification> of SQL_IDENTIFIER and an implementation-defined collation that is sensitive to case, compare equally according
 to the comparison rules in Subclause 8.2, “<comparison predicate>”.

31) A <Unicode delimited identifier> and a <delimited identifier> are equivalent if their <Unicode delimiter body> and <delimited identifier body>, 
respectively, each considered as the repetition of a <character string literal> that specifies a <character set specification> of SQL_IDENTIFIER and 
an implementation-defined collation that is sensitive to case, compare equally according to the comparison rules in Subclause 8.2, “<comparison predicate>”.

32) A <regular identifier> and a <Unicode delimited identifier> are equivalent if the case-normal form of the <identifier body> of the <regular identifier> 
and the <Unicode delimiter body> of the <Unicode delimited identifier> considered as the repetition of a <character string literal>, each specifying a
 <character set specification> of SQL_IDENTIFIER and an implementation-defined collation that is sensitive to case, compare equally according to the 
comparison rules in Subclause 8.2, “<comparison predicate>”.

The approach and design is being captured here: https://github.com/prestosql/presto/wiki/Delimited-Identifiers

martint commented 5 years ago

We need to decide on what (if anything) we want to do for backwards compatibility around this change. There are some points for consideration:

So, some options to approach this change:

  1. Add new parallel "legacy" APIs by copying the existing APIs and mark the current ones as deprecated. Connectors that don't want to support the new API can just do a simple rename to use the newly introduced "legacy" versions. After a some grace period, replace the current APIs with the variant that understands delimited names. Eventually, remove all the legacy APIs.
  2. Add new APIs and mark the current ones as deprecated. Connectors can migrate to new APIs at will. Eventually, the current APIs get removed.
  3. Forgo backward compatibility and update all APIs to support delimited names.

Option 1) satisfies all the considerations listed above, but requires connectors to make a change at some point, even to keep the current behavior. Updating the connector implementations to understand delimited names is a mostly mechanical affair. So, updating connectors to work against the newly introduced legacy APIs is a waste of effort.

Option 2) leaves us with the poor choice of names for the long term.

Option 3), well, doesn't give us backward compatibility. On the other hand, since the change is mostly mechanical, especially if the connector wants to keep existing semantics, it seems like this wouldn't be too much of a burden.

The more I think about it, the more I lean towards option 3. Thoughts? Any other alternatives I haven't listed?

findepi commented 5 years ago

Option 2) leaves us with the poor choice of names for the long term.

Note that you can extend option 2 -- after the transition period, when old APIs are removed, you can rename new APIs to have better method names.

Updating the connector implementations to understand delimited names is a mostly mechanical affair.

@martint i think JDBC connectors require some thought here. Consider SHOW TABLES FROM schema example. If schema is delimited, this is actually easy, java.sql.DatabaseMetaData#getTables should do the trick. If schema is non-delimited, you need to find the actual schema name (case-insensitively) before calling java.sql.DatabaseMetaData#getTables. (We actually have something like that implemented, and can share this)

For this reason, I think we need some kind of transition period. Although such a transition might be implemented in each connector individually (under option 3), i mean leaning towards option 1.

martint commented 5 years ago

One downside of one is that during the transition period we'd have an API that's marked as deprecated and one that's considered legacy. Which one should new connector writers use? Clearly, not the deprecated one. But using the legacy one feels silly ("why can't I use a non-legacy one?") and forces them to adjust their usages later.

On the other hand, as you pointed out, with option 3 every connector has the choice of migrating to support delimited identifiers or preserve the current semantics (treat everything as non-delimited as today) and deal with it later.

findepi commented 5 years ago

@Praveen2112 @martint @kokosing

i was thinking about this today and came to conclusion that there are these concepts:

  1. object name (where object is a catalog, schema, table, view, role, function, user, etc.)
  2. identifier name in a query (can be quoted or not)

We need to distinguish them, because they are different.

So far #354 apparently uses one class (Name) for both these concepts (symmetrically) and uses Name#equals to match them. I would rather see these concepts distinguished in the code.

Proposed representation

Examples:

martint commented 5 years ago

The spec talks about object names being identifiers. This is pervasive across the spec document:

Let T be the table defined by the <table definition> TD. Let TN be the <table name> simply contained in TD.

A table descriptor TDS is created that describes T. TDS includes: a) The table name TN.

A base table descriptor describes a base table. In addition to the components of every table descriptor, a base table descriptor includes:

  • The name of the base table.

The grantee is <authorization identifier> A.

The row type RT of the table T defined by the <table definition> is the set of pairs (<field name>, <data type>) where <field name> is the name of a column C of T and <data type> is the declared type of C.

However, it then describes the domain of the columns that represent object names in the INFORMATION_SCHEMA and DEFINITION_SCHEMA tables using:

SQL_IDENTIFIER domain

Define a domain that contains all valid <identifier body>s and <delimited identifier body>s.

CREATE DOMAIN SQL_IDENTIFIER AS CHARACTER VARYING (L) CHARACTER SET SQL_IDENTIFIER

This domain specifies all variable-length character values that conform to the rules for formation and representation of an SQL <identifier body> or an SQL <delimited identifier body>

And:

The representation of an <identifier> in the base tables and views of the Information Schema is by a character string corresponding to its <identifier body> (in the case of a <regular identifier>) or its <delimited identifier body> (in the case of a ). Within this character string, any lower-case letter appearing in a <regular identifier> is replaced by the equivalent upper-case letter, and any <doublequote symbol> appearing in a <delimited identifier body> is replaced by a <double quote>.

As an example, this is how the SQL_IDENTIFIER domain is used in the definition of the CATALOG_NAME base table in DEFINITION_SCHEMA:

CREATE TABLE CATALOG_NAME (
    CATALOG_NAME INFORMATION_SCHEMA.SQL_IDENTIFIER,
    CONSTRAINT CATALOG_NAME_PRIMARY_KEY
    PRIMARY KEY ( CATALOG_NAME )
)

This means that when presented in the contents of INFORMATION_SCHEMA and DEFINITION_SCHEMA tables, identifiers lose any indicators of whether they were delimited or not when the objects were created. This is ok, though, since the upper-case normalization above guarantees that if you used such identifiers in a query, they'd match the corresponding object even if they are surrounded with quotes.

It also hints to the fact that only the body of the identifiers is stored, but it's not explicit about it. The following statement does nothing to clarify this:

Where an <actual identifier> has multiple forms that are equal according to the rules of Subclause 8.2, “<comparison predicate>”, in ISO/IEC 9075-2, the form stored is that encountered at definition time.

Unfortunately, it doesn't clarify whether the "form stored" includes the quotes, if present.

The book "SQL: 1999: Understanding Relational Language Components" by Jim Melton, one of the editors of the SQL spec says:

In effect, SQL:1999 changes all lowercase letters in regular identifiers to their uppercase-equivalent letters. This is especially important for identifiers that are stored as data values in the views of the Information Schema. [...] When a delimited identifier is stored into the views of the Information Schema, the double quotes are not stored, but all the other characters are, just as they appeared in the delimited identifier. Therefore, the regular identifier TITLES and the delimited identifier "TITLES" are stored identically and are therefore completely equivalent.

PostgreSQL behaves exactly like that, except that it normalizes to lower-case before storing, opposite of what the spec describes.

So, I think you're partly right in that, in theory, all we need when connectors present a name to the engine is a string. However there are some things to consider:

findepi commented 5 years ago

@martint thanks for looking into this.

When connectors return a name to the engine, do they first need to normalize it according to SQL rules?

For existing objects, names should be returned as-is. If the remote storage is case insensitive (like Hive is, right?), we could add some normalization here. But for JDBC connectors we generally shouldn't. Table names are case-sensitive in eg Postgres, MySQL, SQL Server or Oracle. It is only "less convenient" (requires "-delimiting) to create tables with case different than the default.

NameSelector may not be an appropriate concept for all usages. When creating a table, how do we convey the table and column names to the connector?

Good point. When creating a table, we want to pass "identifier name from a query" to the connector. We can pass the value normalized to lower- (or upper-) -case unless it was "-delimited. (Even if we normalize to upper when talking to Postgres connector, it will still normalize to lower because the name is not delimited.) Plus, we need to retain information whether it was delimited.

My envisioned NameSelector(String name, boolean delimited) fits here perfectly... except for its name.

So:

Praveen2112 commented 5 years ago

@findepi Thanks for your insights.

let's call this concept SqlIdentifier(String name, boolean delimited)

But currently Name object also captures the same right i.e it maintains the rawName and the info whether it is delimited or not

kokosing commented 4 years ago

Just a reminder, that while fixing this issue we should also make sure that values returned from system catalog and information_schema have proper data and that predicate pushdown works for them. For example see: https://github.com/prestosql/presto/pull/4965

kokosing commented 2 years ago

Another thing to remember is to make sure that identifiers in event listeners are propagated in proper casing.

romanzdk commented 1 year ago

Is there any solution to the case sensitive table names in PostgreSQL? (we have table names in postgres like SomeTableName). I tried adding case-insensitive-name-matching=true to the catalog config but it did not help.

kokosing commented 1 year ago

(we have table names in postgres like SomeTableName). I tried adding case-insensitive-name-matching=true to the catalog config but it did not help.

Your PostgreSQL table SomeTableName should be exposed in Trino using sometablename name. Did it work? If not please file a bug.

romanzdk commented 1 year ago

Hmm.. That's really strange. Now when I try SELECT * FROM sometablename it works. It even works with combination like SomeTableName "SomeTableName" and "sometablename".. I swear yesterday none of them worked.. I do not understand.. But.. It works now.. Thanks!

kokosing commented 1 year ago

This issue is also exposed in SET ROLE where role names are lower cased and while the authorization system could be case sensitive.

prrvchr commented 4 months ago

The workaround is to only create tables in lower case but do you have anything planned on your side to resolve this problem?

shantanu-dahiya commented 2 weeks ago

Does the contributing team have plans to pick this up anytime soon?