Open martint opened 5 years ago
We need to decide on what (if anything) we want to do for backwards compatibility around this change. There are some points for consideration:
String
to a wrapper class (Name
?) that contains the value and whether it's delimited or not.List<String>
(e.g., List<String> listSchemaNames(ConnectorSession session)
)SchemaTable
has a String getCatalog()
method. The new method would need to be named differently, such as getOriginalCatalog()
if we wanted to preserve backward compatibility.So, some options to approach this change:
Option 1) satisfies all the considerations listed above, but requires connectors to make a change at some point, even to keep the current behavior. Updating the connector implementations to understand delimited names is a mostly mechanical affair. So, updating connectors to work against the newly introduced legacy APIs is a waste of effort.
Option 2) leaves us with the poor choice of names for the long term.
Option 3), well, doesn't give us backward compatibility. On the other hand, since the change is mostly mechanical, especially if the connector wants to keep existing semantics, it seems like this wouldn't be too much of a burden.
The more I think about it, the more I lean towards option 3. Thoughts? Any other alternatives I haven't listed?
Option 2) leaves us with the poor choice of names for the long term.
Note that you can extend option 2 -- after the transition period, when old APIs are removed, you can rename new APIs to have better method names.
Updating the connector implementations to understand delimited names is a mostly mechanical affair.
@martint i think JDBC connectors require some thought here. Consider SHOW TABLES FROM schema
example.
If schema
is delimited, this is actually easy, java.sql.DatabaseMetaData#getTables
should do the trick.
If schema
is non-delimited, you need to find the actual schema name (case-insensitively) before calling java.sql.DatabaseMetaData#getTables
. (We actually have something like that implemented, and can share this)
For this reason, I think we need some kind of transition period. Although such a transition might be implemented in each connector individually (under option 3), i mean leaning towards option 1.
One downside of one is that during the transition period we'd have an API that's marked as deprecated and one that's considered legacy. Which one should new connector writers use? Clearly, not the deprecated one. But using the legacy one feels silly ("why can't I use a non-legacy one?") and forces them to adjust their usages later.
On the other hand, as you pointed out, with option 3 every connector has the choice of migrating to support delimited identifiers or preserve the current semantics (treat everything as non-delimited as today) and deal with it later.
@Praveen2112 @martint @kokosing
i was thinking about this today and came to conclusion that there are these concepts:
We need to distinguish them, because they are different.
abc
is just the three lower-case letters; a table name cannot decide whether it should be matched case-insensitively or not"abc"
matches table name abc
, and query's abc
matches table names abc
, aBC
, etc.So far #354 apparently uses one class (Name
) for both these concepts (symmetrically) and uses Name#equals
to match them.
I would rather see these concepts distinguished in the code.
String
(or simple String wrapper class Name(String name)
) to represent object namesNameSelector(String name, boolean delimited)
to represent identifier name in a queryNameSelector#matches(String objectName)
(or NameSelector#matches(Name objectName)
) to match identifier name in a query against object nameExamples:
ConnectorMetadata#schemaExists
will take NameSelector
ConnectorMetadata#listSchemaNames
will return List<String>
(or List<Name>
)ConnectorMetadata#listTables
will take String
(or Name
) -- i.e. resolved, actual schema nameThe spec talks about object names being identifiers. This is pervasive across the spec document:
Let
T
be the table defined by the<table definition>
TD
. LetTN
be the<table name>
simply contained inTD
.A table descriptor
TDS
is created that describesT
.TDS
includes: a) The table nameTN
.A base table descriptor describes a base table. In addition to the components of every table descriptor, a base table descriptor includes:
- The name of the base table.
The grantee is
<authorization identifier>
A.The row type
RT
of the tableT
defined by the<table definition>
is the set of pairs (<field name>
,<data type>
) where<field name>
is the name of a columnC
ofT
and<data type>
is the declared type ofC
.
However, it then describes the domain of the columns that represent object names in the INFORMATION_SCHEMA
and DEFINITION_SCHEMA
tables using:
SQL_IDENTIFIER domain
Define a domain that contains all valid
<identifier body>
s and<delimited identifier body>
s.
CREATE DOMAIN SQL_IDENTIFIER AS CHARACTER VARYING (L) CHARACTER SET SQL_IDENTIFIER
This domain specifies all variable-length character values that conform to the rules for formation and representation of an SQL
<identifier body>
or an SQL<delimited identifier body>
And:
The representation of an
<identifier>
in the base tables and views of the Information Schema is by a character string corresponding to its<identifier body>
(in the case of a<regular identifier>
) or its<delimited identifier body>
(in the case of a). Within this character string, any lower-case letter appearing in a <regular identifier>
is replaced by the equivalent upper-case letter, and any<doublequote symbol>
appearing in a<delimited identifier body>
is replaced by a<double quote>
.
As an example, this is how the SQL_IDENTIFIER domain is used in the definition of the CATALOG_NAME
base table in DEFINITION_SCHEMA
:
CREATE TABLE CATALOG_NAME (
CATALOG_NAME INFORMATION_SCHEMA.SQL_IDENTIFIER,
CONSTRAINT CATALOG_NAME_PRIMARY_KEY
PRIMARY KEY ( CATALOG_NAME )
)
This means that when presented in the contents of INFORMATION_SCHEMA
and DEFINITION_SCHEMA
tables, identifiers lose any indicators of whether they were delimited or not when the objects were created. This is ok, though, since the upper-case normalization above guarantees that if you used such identifiers in a query, they'd match the corresponding object even if they are surrounded with quotes.
It also hints to the fact that only the body of the identifiers is stored, but it's not explicit about it. The following statement does nothing to clarify this:
Where an
<actual identifier>
has multiple forms that are equal according to the rules of Subclause 8.2, “<comparison predicate>
”, in ISO/IEC 9075-2, the form stored is that encountered at definition time.
Unfortunately, it doesn't clarify whether the "form stored" includes the quotes, if present.
The book "SQL: 1999: Understanding Relational Language Components" by Jim Melton, one of the editors of the SQL spec says:
In effect, SQL:1999 changes all lowercase letters in regular identifiers to their uppercase-equivalent letters. This is especially important for identifiers that are stored as data values in the views of the Information Schema. [...] When a delimited identifier is stored into the views of the Information Schema, the double quotes are not stored, but all the other characters are, just as they appeared in the delimited identifier. Therefore, the regular identifier
TITLES
and the delimited identifier"TITLES"
are stored identically and are therefore completely equivalent.
PostgreSQL behaves exactly like that, except that it normalizes to lower-case before storing, opposite of what the spec describes.
So, I think you're partly right in that, in theory, all we need when connectors present a name to the engine is a string. However there are some things to consider:
NameSelector
may not be an appropriate concept for all usages. When creating a table, how do we convey the table and column names to the connector? We could use a string and normalize the name according to the rules in the SQL spec before providing it to connectors, but that would break connectors that don't play by the SQL rules. We'd either have to 1) add "delimited" to name, which brings us back to square one 2) add yet another entity to represent a "sql identifier".@martint thanks for looking into this.
When connectors return a name to the engine, do they first need to normalize it according to SQL rules?
For existing objects, names should be returned as-is.
If the remote storage is case insensitive (like Hive is, right?), we could add some normalization here.
But for JDBC connectors we generally shouldn't.
Table names are case-sensitive in eg Postgres, MySQL, SQL Server or Oracle. It is only "less convenient" (requires "
-delimiting) to create tables with case different than the default.
NameSelector
may not be an appropriate concept for all usages. When creating a table, how do we convey the table and column names to the connector?
Good point. When creating a table, we want to pass "identifier name from a query" to the connector.
We can pass the value normalized to lower- (or upper-) -case unless it was "
-delimited.
(Even if we normalize to upper when talking to Postgres connector, it will still normalize to lower because the name is not delimited.)
Plus, we need to retain information whether it was delimited.
My envisioned NameSelector(String name, boolean delimited)
fits here perfectly... except for its name.
So:
SqlIdentifier(String name, boolean delimited)
String
(or Name
, or ObjectName
) for names of existing objects@findepi Thanks for your insights.
let's call this concept SqlIdentifier(String name, boolean delimited)
But currently Name
object also captures the same right i.e it maintains the rawName and the info whether it is delimited or not
Just a reminder, that while fixing this issue we should also make sure that values returned from system
catalog and information_schema
have proper data and that predicate pushdown works for them. For example see: https://github.com/prestosql/presto/pull/4965
Another thing to remember is to make sure that identifiers in event listeners are propagated in proper casing.
Is there any solution to the case sensitive table names in PostgreSQL? (we have table names in postgres like SomeTableName
). I tried adding case-insensitive-name-matching=true
to the catalog config but it did not help.
(we have table names in postgres like SomeTableName). I tried adding case-insensitive-name-matching=true to the catalog config but it did not help.
Your PostgreSQL table SomeTableName
should be exposed in Trino using sometablename
name. Did it work? If not please file a bug.
Hmm.. That's really strange. Now when I try SELECT * FROM sometablename
it works. It even works with combination like SomeTableName
"SomeTableName"
and "sometablename"
.. I swear yesterday none of them worked.. I do not understand.. But.. It works now.. Thanks!
This issue is also exposed in SET ROLE
where role names are lower cased and while the authorization system could be case sensitive.
The workaround is to only create tables in lower case but do you have anything planned on your side to resolve this problem?
Does the contributing team have plans to pick this up anytime soon?
The approach and design is being captured here: https://github.com/prestosql/presto/wiki/Delimited-Identifiers