Open rdblue opened 5 years ago
While this sounds like an interesting feature, it's problematic for a number of reasons.
The SQL specification defines a rigid entity namespace structure composed of catalog, schema and object name. Extending that implies going beyond what's specified, so it requires extra careful thought and consideration.
One of the most important questions is whether more parts implies more hierarchy levels or is just a syntactic extension to the set of characters allowed for a table name. Both options have downsides.
Additional hierarchy levels introduce ambiguities for commands that depend on the rigid 3-part naming structure. It's unclear how commands like USE <catalog>.<schema>
, USE <catalog>
, SHOW SCHEMAS FROM <catalog>
, SHOW TABLES FROM [<catalog> .] <schema>
, CREATE TABLE [[<catalog> .] <schema> .] <table>
, etc., would work in this world. If x
is a table, is x.y
some kind of nested table? How does one discover such tables? What about nested catalogs or nested schemas? With this approach, there's a competing desire for allowing nested catalogs, instead (e.g., to support "mounting" entire databases into a single Presto catalog) -- something we considered in the past, but didn't go anywhere for all these reasons.
It also affects SELECT
queries in non-trivial ways. For instance, the SQL specification is clear about x.y
in the following query referring to table y
under schema x
in the current catalog:
SELECT * FROM x.y
Changing the semantics to refer to nested table y
under table x
under the current schema/catalog is a departure from the specification. Querying such nested tables would require using their fully-qualified name, which doesn't seem particularly user friendly.
Similarly, what would CREATE TABLE x.y
do? Currently (and per the SQL specification), it means "create table y
under schema x
in the current catalog".
The option of a syntactic extension to the set of characters allowed for table names also has additional problems. First, there's the cognitive discordance of a.b.c.d
not "splitting" uniformly as (a, b, c, d) parts, but as (a, b, c.d). It also introduces parsing ambiguities. For instance, in a query like:
SELECT * FROM x.y
Does x.y
parse as (. "x" "y")
or ("x.y")
?
In the latter case, a query that looks valid to the naked eye (and is actually valid SQL if x
were a schema), would actually not work:
SELECT y.some_column FROM x.y
I agree that it is good to be careful, but I think there are reasonable ways to make this work.
Let's ignore the case of nested tables to start with. I think that can over-complicate this because it mixes two things together that should be separate: how to interpret a identifiers and what an identifier references.
The main argument here appears to be that x.y
refers to schema x
and table y
. That still works with a simple set of rules:
y
, refers to a table in the session's current schema and the default catalog. The full identifier passed to the catalog is (current-schema).y
.x.y
refers to table y
in schema x
, so the identifier passed to the default catalog is x.y
.prod.x.y
loads x.y
from the prod
catalog.Following these rules, you can always get the catalog and the full identifier to load from that catalog. After that, what gets loaded is up to the catalog, so a metadata table like x.y.history
doesn't affect how identifiers are interpreted. The catalog can also allow listing x.y
as a namespace in addition to loading it as a table for discovery.
The only overlap is between the default catalog and full catalog identifiers: a.b.c
could be table b.c
in catalog a
or could be the table a.b.c
in the default catalog. But, the set of catalogs is controlled by administrators, doesn't change often, and determined at configuration time so this is would only rarely cause confusion.
These rules also work for several of the cases you mentioned. The multi-part identifier passed to USE
is broken down by checking the first part. If it is a catalog, then the rest is the namespace. Otherwise, the entire identifier is a namespace. That is used to set the current namespace, and the catalog is changed if it was set. (You can also disallow changing the catalog this way and fail if you see a catalog -- interpreting the identifier is the same whatever behavior you choose.)
this would help Sybase connector (https://github.com/prestosql/presto/issues/2481) as server.db.owner.table is valid identifier there
The Iceberg table format exposes metadata tables to make it easy to inspect a table. This includes data files, manifest files, snapshots, and table history. (Docs)
In Spark, Iceberg exposes the metadata tables by adding another name part to the table identifier. History for
db.table
is loaded usingdb.table.history
. This is easy for users to remember and doesn't require users to escape special characters.Is it possible to update Presto to use the same convention? I know there is some support for
db."table$metaTableName"
, but this is awkward compared to adding another identifier part. It also requires remembering how to escape$
.For background, Spark allows identifying tables by any number of identifier parts. This allows a catalog to expose its internal structure as well as supporting metadata table names.