trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
10.43k stars 3k forks source link

Support additional identifier parts for tables #1135

Open rdblue opened 5 years ago

rdblue commented 5 years ago

The Iceberg table format exposes metadata tables to make it easy to inspect a table. This includes data files, manifest files, snapshots, and table history. (Docs)

In Spark, Iceberg exposes the metadata tables by adding another name part to the table identifier. History for db.table is loaded using db.table.history. This is easy for users to remember and doesn't require users to escape special characters.

Is it possible to update Presto to use the same convention? I know there is some support for db."table$metaTableName", but this is awkward compared to adding another identifier part. It also requires remembering how to escape $.

For background, Spark allows identifying tables by any number of identifier parts. This allows a catalog to expose its internal structure as well as supporting metadata table names.

martint commented 5 years ago

While this sounds like an interesting feature, it's problematic for a number of reasons.

The SQL specification defines a rigid entity namespace structure composed of catalog, schema and object name. Extending that implies going beyond what's specified, so it requires extra careful thought and consideration.

One of the most important questions is whether more parts implies more hierarchy levels or is just a syntactic extension to the set of characters allowed for a table name. Both options have downsides.

Additional hierarchy levels introduce ambiguities for commands that depend on the rigid 3-part naming structure. It's unclear how commands like USE <catalog>.<schema>, USE <catalog>, SHOW SCHEMAS FROM <catalog>, SHOW TABLES FROM [<catalog> .] <schema>, CREATE TABLE [[<catalog> .] <schema> .] <table>, etc., would work in this world. If x is a table, is x.y some kind of nested table? How does one discover such tables? What about nested catalogs or nested schemas? With this approach, there's a competing desire for allowing nested catalogs, instead (e.g., to support "mounting" entire databases into a single Presto catalog) -- something we considered in the past, but didn't go anywhere for all these reasons.

It also affects SELECT queries in non-trivial ways. For instance, the SQL specification is clear about x.y in the following query referring to table y under schema x in the current catalog:

SELECT * FROM x.y

Changing the semantics to refer to nested table y under table x under the current schema/catalog is a departure from the specification. Querying such nested tables would require using their fully-qualified name, which doesn't seem particularly user friendly.

Similarly, what would CREATE TABLE x.y do? Currently (and per the SQL specification), it means "create table y under schema x in the current catalog".

The option of a syntactic extension to the set of characters allowed for table names also has additional problems. First, there's the cognitive discordance of a.b.c.d not "splitting" uniformly as (a, b, c, d) parts, but as (a, b, c.d). It also introduces parsing ambiguities. For instance, in a query like:

SELECT * FROM x.y

Does x.y parse as (. "x" "y") or ("x.y")?

In the latter case, a query that looks valid to the naked eye (and is actually valid SQL if x were a schema), would actually not work:

SELECT y.some_column FROM x.y
rdblue commented 5 years ago

I agree that it is good to be careful, but I think there are reasonable ways to make this work.

Let's ignore the case of nested tables to start with. I think that can over-complicate this because it mixes two things together that should be separate: how to interpret a identifiers and what an identifier references.

The main argument here appears to be that x.y refers to schema x and table y. That still works with a simple set of rules:

  1. A single identifier, like y, refers to a table in the session's current schema and the default catalog. The full identifier passed to the catalog is (current-schema).y.
  2. Any identifier with multiple parts is interpreted as a full identifier. x.y refers to table y in schema x, so the identifier passed to the default catalog is x.y.
  3. If the first part of a multi-part identifier is a catalog, then interpret the remaining parts as the identifier to pass to the catalog. For example, prod.x.y loads x.y from the prod catalog.

Following these rules, you can always get the catalog and the full identifier to load from that catalog. After that, what gets loaded is up to the catalog, so a metadata table like x.y.history doesn't affect how identifiers are interpreted. The catalog can also allow listing x.y as a namespace in addition to loading it as a table for discovery.

The only overlap is between the default catalog and full catalog identifiers: a.b.c could be table b.c in catalog a or could be the table a.b.c in the default catalog. But, the set of catalogs is controlled by administrators, doesn't change often, and determined at configuration time so this is would only rarely cause confusion.

These rules also work for several of the cases you mentioned. The multi-part identifier passed to USE is broken down by checking the first part. If it is a catalog, then the rest is the namespace. Otherwise, the entire identifier is a namespace. That is used to set the current namespace, and the catalog is changed if it was set. (You can also disallow changing the catalog this way and fail if you see a catalog -- interpreting the identifier is the same whatever behavior you choose.)

tooptoop4 commented 4 years ago

this would help Sybase connector (https://github.com/prestosql/presto/issues/2481) as server.db.owner.table is valid identifier there