Open marton-bod opened 5 months ago
default TableStatistics getTableStats()
the table handle is a data object, it cannot do IO. getting statistics often requires doing an IO
generally, the handle interfaces are so called "marker interfaces". They do not provide operations. All operations are provided by the connector, e.g. via ConnectorMetadata.
default boolean supportsSorting()
Table handle represents a table, or a relation, an effect of pushdowns etc. What does it mean that "table or relation supports sorting"?
the table handle is a data object, it cannot do IO. getting statistics often requires doing an IO
Agree with you on this. I should clarify: it would return an Optional\<TableStatistics>, only when stats have been read as part of the query anyway (e.g. a join). Otherwise it would be empty. All the methods just return stored fields, there is no computation happening.
What does it mean that "table or relation supports sorting"?
Maybe I need to rename the method to make it clear. I was referring to the table supporting the concept of a sort spec / sorting columns (like Iceberg).
: it would return an Optional
, only when stats have been read as part of the query anyway (e.g. a join). Otherwise it would be empty.
stats are read in io.trino.spi.connector.ConnectorMetadata#getTableStatistics
which takes table handle (and does not return a new table handle). thus table handle cannot know whether stats were read or not
All the methods just return stored fields, there is no computation happening.
Engine / Trino SPI makes no assumption about how table handle is implemented, and how the information is stored (fields vs something else). In fact, in extreme case, it can have no fields at all if a connector provides exactly one table.
stats are read in io.trino.spi.connector.ConnectorMetadata#getTableStatistics which takes table handle (and does not return a new table handle). thus table handle cannot know whether stats were read or not
That's correct, and this is something I'm trying to figure out still. As a temporary solution, I've used a setter to store the stats in the handle, but I know it's not the best pattern.
Engine / Trino SPI makes no assumption about how table handle is implemented, and how the information is stored (fields vs something else)
Agree completely. That's why I'm not aiming to change the ConnectorTableHandle marker interface which needs to stay completely generic, but instead introduce a DataLakeTableHandle for Iceberg/Hive/Hudi/Delta, where there are shared similarities that we can take as a starting point (there are columns, partitioning, etc)
Proposal is to have a new interface,
DataLakeTableHandle
, which extendsConnectorTableHandle
and will be implemented by the Iceberg/Delta/Hudi/Hive table handles. This new table handle interface would expose some common shared concepts such as:One use case is to then include some of this information into QueryCompletedEvent/TableInfo, unlocking the opportunity to perform more in-depth offline analyses and provide recommendations to users (e.g. compare most frequently-used predicate columns to the partitioning/sorting scheme)