Open hantangwangd opened 1 month ago
cc: @tdcmeehan @ZacBlanco
If the extra type information is stored in the metadata, won't it be susceptible to cross-engine compatibility issues? For example, if both Presto and Spark write to a table, wouldn't it be possible for Spark to clobber the extra type information?
Thanks for the information, that's a great question. When considering cross-engine compatibility, other engines can indeed destroy or erase this extra type information by altering the table to modify column comments. The result would be that the column's type degenerates from the extended type to the corresponding native type.
It's indeed a flaw in this solution, and currently I can't figure out a way to avoid it. It seems that this problem can only be completely solved at the level of the Iceberg library implementation.
I don't have a strong feeling about this solution; I just brought it up for discussion. Maybe it's better to leave things as they are until we figure out a better solution.
If we want to support non-Iceberg-native types in the Iceberg connector, for example `char(10)` or `interval year to month`, we need to store the column data as Iceberg native types and then record extra type information attached to the native type of that column. For example, for a table `test_table` with a column c of type `interval year to month` and a column d of type `char(10)`, we need to record the following additional metadata somewhere:
- `test_table.c` is `INT`, and the extra type information is `INTERVAL YEAR TO MONTH`
- `test_table.d` is `STRING`, and the extra type information is `CHAR(10)`
In this way, the Presto engine can use this extra information internally to interpret the stored Iceberg native type data as the corresponding Presto engine type.
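The interpretation step described above can be sketched roughly as follows. This is only an illustration, not the code in the PR: `ColumnTypeRecord` and `resolve_engine_type` are hypothetical names, and types are modeled as plain strings for brevity.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ColumnTypeRecord:
    """Hypothetical record: what Iceberg stores plus what only the engine knows."""
    column: str
    native_type: str                  # type physically stored by Iceberg
    extra_type: Optional[str] = None  # engine-only extended type, if recorded


def resolve_engine_type(record: ColumnTypeRecord) -> str:
    """Return the extended type when it is recorded, else the native type."""
    if record.extra_type is not None:
        return record.extra_type
    return record.native_type


# Column c from the example: stored as INT, interpreted as an interval.
col_c = ColumnTypeRecord("test_table.c", "INT", "INTERVAL YEAR TO MONTH")

# If the extra information is lost, the column falls back to its native type.
col_c_without_extra = replace(col_c, extra_type=None)

print(resolve_engine_type(col_c))
print(resolve_engine_type(col_c_without_extra))
```

The fallback branch is the degeneration case: without the extra information, the engine can only see the native type.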
However, Iceberg's own type system is very streamlined: it does not provide a place to record extra type information, nor does it support type extension. Therefore, we need to figure out a place to record the extra type information corresponding to each column. There are two options:
1. Maintain a mapping `schema.table.column -> extra type info` in an external metadata store
2. Record the extra type information in the attribute `doc` of the nested field corresponding to the data column
The first option requires support from an external metadata storage system and introduces some complexity in terms of usage and coding.
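To make the bookkeeping cost of the first option more concrete, here is a minimal sketch that stands in for the external store with an in-memory dictionary. The schema name `tpch` and the helpers `lookup_extra_type` and `rename_column` are assumptions made up for this example.

```python
from typing import Dict, Optional, Tuple

# Hypothetical external store: (schema, table, column) -> extra type info.
ColumnKey = Tuple[str, str, str]
external_type_info: Dict[ColumnKey, str] = {
    ("tpch", "test_table", "d"): "CHAR(10)",  # "tpch" is an assumed schema name
}


def lookup_extra_type(store: Dict[ColumnKey, str], schema: str,
                      table: str, column: str) -> Optional[str]:
    """Return the recorded extra type info, or None if nothing is recorded."""
    return store.get((schema, table, column))


def rename_column(store: Dict[ColumnKey, str], schema: str,
                  table: str, old: str, new: str) -> None:
    """Every DDL change must also be mirrored into the external mapping."""
    info = store.pop((schema, table, old), None)
    if info is not None:
        store[(schema, table, new)] = info


rename_column(external_type_info, "tpch", "test_table", "d", "d_renamed")
print(lookup_extra_type(external_type_info, "tpch", "test_table", "d_renamed"))
```

Any rename or drop performed without going through such a helper would leave a stale or missing entry, which is the maintenance burden this option carries.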
When considering the second option, an arbitrary string can be stored in the attribute `doc` of the nested field in Iceberg, which is currently mainly used to store the comment information specified through `--comment` when creating columns. Therefore, we can expand its content by adding extra type information that is only visible and used internally by the computing engine. In this way, we do not need the support of external metadata storage and avoid the tedious work of maintaining the mapping between columns and their corresponding type extension information. So this may be a better solution.
Take `CharType` as an example. For a column a of type `char(10)`, we save it as a String type in Iceberg and append extra type information at the end of the `nestedField.doc` corresponding to column a. When reading column a in the Presto engine, we obtain its Iceberg type as String and the extra type information `length=10` recorded in the corresponding `nestedField.doc`. Therefore, we interpret it as a `char(10)` type inside the computing engine.
The extra type information `--[TYPE_EXTRA_INFO(...)]` is only visible within the computing engine. Therefore, when users view the column information of `test_table` through `show create table test_table` or `show columns in test_table`, this extended information will not be displayed in the comment of column d, and its corresponding type will be `char(10)`.
This feature can be used to support types that are supported in Presto but not natively supported by Iceberg, such as `CharType`, `IntervalYearMonthType`, `IntervalDayTimeType`, etc.
Additionally, for types that may be natively supported by Iceberg in the future, such as CharType (see https://github.com/apache/iceberg/issues/10461, although it is uncertain whether or when it will be adopted), because our implementation is based on extra metadata, we can easily accommodate both the old and new versions of these types.
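The `doc`-based encoding described above might look roughly like the sketch below. The exact marker layout is an assumption: the text only shows `--[TYPE_EXTRA_INFO(...)]` and `length=10`, so the combined form `--[TYPE_EXTRA_INFO(length=10)]` and the helper names `append_extra_info`, `split_doc`, and `engine_type` are hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumed marker layout; the proposal only shows "--[TYPE_EXTRA_INFO(...)]".
_MARKER = re.compile(r"--\[TYPE_EXTRA_INFO\((?P<info>[^)]*)\)\]$")


def append_extra_info(comment: Optional[str], info: str) -> str:
    """Append the engine-only marker to the end of the user-visible comment."""
    return f"{comment or ''}--[TYPE_EXTRA_INFO({info})]"


def split_doc(doc: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Split a doc string into (user comment, extra type info)."""
    if doc is None:
        return None, None
    match = _MARKER.search(doc)
    if match is None:
        return doc, None  # no marker: plain comment, native type only
    comment = doc[: match.start()] or None
    return comment, match.group("info")


def engine_type(native_type: str, doc: Optional[str]) -> str:
    """Interpret a string column with a recorded length as char(n)."""
    _, info = split_doc(doc)
    if native_type == "string" and info and info.startswith("length="):
        return f"char({info[len('length='):]})"
    return native_type


doc = append_extra_info("customer code", "length=10")
visible_comment, extra = split_doc(doc)   # what users see vs. what the engine uses
print(visible_comment, extra, engine_type("string", doc))

# Another engine overwriting the comment drops the marker entirely.
print(engine_type("string", "comment rewritten elsewhere"))
```

Only the comment part returned by `split_doc` would be shown to users, while the marker stays internal. The last call shows the cross-engine risk raised in the discussion: once the marker is overwritten, the column is read back as the plain native type.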
Expected Behavior or Use Case
- Provide a way to support types that are supported in Presto but not natively supported by Iceberg
- Support full-size `CharType`, which is a common type in TPC-DS and TPC-H
Presto Component, Service, or Connector
Iceberg Connector
Possible Implementation
PR #23398