prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0

Support non iceberg native types on iceberg connector #23410

Open hantangwangd opened 1 month ago

hantangwangd commented 1 month ago

If we want the Iceberg connector to support types that are not native to Iceberg, for example char(10) or interval year to month, we need to store the column data as an Iceberg native type and record extra type information alongside that column's native type. For example, for the following table creation statement:

create table iceberg.default.test_table(a int, b varchar, c interval year to month, d char(10));

We need to record the following additional metadata somewhere:

In this way, the Presto engine can use this extra information internally to interpret the stored Iceberg native type data as the corresponding Presto engine type.
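To make the interpretation step concrete, here is a minimal sketch of how an engine might combine a stored native type with its extra type information. The native storage choices (string for char(n), int for interval year to month), the shape of the extra information, and the names `StoredColumn` and `resolve_engine_type` are all assumptions made for illustration, not the actual design in the PR.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class StoredColumn:
    """Hypothetical view of a column as persisted in the Iceberg schema."""
    name: str
    native_type: str                       # Iceberg native type actually stored
    extra: Optional[Dict[str, str]] = None  # extra type info, if any was recorded


def resolve_engine_type(col: StoredColumn) -> str:
    """Return the engine-side type name for a stored column.

    Assumption: char(n) is stored as 'string' and interval year to month
    is stored as 'int'. When no usable extra info exists, the column is
    read as the plain native type.
    """
    extra = col.extra or {}
    kind = extra.get("type")
    if kind == "char" and col.native_type == "string":
        return f"char({extra['length']})"
    if kind == "interval year to month" and col.native_type == "int":
        return "interval year to month"
    # Fallback: map the native type directly.
    return {"int": "integer", "string": "varchar"}.get(col.native_type, col.native_type)


columns = [
    StoredColumn("a", "int"),
    StoredColumn("b", "string"),
    StoredColumn("c", "int", {"type": "interval year to month"}),
    StoredColumn("d", "string", {"type": "char", "length": "10"}),
]
resolved = {c.name: resolve_engine_type(c) for c in columns}
```

With these assumptions, columns a and b resolve to their native mappings, while c and d resolve to the extended types only because the extra information is present.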

However, Iceberg's own type system is very streamlined: it provides neither a place to record extra type information nor support for type extension. Therefore, we need to find a place to record the extra type information for each column. There are two options:

The first option requires support from an external metadata storage system and introduces some complexity in terms of usage and coding.

As for the second option, the doc attribute of a nested field in Iceberg can hold an arbitrary string, and it is currently used mainly to store the comment specified through --comment when creating columns. We can therefore extend its content with extra type information that is visible to and used by the computing engine only. This way, we do not need external metadata storage, and we avoid the tedious work of maintaining the mapping between columns and their type extension information. So this may be the better solution.

Take CharType as an example:

        --[TYPE_EXTRA_INFO(length=10)]
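As a rough illustration of how such a marker could share the doc string with a user comment, here is a minimal sketch. It assumes the marker is appended after the user-visible comment and that its body is a comma-separated list of key=value pairs; the function names `encode_doc` and `decode_doc` are hypothetical and not taken from the PR.

```python
import re
from typing import Dict, Optional, Tuple

# Matches a trailing marker such as --[TYPE_EXTRA_INFO(length=10)]
_MARKER = re.compile(r"--\[TYPE_EXTRA_INFO\((?P<body>[^)]*)\)\]\s*$")


def encode_doc(comment: Optional[str], extra: Dict[str, str]) -> str:
    """Append the extra type info marker to the user-visible comment."""
    body = ",".join(f"{key}={value}" for key, value in extra.items())
    return f"{comment or ''}--[TYPE_EXTRA_INFO({body})]"


def decode_doc(doc: Optional[str]) -> Tuple[Optional[str], Dict[str, str]]:
    """Split a doc string into (user comment, extra type info)."""
    if not doc:
        return None, {}
    match = _MARKER.search(doc)
    if match is None:
        # No marker: the whole doc is an ordinary comment.
        return doc, {}
    comment = doc[:match.start()] or None
    pairs = (p for p in match.group("body").split(",") if p)
    extra = dict(p.split("=", 1) for p in pairs)
    return comment, extra


doc = encode_doc("customer code", {"length": "10"})
comment, extra = decode_doc(doc)
```

The engine would show only `comment` to users and use `extra` internally when resolving the column type.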

This feature can be used to support types that are supported in Presto but not natively supported by Iceberg, such as CharType, IntervalYearMonthType, IntervalDayTimeType, etc.

Additionally, some of these types may become native to Iceberg in the future, such as CharType (see https://github.com/apache/iceberg/issues/10461, although it is uncertain whether or when it will be adopted). Because the implementation above is based on extra metadata information, we can easily accommodate both the old and the new versions of these types.

Expected Behavior or Use Case

Provide a way to support types that are supported in Presto but not natively supported by Iceberg.
Support full-size CharType, which is a common type in TPC-DS and TPC-H.

Presto Component, Service, or Connector

Iceberg Connector

Possible Implementation

PR #23398

hantangwangd commented 1 month ago

cc: @tdcmeehan @ZacBlanco

tdcmeehan commented 1 month ago

If the extra type information is stored in the metadata, won't it be susceptible to cross-engine compatibility issues? For example, if both Presto and Spark write to a table, wouldn't it be possible for Spark to clobber the extra type information?

hantangwangd commented 1 month ago

Thanks, that's a great question. When considering cross-engine compatibility, other engines can indeed destroy or erase this extra type information by altering the table to modify the column comments. The result would be that the column's type degenerates from the extended type to the corresponding native type.
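A small sketch of that failure mode, assuming a char(10) column stored as an Iceberg string with the marker in its doc string (the helper `presto_type` and the string-to-varchar fallback are illustrative assumptions):

```python
from typing import Optional

MARKER_PREFIX = "--[TYPE_EXTRA_INFO("


def presto_type(native_type: str, doc: Optional[str]) -> str:
    """Resolve the engine type from the native type plus the doc marker."""
    if native_type == "string" and doc and MARKER_PREFIX in doc:
        start = doc.index(MARKER_PREFIX) + len(MARKER_PREFIX)
        body = doc[start:doc.index(")]", start)]
        fields = dict(p.split("=", 1) for p in body.split(",") if p)
        if "length" in fields:
            return f"char({fields['length']})"
    # Without the marker, only the native type is left to go on.
    return "varchar" if native_type == "string" else native_type


doc = "--[TYPE_EXTRA_INFO(length=10)]"
before = presto_type("string", doc)   # extended type is recovered

# Another engine rewrites the column comment and drops the marker.
doc = "updated by another engine"
after = presto_type("string", doc)    # degenerates to the native type
```

Here `before` is char(10) and `after` is varchar: the data is still readable, but the length semantics are silently lost.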

It's indeed a flaw in this solution, and currently I can't figure out a way to avoid it. It seems that this problem can only be solved completely at the level of the Iceberg library implementation.

I don't have strong feelings about this solution; I just wanted to bring it up for discussion. Maybe it's better to leave things as they are until we figure out a better solution.