Support non iceberg native types on iceberg connector

hantangwangd commented 3 months ago

If we want to support non Iceberg native types on Iceberg connector, for example char(10) or interval year to month, we need to store the column data as Iceberg native types and then record extra type information attached to the native type of that column. For example, for table creation statement:

create table iceberg.default.test_table(a int, b varchar, c interval year to month, d char(10));

We need to record the following additional metadata somewhere:

The data type of test_table.c is INT, and the extra type information is INTERVAL YEAR TO MONTH
The data type of test_table.d is INT, and the extra type information is CHAR(10)

In this way, the Presto engine can use these extra information internally to interpret the actual stored Iceberg native type data into the corresponding Presto engine type.

However, Iceberg's own types system is very streamlined and does not provide a place to record extra type information, nor does it provide the support for type extension. Therefore, we need to figure out a place to record the extra type information corresponding to each column. There are two options:

Utilizing an external metadata storage system, in which record the mapping of schema.table.column -> extra type info
Saving this information within Iceberg metadata, for example, record extra type information in the attribute doc of the nested field corresponding to the data column

The first option requires support from an external metadata storage system and introduces some complexity in terms of usage and coding.

When considering the second option, an arbitrary string can be stored in the attribute doc of the nested field in Iceberg, which is currently mainly used to store the comment information specified through --comment when creating columns. Therefore, we can expand its content by adding extra type information that is only visible and used internally by the computing engine. In this way, we do not need the support of external metadata storage and avoid the tedious work of maintaining the mapping between columns and their corresponding type extension informations. So this may be a better solution.

Take CharType as an example:

When a column a is declared as char(10), we save it as a String type in Iceberg and append the following extra type information at the end of the nestedField.doc corresponding to column a:

        --[TYPE_EXTRA_INFO(length=10)]

When reading column a in the Presto engine, we obtain its Iceberg type as String and the extra type information 'length=10' recorded in the corresponding nestedField.doc. Therefore, we will interpret it as a char(10) type inside the calculating engine.
The extra type information --[TYPE_EXTRA_INFO(...)] is only visible within the calculating engine. Therefore, when users view the columns information of test_table through 'show create table test_table' or 'show columns in test_table', this extended information will not be displayed in the comment of column d, and its corresponding type will be char(10).

This feature can be used to support types that are supported in Presto but not natively supported by Iceberg, such as CharType, IntervalYearMonthType, IntervalDayTimeType, etc.

Additionally, for types that may be natively supported by Iceberg in future, such as CharType (Referring to: https://github.com/apache/iceberg/issues/10461, although uncertain whether it will be adopted or when it will be adopted), due to our implementation above being based on extra metadata information, we can easily accommodate both the old and new version of these types.

Expected Behavior or Use Case

Provide a way to support types that are supported in Presto but not natively supported by Iceberg Support full-size CharType which is a common type in TPCDS and TPCH

Presto Component, Service, or Connector

Iceberg Connector

Possible Implementation

PR #23398

hantangwangd commented 3 months ago

cc: @tdcmeehan @ZacBlanco

tdcmeehan commented 3 months ago

If the extra type information is stored in the metadata, won't it be susceptible to cross-engine compatibility issues? For example, if both Presto and Spark write to a table, wouldn't it be possible for Spark to clobber the extra type information?

hantangwangd commented 3 months ago

Thanks for the information, that's a great question. When considering cross engine compatibility, other engines can indeed destroy or erase these extra type information by altering table to modify the comment of columns. The result would be that the type of the column degenerates from the extended type to the corresponding native type.

It's indeed a flaw in this solution, and currently I can't figure out a way to avoid it. It seems that this problem can only be completely solved at the level of Iceberg library implementation.

I don't have a strong feeling about this solution, just bring it for discussion. Maybe it's better to leave things as they are until we figure out a better solution.

mderoy commented 2 months ago

I think the best we can do is promote movement on https://github.com/apache/iceberg/issues/10461 for those character types... currently I'm the only one who has brought this up so probably they do not think the community has much of a demand for it...but as I wrote in the proposal, fixed and padded string types are super important for lots of SQL engines.

As for the native type annotations... I like it.... but I have the same concerns as Tim... I've been thinking about how we might be able to support some of the currently unsupported datatypes in another engine, and the best I've come up with is using tableproperties to annotate the FIDs for certain columns.... something like EXTRA_TYPE_INFO={0:VARCHAR(20), 2:CHAR(2)}..... but since other engines won't necessarily respect these annotations it really makes things difficult since they can insert data longer than these fixed limits, or without the proper padding... probably custom types should be handled in BINARY or FIXED iceberg types, though that hurts interoperability of those tables.

hantangwangd commented 2 months ago

Hi @mderoy, thanks for providing your perspective.

as I wrote in the proposal, fixed and padded string types are super important for lots of SQL engines.

I agree with this, especially considering the large number of char type columns defined and used in tpc-h and tpc-ds. It would definitely be the best if the Iceberg community supports fixed length string types as native primitive types.

since other engines won't necessarily respect these annotations it really makes things difficult since they can insert data longer than these fixed limits, or without the proper padding... probably custom types should be handled in BINARY or FIXED iceberg types, though that hurts interoperability of those tables.

When considering cross engine issues, things are indeed quite challenging, for example, a worse scenario, there are no rules that can prevent another engine from destroying these extended type annotations, no matter it's recorded in the comment of columns or in table properties.

So maybe the best solution is that Iceberg community define this extension mechanism at the specification level.

tdcmeehan commented 2 months ago

I doubt that this extension mechanism would be codified in the Iceberg spec, because this could increase fragmentation in the Iceberg ecosystem by introducing ambiguity which previously did not exist. While the spec change would prevent metadata from being clobbered by systems which do not recognize the custom type, it won't solve the problem of how do systems recognize these types.

At the same time, businesses who are trying to provide services around Iceberg cannot be constrained by the specification, in the same way that SQL engines often introduce features before there is guidance added in the SQL standard. There should be some flexibility for a middle ground that allows you to enrich functionality that is found in the spec, without introducing incompatible or potentially corruptible changes.

I think a catalog-oriented solution is the most realistic way of going about this. I can see two potential approaches:

1) Write a custom catalog, and develop in the Presto Iceberg connector the ability to load a custom catalog and properly integrate it. This custom catalog can decorate existing functionality in TableOperations and add derivations of Type based on additional information in a proprietary catalog or file. 2) In a similar way, I think it should be possible to provide enriched metadata through a custom implementation of the REST catalog server. This custom catalog can derive the table information primarily from another catalog, but add additional information which we could interpret through a minimal implementation of the REST catalog client (a subclass which has some special hooks for things such as additional types).

Tl;dr I think the important thing to do here is allow for the flexibility for private parties to enrich what is 100% Iceberg spec conformant, without introducing incompatible changes to the underlying manifest and metadata files.

hantangwangd commented 2 months ago

I think the important thing to do here is allow for the flexibility for private parties to enrich what is 100% Iceberg spec conformant, without introducing incompatible changes to the underlying manifest and metadata files.

Agree, it seems that this is currently the most realistic and secure solution considering the fact that Iceberg spec is not so easy to make change . It's worth doing some detailed investigations following this thinking.

prestodb / presto