Open marcinsbd opened 1 year ago
The time zone conversion in the Hive writer happens here: https://github.com/trinodb/trino/blob/e82b604b677069c937f78aa2d202ffc604048106/lib/trino-parquet/src/main/java/io/trino/parquet/writer/valuewriter/Int96TimestampValueWriter.java#L91
We should see how the Iceberg implementation handles this and how they Iceberg community thinks this should behave. My thought is that files with non-UTC INT96 timestamps are not compatible with Iceberg, unless they add a new Iceberg metadata property that tells how to convert it on read.
When we insert timestamp to a Hive table with Parquet file format and migrate the table to Iceberg, the values read by Hive and Iceberg differs.
Hive: It occurs that Hive takes the timestamp and adjust it according to the writer's TZ to get UTC value which then is written down. Hive adds also the writer TZ info to the footer. During read, Hive returns the original value. It reads the timestamp in UTC and adjust it according to the writer's TZ.
Trino uses the following property hive.parquet.time-zone for setting Parquet reader's and Parquet writer's timezone when dealing with Hive. https://github.com/trinodb/trino/blob/0b3b661cb275eeec6c57401f0981a74e775f87fd/plugin/trino-hive/src/main/java/io/trino/plugin/hive/parquet/ParquetFileWriterFactory.java#L84
https://github.com/trinodb/trino/blob/0b3b661cb275eeec6c57401f0981a74e775f87fd/plugin/trino-hive/src/main/java/io/trino/plugin/hive/parquet/ParquetPageSourceFactory.java#L147
Iceberg: Trino always sets the UTC timezone for setting Parquet reader's and Parquet writer's timezone. In this way, it always return UTC values.
https://github.com/trinodb/trino/blob/0b3b661cb275eeec6c57401f0981a74e775f87fd/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSourceProvider.java#L1010
The following test scenario shows the case(run within TZ UTC+1):
trino> CREATE TABLE hive.tpch.table1 (x timestamp(3)) with (format='PARQUET' ) ; CREATE TABLE
trino> INSERT INTO hive.tpch.table1 VALUES timestamp '2021-01-01 06:12:12.123'; INSERT: 1 row
trino> SELECT * from hive.tpch.table1; 2021-01-01 06:12:12.123
trino> CALL iceberg.system.migrate('tpch', 'table1');
trino> select * from iceberg.tpch.table1; 2021-01-01 05:12:12.123000
The values returned are different by writer's TZ 2021-01-01 06:12:12.123 <> 2021-01-01 05:12:12.123000