trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
10.49k stars 3.02k forks source link

Trino Iceberg Timestamp with Timezone Query Rendering/Display Enhancment #21871

Open sbrackenbury-teranet opened 6 months ago

sbrackenbury-teranet commented 6 months ago

A Trino query against an iceberg table timestamp column renders the query result differently from how Spark renders the query result for the same iceberg table:

Spark-SQL:

set time zone local;

CREATE TABLE spark_catalog.ts_iceberg.example_timestamp_ltz (ltz TIMESTAMP) USING iceberg 
LOCATION '<a_path>/ts_iceberg.db/example_timestamp_ltz' 
TBLPROPERTIES ('current-snapshot-id' = 'none', 
                                 'format' = 'iceberg/parquet', 
                                 'format-version' = '2')

insert into ts_iceberg.example_timestamp_ltz values(timestamp '2024-03-25 12:12:12.123456 America/Toronto');

select * from ts_iceberg.example_timestamp_ltz;
2024-03-25 16:12:12.123456  <- UTC normalized and timestamp is rendered using local time zone (UTC)

set time zone 'America/Toronto';
select * from ts_iceberg.example_timestamp_ltz;
2024-03-25 12:12:12.123456  <- timestamp rendered using time zone ‘America/Toronto’

Trino SQL (against the same Spark generated iceberg table above):

set time zone local;
select * from iceberg.ts_iceberg.example_timestamp_ltz;
2024-03-25 16:12:12.123456 UTC  <- timestamp rendered with ‘UTC’ suffix which is different than Spark

set time zone 'America/Toronto';
select * from iceberg.ts_iceberg.example_timestamp_ltz;
2024-03-25 16:12:12.123456 UTC  <- set time zone‘America/Toronto’ ignored – UTC localized with UTC suffix

Postgres rendering behavior for a table with a timestamptz column is like Spark:

create table public.test_iceberg (test_column timestamptz NULL);

ez_utility_uat=> show timezone;
TimeZone
----------
 UTC
(1 row)

ez_utility_uat=> insert into public.test_iceberg_ts_wtz values(timestamptz '2024-03-25 12:12:12.123456 America/Toronto');
INSERT 0 1

ez_utility_uat=> select * from public.test_iceberg_ts_wtz;
          test_column
-------------------------------
 2024-03-25 16:12:12.123456+00  <- UTC normalized and timestamp is rendered using local time zone (UTC)
(1 row)

ez_utility_uat=> set time zone 'America/Toronto';
SET

ez_utility_uat=> select * from public.test_iceberg_ts_wtz;
          test_column
-------------------------------
 2024-03-25 12:12:12.123456-04  <- timestamp rendered using session time zone ‘America/Toronto’
(1 row) 

The only option available in Trino for rendering the timestamp column at the desired timezone is to employ the Trino at_timezone function. This is undesirable as it requires a code change to existing queries. Adding the ability for Trino to transparently render the Timestamp based on a set/configured Timezone similar to how Spark and Postgres behaves is preferred because it is less intrusive.

This PR introduces configurable Trino query behavior for rendering Iceberg table timestamp columns in the same manner that Spark and Postgres render timestamp with time zone data type columns. It simplifies and encourages the adoption of Iceberg format tables that have columns of type timestamp with time zone.

The enhancement ensures that an Iceberg table timestamp column value:

1) Is normalized to UTC time 2) Is rendered at query time according to a default or specified session Time Zone

martint commented 6 months ago

In Trino, timestamp with time zone is rendered with its stored time zone. This is intentional, as the insertion time zone is considered to have meaning. Readers can choose to render it in a different time zone via at time zone. The session time zone is used for conversions between timestamp and timestamp with time zone inside the query engine, as described by the SQL specification.

There are roughly three phases of processing when reading timestamp values that need to be considered:

Changing the representation after the value is read from storage is problematic. It makes it impossible to observe the value as it was stored, with its original time zone. Changing the representation when the value is transported to the client requires changes to the protocol to allow clients to express such preference. Adjusting the rendering in the client requires each client to be able to do so independently.

As an aside, Iceberg's timestamp with time zone type is stored in UTC, and the timezone information is lost. From the Iceberg spec (https://iceberg.apache.org/spec/#primitive-types):

Timestamp values with time zone represent a point in time: values are stored as UTC and do not retain a source time zone (2017-11-16 17:10:34 PST is stored/retrieved as 2017-11-17 01:10:34 UTC and these values are considered identical).

sbrackenbury-teranet commented 6 months ago

First of all, thank-you @martint for taking the time to review and consider this issue and our PR.

After reviewing your comments and doing further testing of our PR. We recognize we need to revisit our code and the PR submission.

We still have a strong preference for the rendering/display implementation of timestamp with time zone on the server side and not client side precisely because, to your point, adjusting the rendering/display on the client would be challenging. We prefer a configurable behavior on the server side so as to be transparent to the client, while supporting backward compatibility.