prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io

Equality semantics of TIMESTAMP WITH TIME ZONE type can cause inconsistent behavior #23252

Closed: bikramSingh91 closed this issue 1 month ago

bikramSingh91 commented 1 month ago


Context: The TIMESTAMP WITH TIME ZONE logical type (abbreviated below as TS_w_TZ) is backed by the BIGINT physical type. Timestamp values are stored in memory as 64-bit integers using an encoding that doesn't allow direct comparison of these integers: each value physically packs two integers into a single 64-bit word, using 12 bits for the timezone ID and 52 bits for a millisecond-precision timestamp in UTC.

Therefore, two TS_w_TZ values are compared using only their UTC time, ignoring the timezone information. This means that 4 PM EST would compare as equal to 1 PM PST. https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/type/TimestampWithTimeZoneOperators.java#L79
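The current equality can be observed directly. For instance (an illustrative sketch using the fixed offsets -05:00/-08:00 for EST/PST):

select from_iso8601_timestamp('2024-01-01T16:00:00-05:00')
     = from_iso8601_timestamp('2024-01-01T13:00:00-08:00');
-- true: both values denote 21:00:00 UTC, so only the UTC instants are compared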

While this may seem reasonable at first glance, it leads to inconsistencies in certain scenarios. For example, consider the following:

with a as (select from_iso8601_timestamp(x) as ts from unnest(array['2024-01-01T04:01:02-04:00', '2024-01-01T01:01:02-07:00']) as t(x))
select a.ts, hour(a.ts) from a;

 ts                             | _col1 
--------------------------------+-------
 2024-01-01 04:01:02.000 -04:00 |     4 
 2024-01-01 01:01:02.000 -07:00 |     1 

Now, if we join this table with itself using the "ts" column as the join key, we should ideally expect 4 results covering all possible combinations: (a, a), (a, b), (b, a), (b, b), as follows:

  ts                             | _col1 | ts_(1)                         | _col3 
--------------------------------+-------+--------------------------------+-------
 2024-01-01 04:01:02.000 -04:00 |     4 | 2024-01-01 04:01:02.000 -04:00 |     4 
 2024-01-01 04:01:02.000 -04:00 |     4 | 2024-01-01 01:01:02.000 -07:00 |     1 
 2024-01-01 01:01:02.000 -07:00 |     1 | 2024-01-01 04:01:02.000 -04:00 |     4 
 2024-01-01 01:01:02.000 -07:00 |     1 | 2024-01-01 01:01:02.000 -07:00 |     1 

However, the actual result we get is (a, a), (a, a), (b, b), (b, b) as follows:

with a as (select from_iso8601_timestamp(x) as ts from unnest(array['2024-01-01T04:01:02-04:00', '2024-01-01T01:01:02-07:00']) as t(x))
select a.ts, hour(a.ts), b.ts, hour(b.ts) from a, a as b where a.ts = b.ts;

 ts                             | _col1 | ts_(1)                         | _col3 
--------------------------------+-------+--------------------------------+-------
 2024-01-01 04:01:02.000 -04:00 |     4 | 2024-01-01 04:01:02.000 -04:00 |     4 
 2024-01-01 04:01:02.000 -04:00 |     4 | 2024-01-01 04:01:02.000 -04:00 |     4 
 2024-01-01 01:01:02.000 -07:00 |     1 | 2024-01-01 01:01:02.000 -07:00 |     1 
 2024-01-01 01:01:02.000 -07:00 |     1 | 2024-01-01 01:01:02.000 -07:00 |     1 

This happens because the planner makes the following reasonable assumption and therefore moves evaluation of the hour() function to after the join:

Given two values a and b and a deterministic function f:

if      a == b is true

then    f(a) == f(b) should also be true.

However, this does not hold when a and b are TS_w_TZ:

   4 PM EST == 1 PM PST 

hour(4 PM EST) != hour(1 PM PST) => 4 != 1
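Concretely (an illustrative query reusing the timestamps from the example above):

select a = b, hour(a), hour(b)
from (values (from_iso8601_timestamp('2024-01-01T04:01:02-04:00'),
              from_iso8601_timestamp('2024-01-01T01:01:02-07:00'))) as t(a, b);
-- a = b is true, yet hour(a) = 4 and hour(b) = 1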

Possible fixes:

  1. Change equality semantics to consider values with equal UTC time but different timezones as different (such as 4 PM EST and 1 PM PST)

    • This will be in sync with other functions like hour() that consider the timezone when operating on a value of this type
    • Furthermore, if required, we can introduce a new equivalence function that only compares the UTC time, or recommend a pattern that extracts the UTC time to maintain current use cases.
    • However, this would mean that we can no longer compare ( < or > ) two values, because ordering tz ids does not make sense. So we can just mark this type as not comparable, like we do with Maps
  2. Change functions like hour() to disregard the timezone and only return values as if reading the UTC time. This will bring these functions in sync with the current comparison and hash operators. However, it can be confusing to see hour(1 PM PST) = 21

  3. Keep things as they are. We embrace the inconsistencies.

Question about Origin:

In the spirit of “Respecting what came before”, it would be great if someone can share the historical context about this and help us understand why this comparison semantic was set up this way. Was it a deliberate design choice or an oversight? Were there any specific use cases or requirements that led to this implementation?

Also it would be great if someone can share whether they encountered similar inconsistencies in the wild and how they handled them.

Thanks @mbasmanova for finding this inconsistency. This originally came up as a follow up to https://github.com/facebookincubator/velox/issues/10338

mbasmanova commented 1 month ago

CC: @kaikalur @feilong-liu @tdcmeehan @aditi-pandit @amitkdutta @arhimondr @rschlussel

mbasmanova commented 1 month ago

CC: @Yuhta @kgpai @kagamiori

Yuhta commented 1 month ago

Keep things as they are. We embrace the inconsistencies.

I don't think this is an option. The result shown in the join example is already incorrect, not just inconsistent.

Another option is to mark functions like hour as non-deterministic, since they use information (the last 12 bits) that is not visible to equality on the input arguments; conceptually they are using some external information other than the input arguments to determine the output, which matches the definition of a non-deterministic function (i.e. ∃ a, b : a = b ∧ f(a) ≠ f(b))

kaikalur commented 1 month ago

Keep things as they are. We embrace the inconsistencies.

I don't think this is an option. The result shown in the join example is already incorrect, not just inconsistent.

Another option is to mark functions like hour as non-deterministic, since they use information (the last 12 bits) that is not visible to equality on the input arguments; conceptually they are using some external information other than the input arguments to determine the output, which matches the definition of a non-deterministic function (i.e. ∃ a, b : a = b ∧ f(a) ≠ f(b))

Marking them non-deterministic sounds like a good compromise

aditi-pandit commented 1 month ago

+1. Marking non-deterministic is a good compromise.

mbasmanova commented 1 month ago

I'm not sure it helps much to make the 'hour' function non-deterministic. This won't help with wrong join results. It would also require making cast(ts_w_tz as varchar), to_iso8601(ts_w_tz) and other expressions non-deterministic, which is strange.

Let's consider changing equality semantics to compare both timestamp and time zone, and mark this type as non-orderable instead. Users can always switch to to_unixtime(ts_w_tz) to work with just the timestamp.
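A sketch of that pattern, grouping on the UTC epoch rather than on the TS_w_TZ value itself (the epoch value follows from the example timestamps earlier in this issue):

with a as (select from_iso8601_timestamp(x) as ts
           from unnest(array['2024-01-01T04:01:02-04:00', '2024-01-01T01:01:02-07:00']) as t(x))
select to_unixtime(ts) as epoch, count(*) as n
from a
group by 1;
-- one group with n = 2: both values map to epoch 1704096062.0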

kaikalur commented 1 month ago

Once something is non-deterministic - there is no wrong result lol - only non-deterministic results

mbasmanova commented 1 month ago

Once something is non-deterministic - there is no wrong result lol - only non-deterministic results

Sure, but the join result will continue to be wrong even if 'hour' and other functions are made non-deterministic.

tdcmeehan commented 1 month ago

The problem with making timestamp with timezone not orderable is that these are supposed to represent points in time, which are orderable. I would also find it unexpected for two values to be considered distinct if they differ solely by the timezone information, but represent the same point in time. So I'm wondering if the error is with the hour function, which should instead be returning the hour in UTC.

See Postgres, which just returns the UTC hour (extract(hour ...) is equivalent to the hour function in Presto):

select extract(hour from '2024-01-01T04:01:02-04:00'::timestamptz),
       extract(hour from '2024-01-01T01:01:02-07:00'::timestamptz);
-- => 8, 8

The SQL spec also seems to suggest that the time represented is UTC (and this also implies equality semantics don't care about the timezone):

A datetime value, of data type TIME WITHOUT TIME ZONE or TIMESTAMP WITHOUT TIME ZONE, may represent a local time, whereas a datetime value of data type TIME WITH TIME ZONE or TIMESTAMP WITH TIME ZONE represents UTC.

Yuhta commented 1 month ago

@mbasmanova So even with a non-deterministic function, the optimizer will still move its evaluation to after the join? If that is the case, then marking as non-deterministic is not an option. Or we can fix that rule to only move deterministic functions.

bikramSingh91 commented 1 month ago

"Timestamp with Timezone" type contains two important parts, the timestamp and the timezone. If we were to ignore the timezone, then it would be more appropriate to use a simple Timestamp type and store the timestamps in UTC. However, since the "Timestamp with Timezone" type exists, it suggests that both parts are important and should be considered equally. Currently, there is inconsistent behavior in functions that operate on this type, with some considering the timezone and others not. Based on the discussion here we seem to have narrowed down to two options to address this inconsistency:

  1. Change all operations to equally consider both the timestamp and the timezone.
  2. Change all operations to ignore the timezone.

If we choose option 2, then it would be more straightforward to either get rid of the "Timestamp with Timezone" type or rename it to "Timestamp in UTC". On the other hand, if we want to preserve the "Timestamp with Timezone" type, then option 1 seems like the more consistent solution.

Also, another way of looking at the SQL spec is that it was designed to specify the timezone for timestamp storage, ensuring uniformity across implementations. However, the presence of the timezone component in the data type itself means that it simply stores the timezone, and is therefore self-explanatory and needs no further clarification.

mbasmanova commented 1 month ago

To add to @bikramSingh91's point, it is reasonable to assume that when a == a', it is OK to replace a with a' anywhere. If that's the case, then we can canonicalize TS_w_TZ values to always use UTC for the time zone. If we do that, then what's the point of storing the TZ in the value? We can simplify further and use a BIGINT value to store just the UTC timestamp.
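For illustration, such canonicalization is already expressible with the AT TIME ZONE construct (a sketch; the instant is unchanged, only the stored zone differs):

select from_iso8601_timestamp('2024-01-01T04:01:02-04:00') at time zone 'UTC';
-- 2024-01-01 08:01:02.000 UTC: the same point in time, re-tagged with the UTC zone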

kaikalur commented 1 month ago

To add to @bikramSingh91's point, it is reasonable to assume that when a == a', it is OK to replace a with a' anywhere. If that's the case, then we can canonicalize TS_w_TZ values to always use UTC for the time zone. If we do that, then what's the point of storing the TZ in the value? We can simplify further and use a BIGINT value to store just the UTC timestamp.

This is what we did in Metanautix, and I think Dremel does something similar as well.

tdcmeehan commented 1 month ago

In order to answer the question of why not just use timestamp, it might help to consider two different use cases: I am operating an HR system for a global corporation which notes times when people must come into the office, and I am collecting data from sensors across the globe and need to analyze information from these sensors.

For the HR system: I need to note when I expect workers to come into the office without a note explaining absence, and I need to note when workers punch in for work. In order to note the latest time when workers must come into the office, I must pick a time. However, I operate a system that is global, and instead of choosing a time per each region, I would probably prefer to pick a time without specifying the zone (e.g., I expect workers to punch in no later than 9am each day, regardless of timezone). This time would not be a point in time, because without the timezone, it may actually represent dozens of points in time when you enumerate all the timezones it could be associated with.

As workers come in across the world, we convert the expected time, which has no time zone and is not a point in time, to the timestamp with time zone of the worker’s local time (which is a point in time), and then see who came in later than 9am. In other words, I really want a timestamp that cannot be interpreted as a point in time, but can be converted to a point in time when necessary for comparison purposes. That’s the role that TIMESTAMP serves.

Now, let’s consider the system that collects sensor data from across the world. We need to build resilience around a thundering herd problem from the sensor data. In order to do this, we need to understand when the problem occurs, and from where the sensors are sending the data. We collect all the data into just one data center. Each datum from the sensor is associated with a point in time, meaning, there is just one moment that this occurred in across the whole world, even though there are dozens of timezones across the whole world.

Because there is just one data center, but the sensors are global, it’s important to order the moments that the sensors emitted data in order to discover when bursts of information occurred. If the data were not orderable, then we wouldn’t be able to sort these moments (for example, to visualize them on a graph). Consider if the sensors go into Kafka topics, and there is an outage or compaction—the ingestion time could differ significantly from the timestamp of the event, and you would be required to sort.

TIMESTAMP would not be appropriate for this use case, because even though we store the values as offsets from UTC, we always normalize them to the user's session time. This is because TIMESTAMP doesn't represent a point in time, so to give it this "timezone-less" quality, we always normalize it to UTC and then back to the local time, even though there may be a difference between the time of ingestion and the time of query. So you'd need to manually adjust this somehow. TIMESTAMP WITH TIMEZONE provides value here because its semantics losslessly preserve the source's timezone even when it differs from the query's timezone.

While the valid use cases for TIMESTAMP (aka TIMESTAMP WITHOUT TIMEZONE in the SQL spec) seem contrived or relatively rare, they do exist, and people may be relying on them. Likewise, the orderability property of TIMESTAMP WITH TIMEZONE is central to the utility of this type—we must be able to order these values regardless of timezone, because these are points in time, and ordering the timestamps is required to construct timelines of events.

Finally, why do we bother to store the timezone? The spec provides functions which allow you to query the timezone associated with the timestamp (e.g. EXTRACT (TIMEZONE_HOUR …). So you’d need to store it to support that.

Yuhta commented 1 month ago

@tdcmeehan For ordering TIMESTAMP WITH TIMEZONE, we can convert the values to UTC and then sort on UTC. There is no need to impose an order that is not well defined.

bikramSingh91 commented 1 month ago

@tdcmeehan Thank you for your detailed explanation of the use cases for TIMESTAMP WITH TIMEZONE. The scenarios you described are certainly valid, deserving support, and highlight the significance of this type.

What I was trying to highlight earlier is that while the comparison semantics you mentioned are appropriate within those specific scenarios, applying the same logic to other functions or operations involving this type exposes the inconsistencies that we have been discussing. What I am proposing is that, to address these inconsistencies, we need to establish semantics that consistently interpret this type across all scenarios, while ensuring they still support those use cases.

tdcmeehan commented 1 month ago

@bikramSingh91 agreed, and just to make it clearer: I'm arguing that given there are two options (change the equality and comparison semantics of the type, or change the functions to be consistent with the type), the former brings us out of line with the spec and other SQL systems, and also renders the type less useful, so in my view we should examine how to make the function semantics consistent. It seems there are two options: the one proposed by @Yuhta (make functions that alter their results based on the TZ non-deterministic), and simply converging on UTC for accessor functions.

tdcmeehan commented 1 month ago

By the way, for historical context and discussion, please see #7122 (discuss primarily TIMESTAMP, but also goes into rationale behind TIMESTAMP WITH TIMEZONE type).

Yuhta commented 1 month ago

@tdcmeehan As far as I know, TIMESTAMP WITH TIMEZONE in Presto is not the same as in any other system (they do not store a timezone with the data per row; they just use some global/table/column-specific timezone), which means there is no alignment with other systems in the first place.

The non-deterministic workaround needs to be combined with a fix to the join optimization rule to make the join result correct. There are probably other places as well that need fixes to keep us from moving non-deterministic functions around.

Yuhta commented 1 month ago

@tdcmeehan I read issue #7122 and see that TIMESTAMP WITH TIMEZONE in Presto is equivalent to ZonedDateTime in Java. But ZonedDateTime.equals compares both the UTC offset and the time zone associated with it: https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/19fb8f93c59dfd791f62d41f332db9e306bc1422/src/java.base/share/classes/java/time/ZonedDateTime.java#L2181-L2192

tdcmeehan commented 1 month ago

I know that Oracle supports this type in this way (the offset is included in the value), and Postgres just foregoes it by normalizing to UTC. But the property that I think is most important that is shared between them is orderability--both of these systems retain orderability of the type. And that's what I meant about divergent behavior, it would be a major behavior change to make this type not orderable (given that this type represents a point in time).

Comparisons of these types in Java are complicated and confusing. equals and compareTo compare objects (i.e. whether the object contents are equal, not whether they are semantically equal), but to check semantically whether two ZonedDateTimes are equal, you use isEqual and isBefore/isAfter. This would be the behavior we want in Presto: TIMESTAMP WITH TIME ZONE represents a point in time, which means two equivalent moments in different time zones are the same.

Yuhta commented 1 month ago

@tdcmeehan How does Oracle handle the optimization rules involving hour and the like? Do they mark such functions as non-deterministic and limit their movement? So far I think there are 3 options:

  1. Change the comparison of TIMESTAMP WITH TIME ZONE to compare both epoch and time zone (I think this one makes the most sense, personally)
  2. Change functions like hour to return the UTC part instead of the zoned time (zone data would be useless, wasted space in this case)
  3. Mark functions like hour as non-deterministic and fix the optimizer to limit the movement of such functions (kind of counter-intuitive)

tdcmeehan commented 1 month ago

It's not necessarily true that the zoned part is wasted space. Practically, there are some esoteric uses, outlined below:

I'll check on Oracle behavior when I'm back at the desk, but it's documented to be UTC. (Edit: confirmed with a sample query:

SELECT EXTRACT(HOUR FROM TIMESTAMP '2024-01-01 04:01:02 -04:00'),
       EXTRACT(HOUR FROM TIMESTAMP '2024-01-01 01:01:02 -07:00')
FROM DUAL;
-- => 8, 8

Oracle seems to parse the timestamp literal and infer that it's a TsTz.)

Yuhta commented 1 month ago

@tdcmeehan In that case cast, timezone_hour, and extract all need to be marked as non-deterministic, since they depend on the time zone.

Yuhta commented 1 month ago

So options 2 and 3 should be combined: a TIMESTAMP WITH TIME ZONE function is either non-deterministic or depends only on the UTC epoch. So the options are:

  1. Change the comparison of TIMESTAMP WITH TIME ZONE to compare both epoch and time zone, or
  2. Examine all functions that take TIMESTAMP WITH TIME ZONE as input and divide them into 2 groups:
    • those that operate on only the UTC part and ignore the time zone
    • the rest, which should be marked as non-deterministic, with the optimizer fixed to limit the movement of all non-deterministic functions

tdcmeehan commented 1 month ago

Agreed, and it also seems that option 1 makes a very poor tradeoff by sacrificing ordering to solve this consistency issue. I think option 2 is the more straightforward fix.

bikramSingh91 commented 1 month ago

So options 2 and 3 should be combined: a TIMESTAMP WITH TIME ZONE function is either non-deterministic or depends only on the UTC epoch. So the options are:

  1. Change the comparison of TIMESTAMP WITH TIME ZONE to compare both epoch and time zone, or
  2. Examine all functions that take TIMESTAMP WITH TIME ZONE as input and divide them into 2 groups:

    • those that operate on only the UTC part and ignore the time zone
    • the rest, which should be marked as non-deterministic, with the optimizer fixed to limit the movement of all non-deterministic functions

To add another data point, I looked at some other systems (not discussed so far):

Snowflake has the same comparison semantics as current Presto. I don't have a working instance of Snowflake, so I could not check the functions that operate on this type, and the documentation does not explicitly state anything either.

Spark also has the same semantics as current Presto. However, its documentation does mention function semantics similar to "Option 2":

The time zone offset of a TIMESTAMP WITH TIME ZONE does not affect the physical point in time that the timestamp represents, as that is fully represented by the UTC time instant given by the other timestamp components. Instead, the time zone offset only affects the default behavior of a timestamp value for display, date/time component extraction (e.g. EXTRACT), and other operations that require knowing a time zone, such as adding months to a timestamp.

BigQuery and MySQL do not have this type.

Looks like Postgres summed up their frustration with the inherent inconsistency in this type pretty well in this quote from their documentation (https://www.postgresql.org/docs/current/datatype-datetime.html#DATATYPE-DATETIME). This is probably why they chose to store everything in UTC:

The type time with time zone is defined by the SQL standard, but the definition exhibits properties which lead to questionable usefulness.

So far it seems like option 2 is more favorable as @tdcmeehan suggested.

mbasmanova commented 1 month ago

It might be tempting to come up with workarounds for specific scenarios, but let's make sure we get the fundamentals right. The fundamental property relevant here is that a == a' implies that a can be replaced with a' and a' with a anywhere. The TS_w_TZ type doesn't support this property, while many places in the optimizer and the engine assume it holds for all types.

For some use cases a TS_w_TZ value is interpreted as a point in time, while for other use cases it is a "point in time observed in a particular time zone". These definitions contradict each other and hence cannot co-exist. We need to pick one.

Assuming we do not want to make this property optional (i.e. introduce type metadata that tells whether it holds or not, and change all places that assume the property to stop doing so), we need to fix the type.

Assuming further that we want to support use cases which define the values as "point in time observed in a particular time zone", we need to fix equality to consider the time zone. Essentially, this type would be equivalent to a struct with 2 fields. Doing so would fix TS_w_TZ to hold the fundamental property discussed here, and would also continue to allow all existing use cases. The cases where values are just points in time would extract the unix epoch and use that for comparison, sorting, joining, grouping, etc.
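Expressed with existing functions, that struct-style equality might look like this (an illustrative sketch using to_unixtime plus timezone_hour/timezone_minute, not a proposed implementation):

select a = b as equal_today,
       (to_unixtime(a) = to_unixtime(b)
        and timezone_hour(a) = timezone_hour(b)
        and timezone_minute(a) = timezone_minute(b)) as equal_as_struct
from (values (from_iso8601_timestamp('2024-01-01T04:01:02-04:00'),
              from_iso8601_timestamp('2024-01-01T01:01:02-07:00'))) as t(a, b);
-- equal_today: true, equal_as_struct: false (same instant, different zone offsets)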

tdcmeehan commented 1 month ago

The cases where values are just points in time would extract the unix epoch and use that for comparison, sorting, joining, grouping, etc.

@mbasmanova do you mean the user would convert the TS_W_TZ to Unix epoch time as a double or bigint via to_unixtime?

mbasmanova commented 1 month ago

@mbasmanova do you mean the user would convert the TS_W_TZ to Unix epoch time as a double or bigint via to_unixtime?

@tdcmeehan Tim, yes, that would be a way.

tdcmeehan commented 1 month ago

While the argument that the defined semantics of this type break conventions held by the optimizer and eval makes sense, I don’t think that’s the place we should start from. Instead, we should start in the opposite direction: what is this type defined to be, and what are expectations from users? Then we can work backwards and evaluate how to solve the problem.

As I mentioned in my lengthy comment earlier, there are two time types defined by the standard and supported by Presto: TIMESTAMP [WITHOUT TIME ZONE] and TIMESTAMP WITH TIME ZONE. By definition, a TSw/oTZ is timezone-less, and hence not a point in time, whereas a TSw/TZ represents a point in time. So there is only one SQL standard-defined type to support a point in time.

Like all datetime types in the standard, they are comparable: "Items of type datetime are comparable only if they have the same <primary datetime field>s." Likewise the industry, when a distinction is made between TSw/oTZ and TSw/TZ, also ensures they are comparable. So it's reasonable to expect that users who choose Presto might also expect this type to retain orderability.

The argument to appeal to the spec and other systems isn’t just made for convention’s sake. The reason why, when the spec states something clearly, we would like to follow it is because other database systems tend to move in this direction. For example, Spark recently spent a lot of effort working on ANSI compliant syntax and semantics. Why did they invest so much effort into this initiative? Because other systems, such as data warehouses, hew to the spec, and this makes it easier to use Spark. Migrations to Spark become easier when the syntax is less divergent. The implication is Spark SQL prior to this initiative presented a barrier for entry for people who would like to use Spark. Presto has historically given deference to the spec, and I think we should continue that. It doesn’t mean we can’t innovate in the language where appropriate, but the bar should be really high for deciding to break with the spec.

Aside from the general concern I have from just ignoring the industry and the ANSI standard, I also have a few practical concerns:

1) Most systems that have the TSw/TZ type make it orderable. It is one matter to ask users to rewrite queries, but an entirely different matter to rewrite business intelligence tools, ORMs, graphing tools and systems like Ibis to add special handling for this proposed non-orderable TSw/TZ type. I worry it will present a long-lasting barrier to entry for Presto.

2) From the perspective of our direction of unifying the eval engine and providing a common core that is reusable across many other systems, it seems this non-orderable TSw/TZ type would make it that much harder to integrate with Velox, since it is so divergent from any other system we have enumerated in this issue.

3) Consider that the spec allows for optional precision to be added to the timestamp, for example, to support nanosecond-level precision. This means that in order to sort, users would have to infer the physical representation of the time in the timestamp, and it might not be feasible for high-precision timestamps. For example, we're already not conforming to the Iceberg spec by truncating microsecond timestamps to milliseconds, and V3 now supports nanosecond timestamps. Trino goes to picosecond timestamps. How would these be represented for ordering, given that they can't be represented in 64 bits? Timestamp won't work; that will represent everything in the local time zone.

4) Another problem with surfacing the physical representation of the timestamp for ordering purposes is that this type will most likely be expected to support type widening. For example, it might be reasonable to support widening a timestamp type from milliseconds to nanoseconds without having to rewrite previous data files. Yet how this would map to queries that use the Unix epoch time to sort a value is an unknown and likely a future source of frustration.

A final concern I have with treating TSw/TZ this way is that I just worry how intuitive it is to think of timestamps like this. It is elegant from a database engineering perspective, but cumbersome from the perspective of actually using the type. I suspect the additional friction from making this type not orderable would drive people to simply avoid it.

@mbasmanova, I think it's useful to distinguish between point-in-time and point-in-time observed in a particular time zone. Given the above, I think the more important use case for this type is the former, because there's no alternative type that supports it in the standard. Given the choice between these two definitions, I think we should prefer the former. So, my preference is to either:

1) Go the Postgres way, and just forego storage of the timestamp offset. This would make CAST and EXTRACT behave a little wonky, but we would be in good company with a well-reputed and popular open source database, and we can document this as they have. I also think this approach is still compatible with the spec, although without documentation it's not intuitive.

2) Take a more exhaustive look at what @Yuhta mentioned for option 2: take stock of the functions that use the timezone, look into the optimizer and see how we can faithfully preserve optimizations that leverage these functions (perhaps by marking them non-deterministic and seeing why we're failing there).

Yuhta commented 1 month ago

The cleanest way to represent TIMESTAMP WITH TIME ZONE is ROW<epoch:BIGINT, timezone:???>; then we can order it the same way as a ROW type.

tdcmeehan commented 1 month ago

@Yuhta would the timezone component be orderable? I thought the issue was that these are inherently non-orderable, since they represent political boundaries which can't be ordered.

mbasmanova commented 1 month ago

I think it’s useful to distinguish between point-in-time vs. point-in-time observed in a particular time zone. Given the above, I think the more important use case this type has is the former, because there’s no alternative type that supports this in the standard. Given the choice between these two definitions, I think we should prefer the former.

@tdcmeehan Tim, I don't have a strong opinion on which definition is better. I'm OK with either one as long as we are not trying to use a single type to represent both.

Go the Postgres way, and just forego storage of the timestamp offset.

That works.

I agree that we should not deviate from the SQL spec without a really good reason. However, my impression is that the spec might be buggy in this case. I can't believe the spec defines a type that violates the fundamental principle I described earlier. BTW, I do not see the TS_w_TZ type defined in Spark; it seems to be specific to Presto. Also, why can't we just order time zone IDs or names?

tdcmeehan commented 1 month ago

We can sort by the offset from UTC. But this would lead to the same behavior noted by @bikramSingh91: if we want ordering to be consistent with equality, and we begin to fix equality to compare the timezones, then timestamps are no longer points in time but rather points in time in a particular timezone.

If we were to ensure ordering based off of something like the text of the time zone, it would lead to strange behavior. For example, I have a time travel query in Iceberg, and I want the version of the table for time X in UTC, but in the time travel query I include a timezone that is ordered after UTC.

e.g.:

SELECT * FROM ctas_nation FOR SYSTEM_TIME BEFORE TIMESTAMP '2023-10-17 17:29:46.822 UTC'
-- an event happened at this time, and I need to ensure that I only look at the version of the table prior to that time

SELECT * FROM ctas_nation FOR SYSTEM_TIME BEFORE TIMESTAMP '2023-10-17 13:29:46.822 America/Los_Angeles'
-- this includes the problematic time, even though it's the same point in time

To avoid this problem, we need to know how Iceberg stores timestamps, and normalize for that, which is cumbersome.

There are similar concerns on filtering--you would need to know the timezone of the event you want to filter on in order to yield an accurate filter.

mbasmanova commented 1 month ago

timestamps are no longer points in time but rather points in time in a particular timezone.

This is the core issue. We need to first decide what the definition of the TS_w_TZ type is. I believe we can't have it both ways.

Yuhta commented 1 month ago

@tdcmeehan Why can't we just do SELECT * FROM ctas_nation FOR SYSTEM_TIME.epoch BEFORE TIMESTAMP '2023-10-17 17:29:46.822 UTC'.epoch? That's way better IMO.

tdcmeehan commented 1 month ago

@Yuhta periods are defined in the spec as well, so it goes back to doing things our own custom way or doing things according to a well-known specification. While time travel/periods have a lot of custom syntax among other engines, because it is defined in the spec, we expect that it will converge over time, and by and large other engines are converging toward using timestamps to define the system time period. Timestamp with timezone unambiguously defines a point in time already, so there's not a whole lot of added value in asking users to convert to epoch time.

Yuhta commented 1 month ago

@tdcmeehan As you already pointed out, the definition of order on TS_w_TZ is ambiguous and underspecified. Rather, make it clean by converting TS_w_TZ values to their equivalence-class representative (epoch), instead of imposing the equivalence-class order on TS_w_TZ itself, which makes a lot of functions behave weirdly (non-deterministically).

tdcmeehan commented 1 month ago

I mentioned that the definition of order on timezones is ambiguous, but we don't order timezones in the spec. The spec defines TSw/TZ to be a point in time, it represents UTC, and it is orderable, so there is enough clarity in the spec for the purposes of table period syntax. What's ambiguous are the properties of EXTRACT (and analogous functions such as hour()) and CAST (am I missing any others?).

Yuhta commented 1 month ago

If we must be aligned with ANSI, I think there is a third option: store only the UTC epoch in TS_w_TZ and let extract and cast take a timezone as an extra parameter. This way it is fully ANSI-compliant (TS_w_TZ means point in time), there are no ill-defined functions, and that is the way everyone out there except Oracle is doing it.
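Something to this effect is already expressible with today's operators (an illustrative sketch; the zone used for extraction is supplied at the call site rather than read from the value):

select hour(from_iso8601_timestamp('2024-01-01T04:01:02-04:00') at time zone 'UTC');
-- 8: the instant is unchanged; the zone used for extraction is passed in explicitly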

tdcmeehan commented 1 month ago

I think the planner bug is not as described in the original issue. If you’ll notice, the hour() function is consistent with the value itself.

 ts                             | _col1 | ts_(1)                         | _col3 
--------------------------------+-------+--------------------------------+-------
 2024-01-01 04:01:02.000 -04:00 |     4 | 2024-01-01 04:01:02.000 -04:00 |     4 
 2024-01-01 04:01:02.000 -04:00 |     4 | 2024-01-01 04:01:02.000 -04:00 |     4 
 2024-01-01 01:01:02.000 -07:00 |     1 | 2024-01-01 01:01:02.000 -07:00 |     1 
 2024-01-01 01:01:02.000 -07:00 |     1 | 2024-01-01 01:01:02.000 -07:00 |     1 

hour()=4 only shows up in tuples where it should equal 4, and hour()=1 only shows up in tuples where it should equal 1. The problem is that the timestamp values are not being joined correctly. You'll notice that in the table above we should also be getting every permutation of the timestamps, but instead we have only two permutations.

Oracle is able to do this correctly (although you'll note that they too normalize the hour() function to UTC):

[screenshot: Oracle query output showing all four join permutations, with hour() normalized to UTC]

I thought it was important to highlight this because this discussion has leaned on that finding from the original issue, but I don't think it's really the problem. The problem is evident in the plan. Run EXPLAIN on the query and note the join:

InnerJoin[PlanNodeId 16][("from_iso8601_timestamp" = "from_iso8601_timestamp_8")][$hashvalue, $hashvalue_29] => [from_iso8601_timestamp:timestamp with time zone]

i.e., the output is just one assignment, the probe's TS column. But change the join condition to join on the UTC epoch time, and you'll see a different plan, regardless of whether or not you use any of the nondeterministic accessor functions on the timestamps:

InnerJoin[PlanNodeId 16][("to_unixtime" = "to_unixtime_54")][$hashvalue, $hashvalue_56] => [from_iso8601_timestamp:timestamp with time zone, field_0:varchar(1), from_iso8601_timestamp_16:timestamp with time zone, field_12:varchar(1)]

Both timestamp columns show up. So we can more accurately describe the bug mentioned in the original issue as: the optimizer assumes that two values which are equal are also encoded/represented the same way, i.e. equality between two values means that they are identical. Making this assumption, the optimizer makes the plan more efficient by choosing only the probe side for the assignment. The problem is that this assumption holds true for every type in Presto right now except TSw/TZ.
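For reference, the epoch-keyed variant of the original self-join (the kind of query that produces the second plan above) would look like this:

with a as (select from_iso8601_timestamp(x) as ts from unnest(array['2024-01-01T04:01:02-04:00', '2024-01-01T01:01:02-07:00']) as t(x))
select a.ts, hour(a.ts), b.ts, hour(b.ts) from a, a as b where to_unixtime(a.ts) = to_unixtime(b.ts);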

I'm not proposing a solution here yet, I just thought it was important as I think we kind of followed a red herring.

mbasmanova commented 1 month ago

@tdcmeehan Tim, thank you for highlighting this. I thought this was obvious, but it probably wasn't. That's why I'm trying to focus on the fundamental property that is assumed throughout the engine (including the optimizer) and that does not hold for the TS_w_TZ type. I feel it would be impractical to remove this assumption; hence, I suggest that we change the type to provide this property.

tdcmeehan commented 1 month ago

@mbasmanova I believe this bug was introduced by https://github.com/prestodb/presto/commit/f6c22523a6e007460d74cffd2ddb6c2f81f939d9. I'm wondering if you can cite other examples of the optimizer or eval relying on equality == identity? I just want to make sure we are not overestimating the cost before we choose a user facing behavior change to fix the issue.

mbasmanova commented 1 month ago

@tdcmeehan Tim, I do not have specific examples. It just seems such a basic assumption that I expect it is being implicitly made in multiple places.

Yuhta commented 1 month ago

@tdcmeehan Good finding. So this has nothing to do with functions, and tweaking function metadata no longer works here.

Regarding equality vs. identity: even if it is not leveraged widely in the current code base, it would be hard to prevent future work from making this assumption.

aaneja commented 1 month ago

I just thought it was important as I think we kind of followed a red herring.

There is a bug with how non-deterministic functions are handled in the planner too, which is what I highlight in https://github.com/prestodb/presto/issues/23318

Fixing this bug is needed as well

rschlussel commented 1 month ago

Sharing a summary of what the SQL 2023 spec has to say about timestamp with timezone and how Presto fits in with it (this is the same as what I shared internally, with minor copy edits).

Summary of the SQL 2023 Specification

All references here are to the 2023 spec.

1) A timestamp with time zone has two components: the "datetime primary", which is a time in UTC, and a timezone offset (see 4.72 Datetimes).

2) To subtract one timestamp with time zone from another, you first convert the timestamps from their original timezones to the session's local timezone and cast to a timestamp without timezone (CAST ( DT AT LOCAL AS TIMESTAMP(MSP) WITHOUT TIME ZONE )). You then convert the timestamps to integer values as displacements from some implementation-dependent start datetime and subtract the two to get the resulting interval (from section 6.44 General Rules 7 (a.ii, b.ii, c, d)).

Rebecca's comment: I think it's done this way to handle subtracting a timestamp without timezone from a timestamp with timezone, but it's a very weird thing to do, and as Jimmy noted, you can have two UTC times map to the same local time, which makes the semantics here bizarre. Presto currently compares timestamps with timezones by getting the UTC milliseconds and comparing. I assume other DBs do the same, but I haven't checked.

3) To compare two timestamps with timezones (equality, greater/less than, distinctness, sorting, etc.), you subtract one from the other and see whether the resulting interval is 0, >0 or <0 (from section 8.2 under General Rules 6).

4) Two timestamps with timezones can be equal without being identical. For example, as noted above, the comparison formula doesn't consider the timezone, so timestamps can be considered equal even if they have different time zones. Two identical timestamps with timezones would have both the same UTC time and the same timezone offset (section 9.10 Determination of identical values). This results in some non-determinism for operations that use a timestamp with timezone as a key (see section 9.16 Potential sources of non-determinism), for example distinct or group by: if you have two equal but not identical timestamps, you don't know which one will be the group-by representative or the distinct value.

 For further evidence that this is what the spec means, you can look at what the spec says about character strings. Character strings, like timestamps with time zones, have a special definition of identical in section 9.10 that is different from not distinct. They are also consistently listed along with datetime with timezone in the lists of potential sources of non-determinism in section 9.16. That two values can be equal but not identical is explicitly spelled out for character strings quoted below (from Section 8.2 <comparison predicate> General Rules 3d):
 > Depending on the collation, two strings may compare as equal even if they are of different lengths or contain different sequences of characters. When any of the operations MAX, MIN, and DISTINCT reference a grouping column, the GREATEST and LEAST functions, and UNION, EXCEPT, and INTERSECT operators refer to character strings, the specific value selected by these operations from a set of such equal values is implementation-dependent (UV090). 

5) extract function - extract is a SQL standard function for extracting the value of a component of a datetime type (Presto also has some convenience extraction functions, like hour, that are not part of the standard). It is defined in section 6.31 syntax rules 6 and 7 and general rule 4. You can extract a primary datetime field, or the time zone field if you are using a datetime with timezone type. According to the spec, if you are extracting a primary datetime field (any part of a timestamp that isn't the timezone), then you get it directly from the primary datetime field of the type you are using. It's a bit vague, but based on how timestamps are talked about elsewhere, I suspect this means that for a timestamp with timezone you would get the UTC hour, not the local hour in that timezone. Note: this is different from what Presto currently does, which is to return the hour in the specified timezone (see the sketch after this list). Whatever we choose to do, we should ensure that extract and the convenience extraction functions have the same semantics.

6) Casting from TIMESTAMP WITHOUT TIMEZONE to TIMESTAMP WITH TIMEZONE and vice versa is very clearly defined in 4.7.2 table 3 "Datetime data type conversions". For timestamp without timezone to timestamp with timezone you use the session timezone as the timezone: TV.UTC = SV − STZD; TV.TZ = STZD. For TIMESTAMP WITH TIMEZONE to TIMESTAMP WITHOUT TIMEZONE, you add the timezone offset to the UTC value to get the local time at the specified timezone: SV.UTC + SV.TZ. Note that Presto only behaves correctly here when legacy_timestamp=false, which is NOT the default.

7) Casting from TIMESTAMP WITH TIMEZONE to a character string (e.g. VARCHAR) is also well defined. In the default case it should look the same as the literal you would use to create a timestamp with timezone (e.g. '2024-07-31 10:31:00.000 America/New_York'). You can alternatively provide a template for what you want the datetime string to look like. (see section 6.13 General rules 11e and 12e)
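As referenced in item 5, a sketch of the difference (values taken from the examples earlier in this issue; the UTC reading matches what Postgres and Oracle were shown to return):

select hour(from_iso8601_timestamp('2024-01-01T04:01:02-04:00'));
-- Presto today: 4 (the hour in the value's own zone); under the spec's UTC reading this would be 8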

How does Presto fit into these requirements

Presto's semantics generally follow the SQL spec when legacy_timestamp is false, except for the following cases.

Note that Presto currently has legacy_timestamp enabled by default, which has very incorrect semantics https://github.com/prestodb/presto/issues/7122. We should consider changing the default to match the sql standard.

There was also a question about whether functions like hour on timestamp with timezone should be marked as non-deterministic. These aren't traditionally non-deterministic: for identical input they produce deterministic output. But it is true that for equal-but-not-identical input they can produce different output. It's also true that some basic operations on timestamps that may come earlier in the plan (e.g. distinct) can be non-deterministic, in that they can produce equal-but-not-identical output for the same input. So it may be beneficial to mark timestamp with timezone functions as non-deterministic to prevent them from being pushed down too much. I would do whatever is most pragmatic here.

tdcmeehan commented 1 month ago

I opened PR #23319, which addresses the join bug; I would appreciate any feedback.