substrait-io / substrait

A cross platform way to express data transformation, relational algebra, standardized record expression and plans.
https://substrait.io
Apache License 2.0
1.11k stars 147 forks source link

fix: include precision information in PrecisionTimestamp and PrecisionTimestampTZ literals #659

Closed Blizzara closed 5 days ago

Blizzara commented 1 week ago

BREAKING CHANGE: changes the message type for Literal PrecisionTimestamp and PrecisionTimestampTZ

The PrecisionTimestamp and PrecisionTimestampTZ literals were introduced in #594, and there was some discussion about their contents in https://github.com/substrait-io/substrait/pull/594#discussion_r1471844566. In their current form, they only contain a i64 value, which I believe is meant to be a precision-dependent number of fractional seconds since epoch. However, the literals don't contain the precision itself, so it's impossible to interpret a PrecisionTimestamp or PrecisionTimestampTZ literal without other information. This is in contrast to e.g. varchar, whose literal does specify the length, or decimal, whose literal specifies scale and precision. @westonpace curious for your thoughts since you were part of that original discussion - am I missing something or is this a bug?

This PR changes the Literal types for PrecisionTimestamp and PrecisionTimestampTZ to contain a PrecisionTimestamp message instead of a i64. That message then contains the i64 value as well as the i32 type. This is a breaking change, but given in their current form these literals aren't usable (unless I'm missing something), would that be okay?

Currently I used the same message for both PrecisionTimestamp and PrecisionTimestampTZ, but I can also make a copy for PTTZ if that'd be better for forwards-compatibility.

richtia commented 1 week ago

The epoch i64 value itself should be enough to indicate the precision. Any additional digits after seconds will be for fractional seconds (and used to determine precision).

      // If the precision is 6 or less then this is the microseconds since the UNIX epoch
      // If the precision is more than 6 then this is the nanoseconds since the UNIX epoch

Does that make the comment from Weston more clear?

Blizzara commented 1 week ago

The epoch i64 value itself should be enough to indicate the precision. Any additional digits after seconds will be for fractional seconds (and used to determine precision).

      // If the precision is 6 or less then this is the microseconds since the UNIX epoch
      // If the precision is more than 6 then this is the nanoseconds since the UNIX epoch

Does that make the comment from Weston more clear?

Not really, could you ELI5 how you get the precision from the i64? đŸ˜… ie how do you know which part of it indicates seconds?

richtia commented 1 week ago

Not really, could you ELI5 how you get the precision from the i64? đŸ˜… ie how do you know which part of it indicates seconds?

For example

1720200424 -> Friday, July 5, 2024 5:27:04 PM
1720200424123 -> Friday, July 5, 2024 5:27:04.123 PM (3 - millisecond precision)
1720200424123456 -> Friday, July 5, 2024 5:27:04.123456 PM (6 - microsecond precision)
1720200424123456789 -> Friday, July 5, 2024 5:27:04.123456789 PM (9 - nanosecond precision)

We stuck with either microsecond or nanosecond because that seems to be what most engines are using.

richtia commented 1 week ago

Makes sense. This should make it easier for users to understand. I imagine a few others would have eventually had confusion here as well

Blizzara commented 1 week ago

Thoughts on whether there should be one message (PrecisionTimestamp) shared across both, or two messages (PrecisionTimestamp and PrecisionTimestampTZ)? The types have separate messages, so maybe using separate messages here as well would be better to keep things aligned? But I dunno if it matters.

westonpace commented 1 week ago

Thoughts on whether there should be one message (PrecisionTimestamp) shared across both, or two messages (PrecisionTimestamp and PrecisionTimestampTZ)?

I think one message is fine. E.g. we use i64 for int64 and time (and the legacy timestamp). We use bytes for binary, fixed_binary, and uuid. The reader / writer can always distinguish easily between timestamp and timestamp_tz using the message code (34/35)

westonpace commented 1 week ago

This is a breaking change so I think we'll need 2 SMC approvals. @EpsilonPrime @vbarua @jacques-n