substrait-io / substrait

A cross platform way to express data transformation, relational algebra, standardized record expression and plans.
https://substrait.io
Apache License 2.0
1.19k stars 155 forks source link

fix: use int64 instead of uint64 for PrecisionTimestamp(Tz) literal value #668

Closed Blizzara closed 3 months ago

Blizzara commented 3 months ago

This allows timestamps to refer to time before epoch, and aligns with other systems (Spark, DataFusion/Arrow, DuckDB, Postgres, Parquet at least)

BREAKING CHANGE: PrecisionTimestamp(Tz) literal's value is now int64 instead of uint64

In #659 I created this PrecisionTimestamp message to include the precision, but unfortunately copied the value's type as-is, not realizing the unsignedness is a problem.

jacques-n commented 3 months ago

I think we should probably update content here: https://substrait.io/types/type_classes/

So that it is very clear the valid range of timestamps. Right now it is much more vague than other types.

Blizzara commented 3 months ago

I think we should probably update content here: https://substrait.io/types/type_classes/

So that it is very clear the valid range of timestamps. Right now it is much more vague than other types.

Would you have a proposal? Happy to include in this PR if so, otherwise I’d suggest doing it as followup to get this break in as soon as possible.

(Though I see I need to update the page you linked for uint->int, thanks!)

Blizzara commented 3 months ago

@westonpace @jacques-n et al, any further comments or would this be good to go? (I'd like to get it in next release if possible..)

westonpace commented 3 months ago

So that it is very clear the valid range of timestamps. Right now it is much more vague than other types.

I seem to recall this being intentional. The PR to introduce precision timestamps originally stated A naive timestamp within [1000-01-01 00:00:00.000000000..9999-12-31 23:59:59.999999999] but this is not possible with nanosecond resolution on systems that store timestamps as 8 byte integers (some store them as a pair of 8 byte integers for second / subsecond). Rather than make the range based on the precision we decided to just be vague about the range.