substrait-io / substrait

A cross platform way to express data transformation, relational algebra, standardized record expression and plans.
https://substrait.io
Apache License 2.0
1.15k stars 150 forks source link

Support for an interval type covering both months and (sub)seconds #664

Closed Blizzara closed 3 weeks ago

Blizzara commented 1 month ago

Substrait currently defines two interval types, IntervalYearToMonth which contains a number of years and months, and IntervalDayToSecond.

Neither of these types is capable of expressing full intervals that combine years, months, days, minutes, seconds, etc. (For YearToMonth that's probably clear; for DayToSecond one could in theory use large numbers for days but given the number of days in a month/year is not unique, there is a semantic difference between an interval of "40 DAYS 1 MINUTE" and "1 MONTH 10 DAYS 1 MINUTE"

Would you be open to adding an third interval type that combines three numbers: months (i32), days (i32), (micro)seconds (i32 for micro or decimal for seconds)?

westonpace commented 1 month ago

Should we make the subsecond precision configurable? Or we could just use nanoseconds (there is not a "range problem" since this only needs to store up to 1 day's worth). If we choose microseconds then we still aren't compatible with Arrow / DataFusion.

We also need to define the range of the data type. The year/month interval has a range of -10,000 to 10,000 years so I think we could just reuse that (should be doable with a u32 month).

vbarua commented 1 month ago

I've looked into this previously. Posting a summary of what I found in terms of engine behaviours below. One of the big challenges I've found around this was the semantics of various arithmetic operations with interval types. Having a combined interval type more closely matches some engines, but then the actual permitted opertaions with those intervals and the return types can vary wildly.

Intervals In the Wild

SQL Spec

The SQL 98 spec defines two categories of intervals: Year-month intervals (i.e. YEAR, MONTH) Day-time intervals (i.e. DAY, HOUR, MINUTE, SECOND)

Year-month intervals are only comparable to other year-month intervals. Day-time intervals are only comparable to other day-time intervals.

Operations involving values of type interval (i.e. +) require that the interval types be mutually comparable. This means that per the spec you cannot add year-month intervals to a day-time intervals as they are not comparable.

MySQL

MySQL does not have an interval type, however specific operations allow for declarations of interval literals. For example DATE_ADD:

SELECT DATE_ADD('2018-05-01', INTERVAL 1 DAY);
> 2018-05-02

sqlite

sqlite does not have an interval type, however specific functions can specify interval-like modifiers:

SELECT datetime('now', '+5 years', '-90 seconds');

Trino

Trino has two interval types:

Which cannot be combined:

SELECT INTERVAL '1' MONTH + INTERVAL '1' HOUR
> Cannot apply operator: interval year to month + interval day to second

Snowflake

Snowflake does not have an interval column type but does allow constant intervals. These intervals can combine year-month and day-time values. Note though that in Snowflake the individual values are added or subtracted in the order listed. So:

SELECT TO_DATE ('2019-02-28') + INTERVAL '1 day, 1 year';
> 2020-03-01
SELECT TO_DATE ('2019-02-28') + INTERVAL '1 year, 1 day';
> 2020-02-29

return different values, in this case due to leap year differences.

Postgres

Postgres has a single interval type which allows for combinations of year-month and day-time intervals.

SELECT INTERVAL '1 MONTH' + INTERVAL '2 DAY' + INTERVAL '1 SECOND'
> {"months":1,"days":2,"seconds":1}

This does not technically conform to the spec.

DuckDB

DuckDB has a single interval type which allows for combinations of year-month and day-time intervals.

SELECT INTERVAL 1 YEAR + INTERVAL 1 DAY;
> 1 year 1 day

This does not technically conform to the spec.

Arrow

Arrow has 3 interval types: INTERVAL_MONTHS INTERVAL_DAY_TIME INTERVAL_MONTH_DAY_NANO

Which effectively maps to year-month intervals, day-time intervals and then a combined interval type.

westonpace commented 1 month ago

One of the big challenges I've found around this was the semantics of various arithmetic operations with interval types. Having a combined interval type more closely matches some engines, but then the actual permitted opertaions with those intervals and the return types can vary wildly.

It looks like Postgres, DuckDb, and Datafusion all have this combined interval type. Are there semantic differences between the three of those? If not, then I think this might be worth pursuing.

vbarua commented 1 month ago

Are there semantic differences between the three of those?

I was mostly comparing day-time and year-month operation differences to those allowed on a combined interval type, and not so much differences between engines on combined interval types. One difference I did find though was between Postgres and Datafusion, when adding intervals to dates.

In Postgres:

SELECT pg_typeof(DATE '2001-08-22' + INTERVAL '4' MONTH);
> timestamp without time zone

In Datafusion (via datafusion-cli)

SELECT arrow_typeof(DATE '2001-08-22' + INTERVAL '4' MONTH);
> Date32
westonpace commented 1 month ago

One difference I did find though was between Postgres and Datafusion, when adding intervals to dates.

@vbarua

That is subtle 😖

At the moment I'm still slightly in favor of adding the new type class but that might just be my Arrow bias speaking.

EpsilonPrime commented 1 month ago

The current two intervals can be combined to provide a more complete datetime interval but the omission of subsecond time is a notable omission. If the purpose is for a complete datetime interval the lack of years seems like it might be an issue (although one could combine the existing interval with year to get the complete picture). Ideally the intervals that are defined cover what time/date changes that are needed. This is unlikely to be a complete datetime interval as the precision of one year and one nanosecond is nonsensical especially given the impreciseness of a year. A month to nano seems too wide of a definition to me especially when the option to combine with a calendar interval is still available. It's probably too late given the landscape of implementations out there but an interval for time including the missing precision seems like the appropriate missing abstraction.

Blizzara commented 1 month ago

One difference I did find though was between Postgres and Datafusion, when adding intervals to dates.

@vbarua

That is subtle 😖

That is a difference in the behavior of the add:date_imonth function though, not the intervals themselves, right?

The four systems I looked at (Spark, Arrow/DF, Postgres, DuckDB) all store these intervals as a combination of three numbers (months, days, some version of seconds). Actually parquet as well. There are differences in how the second field is treated (a precision of 6 decimal in spark, nanosecs i64 in Arrow/DF, milliseconds i32 in Parquet, microseconds i64 in DuckDB. There's also difference in comparability: some note that the intervals are not comparable, some normalize the values (using e.g. 1 month = 30 days rule) to allow comparison.

Other than the comparisons, I'd say the type itself is quite well-defined. Functions using it might differ in behavior, but that's true for every other type as well.

I started a PR to add the type in https://github.com/substrait-io/substrait/pull/665

Blizzara commented 1 month ago

@EpsilonPrime

The current two intervals can be combined to provide a more complete datetime interval

When using substrait as a first-class entity, perhaps. But when using Substrait to represent existing systems, it doesn't work, since functions need a single return type. I guess one could return a struct combining the two intervals, but that would be quite complicated for both consumers and producers and isn't really a pattern that'd be used anywhere else.

omission of subsecond time is a notable omission

FWIW, the existing IntervalDayToSecond includes subsecond (microsecond) time.

the lack of years seems like it might be an issue

A year can always* be described as 12 months. Thus the month-day-nano interval can cover a range of +- 178 956 970 years which should be quite enough. It doesn't necessarily allow recreating the exact command a user typed (INTERVAL 1 YEAR 3 MONTHS) would be identical to (INTERVAL 15 MONTHS), but I guess that's not a necessary goal anyways.

This is unlikely to be a complete datetime interval as the precision of one year and one nanosecond is nonsensical especially given the impreciseness of a year. A month to nano seems too wide of a definition to me

While one can argue that using years and nanoseconds in the same interval is probably not very useful, the combined interval type also allows to say eg. "1 MONTH 1 DAY" which you cannot do with either of the existing interval types. Like you note, it is also just in practice the type that's being used out there for many of the common systems, so lacking it makes it impossible/very hard to support intervals correctly in those systems.

(* with maybe the exception of times when calendar was switched hundreds of years ago)

vbarua commented 1 month ago

It's probably too late given the landscape of implementations out there

This was my assessment as well. There are systems that combine year-month intervals and day-time intervals into a single type (Postgres, DuckDB) so it makes sense to model these kinds of intervals as first-class entities.

That is a difference in the behavior of the add:date_imonth function though, not the intervals themselves, right?

You are correct. The thing that I've found though is that the combined interval type is extremely hard to reason about in terms of behaviour, which you've also noted:

There's also difference in comparability: some note that the intervals are not comparable, some normalize the values (using e.g. 1 month = 30 days rule) to allow comparison.

In this case adding the type is easy, but the follow-up work of defining the "correct" behaviour will be challenging because different systems have different opinions about what correct means.

The combined interval type is also awkward because while you can always go from a year-month or day-time interval to a combined interval type in a lossless manner (I think), going the other way is not so. Let's say that you're going from a system with a combined interval type

SELECT DATE '2019-02-28' + INTERVAL '1 YEAR 1 DAY'

to one that does not. Which of the following two is the equivalent query:

SELECT DATE '2019-02-28' + INTERVAL '1 YEAR' + INTERVAL '1 DAY'
SELECT DATE '2019-02-28' + INTERVAL '1 DAY' + INTERVAL '1 YEAR'

Fun fact, the order matter here. See the Snowflake docs for details and/or this datafusion issue.

This is a long-winded way of saying that I think a combined interval type makes sense by virtue of it existing in the wild, but I am unhappy with it existing in general, it makes me sad to try and reason about the many possible behaviours and we have our work cut out for us defining semantics for it.

Bonus Cursed Thing

Fractional intervals in Postgres

SELECT INTERVAL '1' SECOND / 3;
> {"milliseconds":333.333}
Blizzara commented 1 month ago

The combined interval type is also awkward because while you can always go from a year-month or day-time interval to a combined interval type in a lossless manner (I think), going the other way is not so.

While that's true when designing a new system (which I think is what you meant), it sadly doesn't mean that existing things using combined interval could be represented through the year-month and day-time intervals. (Ie. there's no way to represent Spark's make_interval using current Substrait w/o a user-defined type.)

This was my assessment as well. There are systems that combine year-month intervals and day-time intervals into a single type (Postgres, DuckDB) so it makes sense to model these kinds of intervals as first-class entities.

In this case adding the type is easy, but the follow-up work of defining the "correct" behaviour will be challenging because different systems have different opinions about what correct means.

This is a long-winded way of saying that I think a combined interval type makes sense by virtue of it existing in the wild, but I am unhappy with it existing in general, it makes me sad to try and reason about the many possible behaviours and we have our work cut out for us defining semantics for it.

That's fair. FWIW, I think I can solve our need using user-defined types, so I'm not blocked on this.

Bonus Cursed Thing

Fractional intervals in Postgres

SELECT INTERVAL '1' SECOND / 3;
> {"milliseconds":333.333}

Works in Spark too :)

spark-sql (default)> select interval '1 second' / 3;
0 00:00:00.333333000

Interestingly, Spark's INTERVAL syntax doesn't allow mixing types, so I guess it creates the YearMonth or DayTime intervals. make_interval however mixes things happily, and also allows division. Seems like with a combined interval, it counts years and months together, rounded down to months, and then days-to-seconds together:

spark-sql (default)> select make_interval(2, 0, 0, 20, 0, 0, 0) / 20.2;
1 months 23 hours 45 minutes 44.554455 seconds

DF refuses the division, but I guess it wouldn't be hard to implement it.

vbarua commented 1 month ago

while you can always go from a year-month or day-time interval to a combined interval type in a lossless manner (I think)

I thought wrong 😅 There are some hacks you can do to decompose combined interval literals into their component year-month and day-time equivalents (which I've looked at previously), but you can't do that for dynamic combined intervals (i.e columns of that type)

jacques-n commented 1 month ago

NOTE: I've written several responses before posting this. As I thought more and more about it, I'd like to propose we approach this type differently given how different the behavior is between different systems:

My recollection is the SQL spec actually spent time against this and realized that trying to figure out what 1 month + 1 day was difficult enough that they decided to avoid supporting it. That's why there are only two types in the SQL specification (same as the current status of substrait).

I agree that systems have added other types and thus Substrait needs to support it somehow (propagate the mistake). I think the problem is that the semantics of those types are very poorly defined when a calculation is done. I've seen systems that will immediately take the interval and boil it down using standard conversions (e.g. 30 days = 1 month) and system that maintain full resolution.

  1. Enhance IntervalDayToSecond to have a precision which defaults to 0 but can be 0..9 (or create new/deprecate)

This seems like a net positive that aligns with our changes to timestamp, etc.

  1. Introduce types that are specific to systems as opposed to one general type. And we do this using the type extension system as opposed to putting directly in protobuf. For example spark_calendarintervaltype, arrow_monthdaynano, etc.

I started marking up the attached PR but felt like it was better to talk about overall direction here.

Note, I'm not demanding the changes above (and I know I'm late to the party). Just trying to think through how we can be supportive of different systems in a system-agnostic way that is also clearly specified.

jacques-n commented 1 month ago

Another proposal...

  1. Enhance IntervalDayToSecond to have a precision which defaults to 0 but can be 0..9 (or create new/deprecate)
  2. Create IntervalCompound which is a combination of the fields of IntervalYearToMonth and IntervalDayToSecond

It's a slight adjustment to what is proposed. Two main differences:

jacques-n commented 1 month ago

The more I think about this, the more I'm inclined to go with my second proposal for the reasons above. Makes precision clear for different engines and makes it clear that this is fundamentally a compound version of the other two.

@westonpace and @vbarua , you good with that suggestion?

westonpace commented 1 month ago

Would a microsecond precision system reject a plan with a nanosecond precision literal (even if the nanoseconds value was 0)?

jacques-n commented 1 month ago

Would a microsecond precision system reject a plan with a nanosecond precision literal (even if the nanoseconds value was 0)?

If we assume a consumer is trying to accurately execute a plan, it might be happy to do a cast of intervalcompound<9> to intervalcompound<6> or intervaldaysecond<9> to intervaldaysecond<6> when the nanoseconds are all zero. Just like a system might be happy to cast a precision_timestamp<3> literal to precision_timestamp<0> literal if there are no seconds.

I'm not clear what makes precision_timestamp different from the interval types on this front.

vbarua commented 1 month ago

I'm onboard for encoding precision into intervals in some manner.

Would a microsecond precision system reject a plan with a nanosecond precision literal

I think it should. If the consumer does not support the requested precision it should blow up to signal this to the producer. The producer can do the casts on its side if it's okay with truncation.

Even if the value is all 0, operations on that literal that use values read in from data could behave differently at different precisions. It also messes with function lookup within the plan, because if the producer is applying operations on top of the literal it has (hopefully) sent functions with the correct signatures for the precision it expects.

jacques-n commented 1 month ago

@westonpace and @EpsilonPrime , are either of you opposed to my proposal above?

westonpace commented 1 month ago

No further concerns. Adding precision to the existing and adding a compound interval both seem like improvements to me. How do you intend to add precision? By modifying the existing type or introducing a new type (as we did for timestamp)?

EpsilonPrime commented 1 month ago

I'm on board with the latest suggestion.

jacques-n commented 1 month ago

I'm inclined to modify existing interval type. I don't think there is much use yet.

@Blizzara , can you update the PR as discussed?

Update interval daysecond to support 0-9 precision (and rename the second field/docs to something like fractional_seconds)

Add new IntervalCompound type that includes all the fields from the other two types and is parameterized with 0-9 precision

Blizzara commented 1 month ago

@Blizzara , can you update the PR as discussed?

I’m currently on holidays, if someone wants to take over the PR in the meantime that’s fine by me, otherwise I’ll be happy to update it once I’m back!

The proposal overall sounds good to me!