substrait-io / substrait

A cross platform way to express data transformation, relational algebra, standardized record expression and plans.
https://substrait.io
Apache License 2.0
1.14k stars 148 forks source link

Nanosecond precision in timestamps #592

Closed EpsilonPrime closed 6 months ago

EpsilonPrime commented 7 months ago

There are some engines that support nanosecond precision in their timestamps (Spark would be one of these). All of the time and timestamp types in Substrait only support millisecond precision. It might be useful to provide guidance on how to handle this in such engines.

Some potential approaches:

  1. Create user-defined types to represent nanosecond precision.
  2. Change the existing types by either changing them to nanosecond precision or allowing the precision to be requested (timestamp[6] vs timestamp[9]).
  3. Recommend truncation of nanosecond precision to millisecond precision.
vibhatha commented 7 months ago

This is merely based on the proposed approaches.

Approach 1, User-defined types would be a more dynamic approach which could provide flexibility, though, I believe Approach 2 or Approach 3 would rather be helpful to add the required support to the Substrait spec. Having things user-defined means it could support but it would be better to keep the feature part of the core framework?

And between Approach 2 and Approach 3, seems like Approach 2 would be better.

westonpace commented 7 months ago

On the slack there was concern that approach 1 is too difficult. If this is true then maybe we need to revisit our support for user defined types. A "precision-parameterized-timestamp" should be usable everywhere that a "timestamp" is usable (same goes for a "precision-parameterized-timestamp_tz"). The only difference I think is that there is one extra method for extracting nanoseconds. As a result it should be as straightforward as defining the type with INHERITS. So I think this should be a poster child for user defined types.

All that being said, I support approach 2 as long as it is allowing the precision to be requested and not changing them to nanosecond precision

arkanovicz commented 7 months ago

Does it make sense to have the precision parameter be optional and defaulting to 6, if backward compatibility is a concern?

richtia commented 7 months ago

Approach 1: If i'm understanding correctly, we would just create a new type_variation? These, and other existing user defined types, are harder to search for on the site. But this new precision-parameterized-timestamp seems like it would be useful for people to know about since there are a multiple people interested in it. As Vibatha mentioned, it doesn't seem like it would be part of the core framework.

Approach 2 (allowing precision to be requested): This would change the timestamp from being a Simple Type to a Compound Type since it now needs to be configured with a parameter right? If that's the case, is there a way to set a default value for the parameter? That would also mean we need to update the all existing timestamp input args in the extension functions that use them.

EpsilonPrime commented 7 months ago

There are two places where we'd be concerned about defaulting the type for a parameterized time. There's the protos and then there's the YAML. Defaulting the value in the protos should be straightforward and would allow for backwards compatibility of old data (newer data run in older clients wouldn't be caught though). The next place to update the type would be in the YAML files. Something like what is happening in the arithmetic_decimal file seems like what we'd need to do. Finally after the change goes in we need to start updating consumers to be aware of the new parameter.

Times would definitely be a compound type going forward. However I don't think we want to have YAMLs going forward that allow an unparameterized type. The mere presence of the parameter is going to clue folks in that they need to change their code.