substrait-io / substrait

A cross platform way to express data transformation, relational algebra, standardized record expression and plans.
https://substrait.io
Apache License 2.0

Add collation support to relevant data types (new types or enhance existing) #685

Open jacques-n opened 3 weeks ago

jacques-n commented 3 weeks ago

We need to solve collation. Because collation can differ across fields and literals (e.g. a table with two columns of different collations, or two tables with different collations), I propose that we add collation as a property of the string data types. There are three main string data types:

It seems like we have two options for how to introduce collation:

  1. introduce 3 new types that include a collation property
  2. enhance existing types so they have collation properties but are backwards compatible to avoid migration pain.

I suggest we do the second option. I think this would lead to having a new way to express compound types with default options. For example, maybe we say the following would both be legal:

string => a string type with default collation
string<af_na> => a string type with [af_na collation](https://www.localeplanet.com/icu/af-NA/index.html)

I propose we use ICU locale names to reference collations, with the addition of a pseudo-collation called binary. Binary would be the default collation when no parameter is given.
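To make the defaulting rule concrete, here is a minimal sketch of how a consumer might read the two forms above. It is illustrative only: `parse_string_type`, `StringType`, and the regular expression are hypothetical names, not part of Substrait, and the sketch covers only the bare `string` type because the syntax for combining a collation with other type parameters is not settled in this proposal.

```python
import re
from typing import NamedTuple

# Assumption from the proposal: "binary" is the collation used when none is given.
DEFAULT_COLLATION = "binary"


class StringType(NamedTuple):
    """Hypothetical in-memory form of a string type with its collation."""
    base: str
    collation: str


# Accepts "string" or "string<collation_name>"; the allowed characters in the
# collation name are an assumption made for this sketch.
_TYPE_RE = re.compile(r"^(string)(?:<([A-Za-z0-9_]+)>)?$")


def parse_string_type(text: str) -> StringType:
    """Parse a type expression, filling in the default collation if omitted."""
    match = _TYPE_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a recognized string type: {text!r}")
    base, collation = match.groups()
    return StringType(base, collation or DEFAULT_COLLATION)
```

Under this reading, existing plans that say only `string` keep parsing unchanged and simply resolve to the binary collation.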

In function definitions I would be inclined to say that if an argument is specified without a collation, the function applies to all collations (as opposed to what might be interpreted as only the binary collation). This means that string in a plan would mean something slightly different from string in an extension, but I think this compromise gives the best balance of backwards compatibility and likely expected behavior.
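The difference between the two meanings can be sketched as a matching rule. This is only an illustration of the idea under the assumptions stated in this issue: `argument_matches` is a hypothetical helper, with `None` standing for "no collation given in the function definition" and the plan side always carrying a concrete collation (binary if omitted).

```python
from typing import Optional


def argument_matches(declared_collation: Optional[str], actual_collation: str) -> bool:
    """Check a plan argument's collation against a function definition.

    declared_collation: collation written in the extension's function
        definition, or None if the definition says just `string`.
    actual_collation: concrete collation of the argument in the plan.
    """
    if declared_collation is None:
        # An unparameterized `string` in an extension accepts every collation.
        return True
    # A parameterized declaration only accepts that exact collation.
    return declared_collation == actual_collation
```

With this rule, a function declared over plain `string` keeps working for `string<af_na>` arguments, while a function declared over `string<binary>` would not.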

Thoughts?