Data Version - Githubissues

sffc commented 4 years ago

Preface: I listed a few different definitions of the word "version" in data-pipeline.md:

Data Version: A version reflecting the data itself, abstracted away from the format version and schema version. For example, CLDR 37 may be a data version.
Format Version: A version of the file format, abstracted away from the schema version and data version. For example, Protobuf 2 and Protobuf 3 are format versions.
Key Version: A version of a key requesting data corresponding to a struct definition, like "version 1 of decimal format symbols" (discussed further down in the doc)
Schema Version: A version of the schema, abstracted away from the format version and data version. For example, data may be reorganized within the JSON file between schema versions.

In this issue, I want to discuss Data Version, specifically with regard to CLDR.

For convenience, I will copy the section of my doc entitled "Data Version" in its entirety:

The data version is expected to be a well-defined, namespaced identifier for the origin of the data. For example, when represented as a string, the following might be data versions:

CLDR_37_alpha1 → Vanilla CLDR 37 alpha1

CLDR_37_alpha1_goog2020a → Google patch 2020a on top of CLDR 37 alpha1

FOO_1_1 → Version 1.1 of data from a hypothetical data source named Foo

The first data version subtag, or namespace, defines the syntax for the remainder of the identifier. For example, the CLDR namespace might accept two or three subtags: major version (37), minor version (alpha1), and optional patch version (goog2020a).

Note: The syntax for the data version is undefined at this time. What is shown above is merely a strawman example.
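To make the strawman concrete, here is a minimal Rust sketch of how such an identifier could be split into a namespace and subtags, with the namespace deciding how many subtags are allowed. The names `DataVersion`, `ParseError`, and `parse_data_version` are hypothetical and not part of any ICU4X API, and the underscore-separated syntax is only the strawman from above.

```rust
/// Hypothetical parsed form of the strawman data version string.
#[derive(Debug, PartialEq, Eq)]
pub struct DataVersion {
    /// First subtag, e.g. "CLDR" or "FOO".
    pub namespace: String,
    /// Remaining subtags, e.g. ["37", "alpha1", "goog2020a"].
    pub subtags: Vec<String>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ParseError {
    /// The namespace or one of the subtags was empty.
    EmptySubtag,
    /// Only a namespace was given.
    MissingSubtags,
    /// The namespace does not accept this many subtags.
    WrongSubtagCount { namespace: String, found: usize },
}

/// Splits on '_' and lets the namespace validate the rest.
pub fn parse_data_version(s: &str) -> Result<DataVersion, ParseError> {
    let mut parts = s.split('_');
    // `split` always yields at least one item, possibly empty.
    let namespace = parts.next().unwrap_or("");
    let subtags: Vec<String> = parts.map(String::from).collect();

    if namespace.is_empty() || subtags.iter().any(|t| t.is_empty()) {
        return Err(ParseError::EmptySubtag);
    }
    if subtags.is_empty() {
        return Err(ParseError::MissingSubtags);
    }
    // Strawman rule from the text: CLDR takes major, minor, and an
    // optional patch subtag. Other namespaces are not validated here.
    if namespace == "CLDR" && !(2..=3).contains(&subtags.len()) {
        return Err(ParseError::WrongSubtagCount {
            namespace: namespace.to_string(),
            found: subtags.len(),
        });
    }
    Ok(DataVersion {
        namespace: namespace.to_string(),
        subtags,
    })
}
```

Under these assumptions, `parse_data_version("CLDR_37_alpha1_goog2020a")` yields the namespace `CLDR` with three subtags, while `parse_data_version("CLDR_37")` is rejected because the minor version is missing.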

Some questions I want to discuss in this thread:

Where is the data version used?

In my doc, the data version is only used in the Response object: you ask a data provider for a specific key, and the data provider responds with a hunk of data and a data version associated with it.

Do we also want the data version elsewhere?

Should you be able to request a specific data version? If so, should we support some kind of semantic versioning?
Should a data provider announce what data versions it supports? What if it supports different data versions for different pieces of data, such as a data provider that pulls some locales from a data file and other locales from a web service?

What is the syntax for the data version?

I'd like to have a string representation of the data version such that we can pass it easily on interchange. @echeran had some ideas about the data version syntax. Some questions:

What is the crucial information that we want to convey in a data version?
What exact syntax conventions do we want to adopt?
Should CLDR and the UCD have different namespaces for the data version?

echeran commented 4 years ago

To start off the discussion... I think the data version is important for communicating between the user and library about the exact snapshot of data being expected & provided. So I think that we should allow the data version to be requested. In fact, I think that data requests already have an implicit context -- which is some particular data version. (Ex: different versions have different data keys available, so how do you know which data key to request without first basing on a specific version?) So if that context is implicit, we should consider making it explicit and requiring it.

As far as semantic versioning, I no longer give deference to it as the preferred way to do versioning or see the topic so singularly after seeing this talk. I think some REST APIs will just have some integer version N starting from 1 counting up for each change, which is neither better nor worse, it's just an alternative. And maybe all of the versioning schemes are the same in some sense, so long as they convey the essential properties of the code/data: 1) was there a change?, and 2) which version is before / after this one? SemVer makes sense in theory, but in practice, it can be full of internal quirks (Java versions, macOS versions) and odd comparisons (Clojure code written 12 years ago (1.0) still runs on the latest version (1.10), but when I last wrote Scala, the 2.x versions released annually generated binaries that were all binary incompatible with each other).

In order to assist the user in communicating which data versions are desired and supported, I think the data provider should be able to indicate which versions it supports. I think this is what the user would care most about, and the rest is details. If it turns out that versions N, N+1, and N+2 are all similar and change only in the ideal way (adding) and no removing/changing, then maybe it is easy for the data provider to optimize by storing the diffs -- either git style, or like how every RDBMS schema change always comes with an upgrade and downgrade script. But I think the user would care first about getting the answer to the question "Do you have data for key K in version X of the data?"

We can come up with systems that allow for more ambiguity (ex: user request of the provider, "give me the data for this key in the latest version of data that you have available"), but I think that's layering on more complexity by intertwining separate concerns. Whether we do or not, I think we should at least have a foundation of the simpler request (ex: "key K, version X").

sffc commented 4 years ago

> I think the data version is important for communicating between the user and library about the exact snapshot of data being expected & provided.

Okay. Can you elaborate on the use cases? In what context would you want "give me CLDR 37" rather than "give me the latest stable data version"?

I think the API should be key-based, since, as you say, a data provider might support different keys for different data versions. So, the question would be, "given this key, what data versions do you support?" followed by "give me this key in this specific data version".
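As a rough illustration of that two-step, key-based exchange, the following Rust sketch models a provider that can list the data versions it has for a key and then return that key for one specific data version. Everything here (`DataProvider`, `Response`, `ProviderError`, `InMemoryProvider`, and the key string in the usage note) is a hypothetical stand-in, not the actual ICU4X data provider API.

```rust
use std::collections::BTreeMap;

/// Hypothetical response: a hunk of data plus the data version it came from.
#[derive(Debug, PartialEq, Eq)]
pub struct Response {
    pub payload: String,
    pub data_version: String,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ProviderError {
    /// The provider has no data at all for this key.
    UnsupportedKey,
    /// The provider knows the key, but not in this data version.
    UnsupportedVersion,
}

/// Hypothetical key-based provider interface.
pub trait DataProvider {
    /// "Given this key, what data versions do you support?"
    fn supported_versions(&self, key: &str) -> Vec<String>;
    /// "Give me this key in this specific data version."
    fn load(&self, key: &str, data_version: &str) -> Result<Response, ProviderError>;
}

/// Toy provider backed by a map: key -> (data version -> payload).
#[derive(Default)]
pub struct InMemoryProvider {
    entries: BTreeMap<String, BTreeMap<String, String>>,
}

impl InMemoryProvider {
    pub fn insert(&mut self, key: &str, data_version: &str, payload: &str) {
        self.entries
            .entry(key.to_string())
            .or_default()
            .insert(data_version.to_string(), payload.to_string());
    }
}

impl DataProvider for InMemoryProvider {
    fn supported_versions(&self, key: &str) -> Vec<String> {
        self.entries
            .get(key)
            .map(|versions| versions.keys().cloned().collect::<Vec<String>>())
            .unwrap_or_default()
    }

    fn load(&self, key: &str, data_version: &str) -> Result<Response, ProviderError> {
        let versions = self.entries.get(key).ok_or(ProviderError::UnsupportedKey)?;
        let payload = versions
            .get(data_version)
            .ok_or(ProviderError::UnsupportedVersion)?;
        Ok(Response {
            payload: payload.clone(),
            data_version: data_version.to_string(),
        })
    }
}
```

A caller would first ask `supported_versions("decimal/symbols@1")` and then call `load` with one of the returned strings. Because the lookup is per key, two keys in the same provider can expose different sets of data versions.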

I think we can punt the semantic versioning problem to the data source. For example, "CLDR 37" could mean the most recent version of CLDR 37 (e.g., m1, alpha1, alpha2, beta1, rev1, rev2, rev3).

echeran commented 4 years ago

As far as use cases, I don't have specific ones in mind, so let me know what actually sounds reasonable in reality. I'm imagining that there's some resource-constrained device or Flutter app that cannot update itself, and therefore it has a particular version of ICU4X bundled that won't change. The logic (code) of ICU4X is written against a specific version of data, no different than ICU. The view I'm taking is that a new version of data may introduce a breaking change (incompatible with ICU4X code somehow), and having logical snapshots of data would prevent the breakage. (And I guess something like what git does for files could help do so efficiently by storing diffs).

But if we know for sure incremental changes in CLDR data never cause problems for ICU & ICU4X code, and if the problem is rather due to downstream user error in improperly handling i18n output w/o using i18n fns, then maybe the opposite of what I said is best -- trade the guarantees of strict version matching for the functionality gains of always using latest version of data. Or maybe having logical snapshots of data is impractical in reality for some reason.

Semantic versioning can be used in either of the 2 opposite alternatives described above and either way, I wouldn't mind. It seems to especially make sense if we assume that data updates never really break old code, and to just use semver's major version to indicate the exceptions to the assumption.

sffc commented 4 years ago

ICU4X is written against a specific key version, not data version. This is exactly why I split the concept of data versioning into four different types of versions, listed in the OP and in data-provider.md. ICU4X itself should not break when the data version changes, and we will have robust integration testing to ensure that is the case.

However, if downstream clients make bad assumptions, they might break on a data version bump. Based on experience, the most common way downstream clients break on locale data updates is by modifying the output string to conform to some style guide that their UI person created. With the data provider pipeline, we are giving clients a better way to override specific bits of locale data in a forward-compatible way, which I hope will help cover our bases and reduce the chances that a data version bump breaks a downstream client.

In other words, keys have a specific contract on how data should behave, and data versions are snapshots of locale data that conform to those contracts.

nciric commented 4 years ago

For me this is somewhat related to #150. The end user (developer) shouldn't care where the data is coming from, or which version of data is required. Somebody™ should configure the project for the developer (like we do in Chrome). We should also have sane defaults for smaller projects (no configuration necessary, but also not as flexible).

You can say that I am deferring the problem to another part of the system, and you would be right - I think all data dependencies should be resolved in data provider cache. Data provider cache should take care of async loading of data if necessary, or giving existing data back to the constructors.

Couple questions:

If I specify CLDR 37 alpha 1 as data version, would that apply to all cuts of that data? (30 locales, or special OEM cut that drops break iterator info for SEA locales)?
How do developers make sure that APPL 74 minor 8 data works with version X of code? I can see us making sure CLDR is tested, but how would others make sure that their data works with our code?

If we focus on CLDR data versioning and assume that a rational OEM would do expansions/cuts on it, then we can make sure our crates expose the version range (almost like features) they support. Say:

cldr_version_range = ["CLDR 37 alpha 1", "CLDR 39 beta 3"]

then we change code, and suddenly our range constricts:

cldr_version_range = ["CLDR 39", "CLDR 41 alpha 1"]

One problem I see with string versioning is that it's hard to do ranges.
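To illustrate the range concern only: comparing the raw strings sorts "CLDR 39" before "CLDR 39 beta 3", which is the wrong order for a release and its beta. A sketch of one way around that is to parse the string into a structured value whose ordering is derived field by field. The types `Stage` and `CldrVersion`, the helper names, and the "CLDR <major> [alpha|beta <n>]" syntax are assumptions taken from the examples in this comment, not an agreed format.

```rust
/// Hypothetical release stage; variant order defines Alpha < Beta < Release.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Stage {
    Alpha,
    Beta,
    Release,
}

/// Hypothetical structured CLDR version. Derived `Ord` compares
/// `major`, then `stage`, then `number`, in that order.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct CldrVersion {
    pub major: u32,
    pub stage: Stage,
    pub number: u32,
}

/// Parses strings shaped like "CLDR 37 alpha 1" or "CLDR 39".
pub fn parse_cldr_version(s: &str) -> Option<CldrVersion> {
    let mut words = s.split_whitespace();
    if words.next()? != "CLDR" {
        return None;
    }
    let major = words.next()?.parse::<u32>().ok()?;
    let (stage, number) = match words.next() {
        // No stage given: treat as the final release.
        None => (Stage::Release, 0),
        Some(name) => {
            let stage = match name {
                "alpha" => Stage::Alpha,
                "beta" => Stage::Beta,
                _ => return None,
            };
            (stage, words.next()?.parse::<u32>().ok()?)
        }
    };
    if words.next().is_some() {
        return None; // trailing words are not part of this sketch's syntax
    }
    Some(CldrVersion { major, stage, number })
}

/// Inclusive range check on the structured form.
pub fn in_range(v: CldrVersion, lo: CldrVersion, hi: CldrVersion) -> bool {
    lo <= v && v <= hi
}
```

With the first range above, "CLDR 38" falls inside it, while the final "CLDR 39" falls outside because a release orders after "CLDR 39 beta 3".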

sffc commented 4 years ago

> then we change code

Our code is built against a key version, not a data version (CLDR version).

sffc commented 4 years ago

> If I specify CLDR 37 alpha 1 as data version, would that apply to all cuts of that data? (30 locales, or special OEM cut that drops break iterator info for SEA locales)?

The answer is unclear and would be up to the data provider to decide, not ICU4X.

> How do developers make sure that APPL 74 minor 8 data works with version X of code? I can see us making sure CLDR is tested, but how would others make sure that their data works with our code?

As long as a provider provides all of the keys required for a particular version of ICU4X, then ICU4X will work. The provider will return an error result if an unsupported key is requested from it.

I don't know how to better illustrate the difference between key version and data version. Maybe this table will help?

CLDR 38 changed the currency symbol for a hypothetical currency from 1 code point to 2 code points. ICU4X code assumed that it was 1 code point. ICU4X adds a new key version that supports multiple code points, and starts using the new key instead of the old key.

| Data Version | CLDR Ground Truth | Key Version 1 | Key Version 2 |
|---|---|---|---|
| CLDR_37 | `"$"` | `"$"` | `"$"` (forward-port) |
| CLDR_38 | `"$$"` | `"$"` (back-port) | `"$$"` |

In other words, if CLDR data changes in an incompatible way, the CLDR-to-ICU4X transformer (which I'm working on) should continue backporting the data to produce the old key for a while, until we drop support for the old keys and ICU4X versions.
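A minimal Rust sketch of that idea follows, using the hypothetical currency symbol scenario from the table. The struct names and both functions are invented for illustration, and the back-port rule (keep only the first code point) is purely an assumption to make the example concrete. The real transformer may choose a different fallback.

```rust
/// Hypothetical key version 1: the struct assumes exactly one code point.
#[derive(Debug, PartialEq, Eq)]
pub struct CurrencySymbolV1 {
    pub symbol: char,
}

/// Hypothetical key version 2: the symbol may span several code points.
#[derive(Debug, PartialEq, Eq)]
pub struct CurrencySymbolV2 {
    pub symbol: String,
}

/// Produces the old key from any data version. For multi-code-point
/// ground truth this is a back-port. ASSUMPTION: the back-port keeps
/// only the first code point; empty input yields `None`.
pub fn to_key_v1(ground_truth: &str) -> Option<CurrencySymbolV1> {
    ground_truth
        .chars()
        .next()
        .map(|symbol| CurrencySymbolV1 { symbol })
}

/// Produces the new key from any data version. For older data with a
/// single code point this is a trivial forward-port.
pub fn to_key_v2(ground_truth: &str) -> CurrencySymbolV2 {
    CurrencySymbolV2 {
        symbol: ground_truth.to_string(),
    }
}
```

Code built against key version 1 keeps receiving a single `char` whether the ground truth is `"$"` or `"$$"`, while code built against key version 2 sees the full string in both cases.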

So, I hope we can stop saying "ICU4X version Y supports CLDR 35 through 38" and start saying "the ICU4X data transformer is capable of mapping CLDR 38 to support ICU4X versions Y through Z".

sffc commented 4 years ago

Tentative decisions from meeting on 2020-07-17:

Do not include the data version in the request. The data provider itself can pick the desired source. Clients can write a data provider that does the logic they desire.
Use strings such as "org.unicode.cldr@37.1" and "com.google@2020.1" for the data version in the response. The part after the '@' should conform to SemVer. Vendor-specific overlays should have their own version string, rather than appending a suffix to the CLDR version string.
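As a sketch of how a client might read such a response string, the Rust snippet below splits the identifier at the '@' into a source and a dotted numeric version. The names `DataVersion` and `parse_data_version` are hypothetical. Note that strict SemVer requires three components (major.minor.patch), whereas the example "37.1" has two, so this sketch deliberately accepts any number of numeric components rather than enforcing SemVer.

```rust
/// Hypothetical parsed form of a string such as "org.unicode.cldr@37.1".
#[derive(Debug, PartialEq, Eq)]
pub struct DataVersion<'a> {
    /// Part before the '@', e.g. "org.unicode.cldr".
    pub source: &'a str,
    /// Dot-separated numeric components after the '@', e.g. [37, 1].
    pub version: Vec<u64>,
}

/// Returns `None` if there is no '@', the source is empty, or any
/// version component is not an unsigned integer.
pub fn parse_data_version(s: &str) -> Option<DataVersion<'_>> {
    let (source, version) = s.split_once('@')?;
    if source.is_empty() {
        return None;
    }
    let version = version
        .split('.')
        .map(|part| part.parse::<u64>().ok())
        .collect::<Option<Vec<u64>>>()?;
    Some(DataVersion { source, version })
}
```

Because a vendor overlay carries its own string, a response could report "org.unicode.cldr@37.1" and "com.google@2020.1" as two separate values, and versions from the same source can be ordered by comparing their `version` vectors.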

sffc commented 4 years ago

CC @markusicu @macchiati

sffc commented 3 years ago

Revisit closer to 1.0

sffc commented 2 years ago

I wrote this doc:

https://docs.google.com/document/d/1yg_2l5FFo0aAuNi4jpgcIhIYjHqJyUoJWtMduyQ0vR8/edit

There is nothing implementable until after our first potential breaking data change after 1.0. So I'll put this on backlog until such a situation comes up.

unicode-org / icu4x

Data Version #165