Open sffc opened 4 years ago
To start off the discussion... I think the data version is important for communicating between the user and library about the exact snapshot of data being expected & provided. So I think that we should allow the data version to be requested. In fact, I think that data requests already have an implicit context -- which is some particular data version. (Ex: different versions have different data keys available, so how do you know which data key to request without first basing on a specific version?) So if that context is implicit, we should consider making it explicit and requiring it.
As for semantic versioning, after seeing this talk I no longer defer to it as the preferred way to do versioning, nor do I see the topic so one-dimensionally. I think some REST APIs just have an integer version N that starts from 1 and counts up for each change, which is neither better nor worse; it's just an alternative. And maybe all versioning schemes are the same in some sense, so long as they convey the essential properties of the code/data: 1) was there a change?, and 2) which version is before / after this one? SemVer makes sense in theory, but in practice it can be full of internal quirks (Java versions, macOS versions) and odd comparisons: Clojure code written 12 years ago (1.0) still runs on the latest version (1.10), but when I last wrote Scala, the 2.x versions released annually all generated binaries that were incompatible with each other.
In order to assist the user in communicating which data versions are desired and supported, I think the data provider should be able to indicate which versions it supports. I think this is what the user would care most about, and the rest is details. If it turns out that versions N, N+1, and N+2 are all similar and change only in the ideal way (adding) and no removing/changing, then maybe it is easy for the data provider to optimize by storing the diffs -- either git style, or like how every RDBMS schema change always comes with an upgrade and downgrade script. But I think the user would care first about getting the answer to the question "Do you have data for key K in version X of the data?"
We can come up with systems that allow for more ambiguity (ex: user request of the provider, "give me the data for this key in the latest version of data that you have available"), but I think that's layering on more complexity by intertwining separate concerns. Whether we do or not, I think we should at least have a foundation of the simpler request (ex: "key K, version X").
> I think the data version is important for communicating between the user and library about the exact snapshot of data being expected & provided.
Okay. Can you elaborate on the use cases? In what context would you want "give me CLDR 37" rather than "give me the latest stable data version"?
I think the API should be key-based, since, as you say, a data provider might support different keys for different data versions. So, the question would be, "given this key, what data versions do you support?" followed by "give me this key in this specific data version".
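To make the two questions above concrete, here is a minimal sketch in Rust. Everything in it is hypothetical: `MapProvider`, its method names, and the key and version strings are illustrative assumptions, not an ICU4X API.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory provider: key -> (data version -> payload).
/// None of these names come from ICU4X; they are for illustration only.
pub struct MapProvider {
    entries: BTreeMap<String, BTreeMap<String, String>>,
}

impl MapProvider {
    pub fn new() -> Self {
        Self { entries: BTreeMap::new() }
    }

    /// Registers a payload for one key in one data version.
    pub fn insert(&mut self, key: &str, version: &str, payload: &str) {
        self.entries
            .entry(key.to_string())
            .or_default()
            .insert(version.to_string(), payload.to_string());
    }

    /// "Given this key, what data versions do you support?"
    /// Returns the versions in sorted string order; empty if the key is unknown.
    pub fn supported_versions(&self, key: &str) -> Vec<&str> {
        self.entries
            .get(key)
            .map(|versions| versions.keys().map(String::as_str).collect())
            .unwrap_or_default()
    }

    /// "Give me this key in this specific data version."
    pub fn load(&self, key: &str, version: &str) -> Option<&str> {
        self.entries.get(key)?.get(version).map(String::as_str)
    }
}
```

A caller would first call `supported_versions` for a key and then `load` with one of the returned versions; a key that is missing in a given version simply yields `None`.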
I think we can punt the semantic versioning problem to the data source. For example, "CLDR 37" could mean the most recent version of CLDR 37 (e.g., m1, alpha1, alpha2, beta1, rev1, rev2, rev3).
As far as use cases, I don't have specific ones in mind, so let me know what actually sounds reasonable in reality. I'm imagining that there's some resource-constrained device or Flutter app that cannot update itself, and therefore it has a particular version of ICU4X bundled that won't change. The logic (code) of ICU4X is written against a specific version of data, no different than ICU. The view I'm taking is that a new version of data may introduce a breaking change (incompatible with ICU4X code somehow), and having logical snapshots of data would prevent the breakage. (And I guess something like what git does for files could help do so efficiently by storing diffs).
But if we know for sure incremental changes in CLDR data never cause problems for ICU & ICU4X code, and if the problem is rather due to downstream user error in improperly handling i18n output w/o using i18n fns, then maybe the opposite of what I said is best -- trade the guarantees of strict version matching for the functionality gains of always using latest version of data. Or maybe having logical snapshots of data is impractical in reality for some reason.
Semantic versioning can be used in either of the 2 opposite alternatives described above and either way, I wouldn't mind. It seems to especially make sense if we assume that data updates never really break old code, and to just use semver's major version to indicate the exceptions to the assumption.
ICU4X is written against a specific key version, not data version. This is exactly why I split the concept of data versioning into four different types of versions, listed in the OP and in data-provider.md. ICU4X itself should not break when the data version changes, and we will have robust integration testing to ensure that is the case.
However, if downstream clients make bad assumptions, they might break on a data version bump. Based on experience, the most common way downstream clients break on locale data updates is by modifying the output string to conform to some style guide that their UI person created. With the data provider pipeline, we are giving clients a better way to override specific bits of locale data in a forward-compatible way, which I hope will help cover our bases and reduce the chances that a data version bump breaks a downstream client.
In other words, keys have a specific contract on how data should behave, and data versions are snapshots of locale data that conform to those contracts.
For me this is somewhat related to #150. The end user (developer) shouldn't care where the data is coming from, or which version of data is required. Somebody™ should configure the project for the developer (like we do in Chrome). We should also have sane defaults for smaller projects (no configuration necessary, but also not as flexible).
You can say that I am deferring the problem to another part of the system, and you would be right - I think all data dependencies should be resolved in the data provider cache. The data provider cache should take care of async loading of data if necessary, or of giving existing data back to the constructors.
Couple questions:
If we focus on CLDR data versioning and assume that a rational OEM would do expansions/cuts on it, then we can make sure our crates expose the version range (almost like features) they support. Say:
cldr_version_range = ["CLDR 37 alpha 1", "CLDR 39 beta 3"]
then we change code, and suddenly our range constricts:
cldr_version_range = ["CLDR 39", "CLDR 41 alpha 1"]
One problem I see with string versioning is that it's hard to do ranges.
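One way around the range problem is to parse the string into a structured value that has a total order. The sketch below is only an illustration under stated assumptions: the `CLDR <major> [alpha|beta <n>]` grammar is inferred from the two example strings above, the type and function names are hypothetical, and treating a bare `CLDR 39` as a final release that sorts after its pre-releases is an assumption.

```rust
/// Release stage, ordered by declaration: Alpha < Beta < Release.
/// Assumption: a version string with no stage is a final release.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Stage {
    Alpha,
    Beta,
    Release,
}

/// Derived `Ord` compares fields in order: major, then stage, then stage_number.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct CldrVersion {
    pub major: u32,
    pub stage: Stage,
    pub stage_number: u32,
}

/// Parses strings shaped like "CLDR 37 alpha 1" or "CLDR 39" (hypothetical grammar).
pub fn parse_cldr_version(s: &str) -> Option<CldrVersion> {
    let mut parts = s.split_whitespace();
    if parts.next()? != "CLDR" {
        return None;
    }
    let major = parts.next()?.parse().ok()?;
    let (stage, stage_number) = match parts.next() {
        None => (Stage::Release, 0),
        Some("alpha") => (Stage::Alpha, parts.next()?.parse().ok()?),
        Some("beta") => (Stage::Beta, parts.next()?.parse().ok()?),
        Some(_) => return None,
    };
    // Reject trailing tokens so malformed strings do not parse silently.
    if parts.next().is_some() {
        return None;
    }
    Some(CldrVersion { major, stage, stage_number })
}

/// Inclusive range check over the structured form.
pub fn in_range(v: CldrVersion, low: CldrVersion, high: CldrVersion) -> bool {
    low <= v && v <= high
}
```

With this ordering, `CLDR 38` falls inside the range from `CLDR 37 alpha 1` to `CLDR 39 beta 3`, while the final `CLDR 39` falls outside it because it sorts after `CLDR 39 beta 3`.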
> then we change code
Our code is built against a key version, not a data version (CLDR version).
- If I specify CLDR 37 alpha 1 as data version, would that apply to all cuts of that data? (30 locales, or special OEM cut that drops break iterator info for SEA locales)?
The answer is unclear and would be up to the data provider to decide, not ICU4X.
- How do developers make sure that APPL 74 minor 8 data works with version X of code? I can see us making sure CLDR is tested, but how would others make sure that their data works with our code?
As long as a provider provides all of the keys required for a particular version of ICU4X, then ICU4X will work. The provider will return an error result if an unsupported key is requested from it.
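As a rough sketch of that contract, a vendor could check its own provider against the list of keys a given ICU4X build needs. The types, the error variant, and the `name@version` key spelling below are all hypothetical and chosen only for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical error type: the provider does not know the requested key.
#[derive(Debug, PartialEq, Eq)]
pub enum LoadError {
    UnsupportedKey(String),
}

/// Hypothetical provider backed by a plain key -> payload map.
pub struct SimpleProvider {
    data: HashMap<String, String>,
}

impl SimpleProvider {
    pub fn from_pairs(pairs: &[(&str, &str)]) -> Self {
        let data = pairs
            .iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect();
        Self { data }
    }

    /// Returns the payload, or an error result for an unsupported key.
    pub fn load(&self, key: &str) -> Result<&str, LoadError> {
        self.data
            .get(key)
            .map(String::as_str)
            .ok_or_else(|| LoadError::UnsupportedKey(key.to_string()))
    }
}

/// Lists the required keys that the provider cannot serve.
/// An empty result means the provider covers everything in `required`.
pub fn missing_keys<'a>(provider: &SimpleProvider, required: &[&'a str]) -> Vec<&'a str> {
    required
        .iter()
        .copied()
        .filter(|key| provider.load(key).is_err())
        .collect()
}
```

A check like `missing_keys` could run in a vendor's own integration tests, which is one possible answer to how others might verify that their data works with a given version of the code.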
I don't know how to better illustrate the difference between key version and data version. Maybe this table will help?
CLDR 38 changed the currency symbol for a hypothetical currency from 1 code point to 2 code points. ICU4X code assumed that it was 1 code point. ICU4X adds a new key version that supports multiple code points, and starts using the new key instead of the old key.
Data Version | CLDR Ground Truth | Key Version 1 | Key Version 2 |
---|---|---|---|
CLDR_37 | "$" | "$" | "$" (forward-port) |
CLDR_38 | "$$" | "$" (back-port) | "$$" |
In other words, if CLDR data changes in an incompatible way, the CLDR-to-ICU4X transformer (which I'm working on) should continue backporting the data to produce the old key for a while, until we drop support for the old keys and ICU4X versions.
So, I hope we can stop saying "ICU4X version Y supports CLDR 35 through 38" and start saying "the ICU4X data transformer is capable of mapping CLDR 38 to support ICU4X versions Y through Z".
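The hypothetical currency example can be sketched as two transformer functions, one per key version. This is an illustration only: the function names are made up, and back-porting by keeping the first code point is an assumption chosen to reproduce the table's values, not a description of the real CLDR-to-ICU4X transformer.

```rust
/// Key version 1 contract: the currency symbol is exactly one code point.
/// Back-port (assumption for illustration): keep only the first code point
/// of the CLDR ground truth, so CLDR_38's "$$" still yields '$'.
pub fn to_key_v1(cldr_symbol: &str) -> Option<char> {
    cldr_symbol.chars().next()
}

/// Key version 2 contract: the currency symbol is one or more code points.
/// Forward-port: older single-code-point data already satisfies this contract,
/// so it is passed through unchanged.
pub fn to_key_v2(cldr_symbol: &str) -> Option<String> {
    if cldr_symbol.is_empty() {
        None
    } else {
        Some(cldr_symbol.to_string())
    }
}
```

Running both functions over the CLDR_37 value `"$"` and the CLDR_38 value `"$$"` gives the four data cells of the table: the code stays pinned to a key contract while the transformer absorbs the data change.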
Tentative decisions from meeting on 2020-07-17:
- Use strings such as "org.unicode.cldr@37.1" and "com.google@2020.1" for the data version in the response. The part after the '@' should conform to SemVer. Vendor-specific overlays should have their own version string, rather than appending a suffix to the CLDR version string.

CC @markusicu @macchiati
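For illustration, a data version string of this shape could be split into its source and version parts roughly as follows. This is a loose sketch with hypothetical names: it only accepts dot-separated non-negative integers after the '@' and does not validate full SemVer (strict SemVer uses three numeric components plus optional pre-release and build tags).

```rust
/// Hypothetical parsed form of a string like "org.unicode.cldr@37.1".
#[derive(Debug, PartialEq, Eq)]
pub struct DataVersion {
    /// Reverse-domain style source name, e.g. "org.unicode.cldr".
    pub source: String,
    /// Numeric version components, e.g. [37, 1].
    pub components: Vec<u64>,
}

/// Splits on the first '@' and parses the remainder as dotted integers.
/// Returns None if the '@' is missing, the source is empty, or any
/// component is not a non-negative integer.
pub fn parse_data_version(s: &str) -> Option<DataVersion> {
    let (source, version) = s.split_once('@')?;
    if source.is_empty() {
        return None;
    }
    let components = version
        .split('.')
        .map(|part| part.parse::<u64>().ok())
        .collect::<Option<Vec<u64>>>()?;
    Some(DataVersion {
        source: source.to_string(),
        components,
    })
}
```

Because a vendor overlay such as "com.google@2020.1" carries its own source name, its components are only meaningfully comparable with other versions from the same source.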
Revisit closer to 1.0
I wrote this doc:
https://docs.google.com/document/d/1yg_2l5FFo0aAuNi4jpgcIhIYjHqJyUoJWtMduyQ0vR8/edit
There is nothing implementable until after our first potential breaking data change after 1.0. So I'll put this on backlog until such a situation comes up.
Preface: I listed a few different definitions of the word "version" in data-pipeline.md:
In this issue, I want to discuss Data Version, specifically with regard to CLDR.
For convenience, I will copy the section of my doc entitled "Data Version" in its entirety:
Some questions I want to discuss in this thread:
Where is the data version used?
In my doc, the data version is only used in the Response object: you ask a data provider for a specific key, and the data provider responds with a hunk of data and a data version associated with it.
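A minimal sketch of that shape, with the data version appearing only on the response side, might look like the following. The struct, trait, and provider names are hypothetical and are not taken from the doc or from ICU4X; the version string and payload are made-up sample values.

```rust
/// Hypothetical request: the caller asks for a key and nothing else.
pub struct DataRequest<'a> {
    pub key: &'a str,
}

/// Hypothetical response: a hunk of data plus the data version it came from.
pub struct DataResponse {
    pub payload: String,
    pub data_version: String,
}

/// Hypothetical provider interface: the version is reported, never requested.
pub trait VersionedProvider {
    fn load(&self, req: &DataRequest<'_>) -> Option<DataResponse>;
}

/// Toy provider that serves a single key from a single data snapshot.
pub struct FixedProvider;

impl VersionedProvider for FixedProvider {
    fn load(&self, req: &DataRequest<'_>) -> Option<DataResponse> {
        match req.key {
            "currency/symbol" => Some(DataResponse {
                payload: "$".to_string(),
                data_version: "org.unicode.cldr@37.1".to_string(),
            }),
            _ => None,
        }
    }
}
```

In this design the caller learns which snapshot it received and can log or compare it, but cannot pin a snapshot up front; whether the version should also appear in the request is the open question that follows.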
Do we also want the data version elsewhere?
What is the syntax for the data version?
I'd like to have a string representation of the data version such that we can pass it easily on interchange. @echeran had some ideas about the data version syntax. Some questions: