unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org

We should support running ICU4X tests with JSON data #5420

Open sffc opened 3 weeks ago

sffc commented 3 weeks ago

In the not-so-distant past, we ran ICU4X docs tests with the icu_testdata crate, which was backed by a mix of JSON and baked data files. In 1.3, we switched over to using compiled data on all locales.

While this simplified our build, I think there is still value in running tests against the JSON data:

  1. It is more understandable and debuggable. With JSON, I can change data and immediately re-run tests without re-compiling or re-building anything. If I want to tweak data while running a unit test, there is simply no easy way to do that with baked data: I currently need to build custom baked data against custom CLDR JSON, then set the ICU4X_DATA_DIR variable and hope that everything is up to date and in sync.
  2. Compiled data is slow to compile. Cumulatively, I've spent many hours waiting for icu_datetime and icu_datetime(test) to build in order to iterate on my semantic skeleta work. JSON data doesn't need to be compiled, so it would lead to a substantially faster Edit-Build-Test-Debug loop.
  3. I still maintain that there is value in having the entire lifecycle of test data living in the repo, rather than relying on compiled data that is downloaded from GitHub into /tmp and then immediately converted into a machine-readable data format.

My proposal to fix this is:

  1. Add a datagen format that is a code module API-compatible with compiled data but backed by an FS data provider (a rough sketch follows this list)
  2. Restructure the debug JSON data files in the repo to be in this new code module
  3. Add some cargo aliases that make it easy to locally set ICU4X_DATA_DIR
  4. Run it in CI, either as the default job, an additional job, or piggy-backing on one of our specialized jobs like test-gigo
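
A rough sketch of the FS-backed side of items 1 through 3. Everything here is hypothetical (the module name, the data path, and the test body), and it assumes the current `icu_provider_fs` API plus the `*_with_buffer_provider` constructors with the `serde` feature enabled. Making this fully API-compatible with compiled data would additionally require datagen to emit the matching `DataProvider` impls:

```rust
/// Hypothetical stand-in for the generated module: instead of embedding
/// baked statics, it opens the checked-in debug JSON tree at runtime.
pub mod json_test_data {
    use icu_provider_fs::FsDataProvider;

    pub fn provider() -> FsDataProvider {
        // Hypothetical location of the restructured JSON data (item 2).
        FsDataProvider::try_new(concat!(env!("CARGO_MANIFEST_DIR"), "/data/json"))
            .expect("checked-in JSON data should be present")
    }
}

#[test]
fn datetime_smoke_test() {
    use icu::datetime::{options::length, DateTimeFormatter};
    use icu::locid::locale;

    // Buffer-provider constructors deserialize at runtime, so editing the
    // JSON files takes effect on the next `cargo test` with no datagen step.
    let options = length::Bag::from_date_time_style(length::Date::Medium, length::Time::Short);
    let _formatter = DateTimeFormatter::try_new_with_buffer_provider(
        &json_test_data::provider(),
        &locale!("en").into(),
        options.into(),
    )
    .expect("en datetime data should load from JSON");
}
```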

Additional notes:

CC @robertbastian @Manishearth

Manishearth commented 3 weeks ago

I'm open to this.

I'd love to reduce the overall amount of generated stuff in-tree, though.

Manishearth commented 3 weeks ago

Actually, I'd like us to run tests on all three primary data formats (baked, blob, JSON), generated at CI time, so that we don't accidentally break, e.g., zerovec postcard deserialization.

Proposal: check in JSON data, and add tasks that replace it with postcard or baked data in a way that allows transparently running tests.

We could potentially have a passthrough TestDataProvider or something
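
One possible shape for that passthrough provider, as a sketch only: the enum, the environment variable, and the staged file paths are all made up, and it assumes the 1.x `BufferProvider` trait:

```rust
use icu_provider::buf::{BufferMarker, BufferProvider};
use icu_provider::prelude::*;
use icu_provider_blob::BlobDataProvider;
use icu_provider_fs::FsDataProvider;

/// Dispatches to whichever concrete provider a CI task has staged, so the
/// same test code can run against JSON, postcard, or (via a separate
/// compiled-data build) baked data.
pub enum TestDataProvider {
    Json(FsDataProvider),
    Postcard(BlobDataProvider),
}

impl TestDataProvider {
    pub fn from_env() -> Self {
        // `ICU4X_TEST_DATA_FORMAT` is a made-up knob for this sketch.
        match std::env::var("ICU4X_TEST_DATA_FORMAT").as_deref() {
            Ok("postcard") => Self::Postcard(
                BlobDataProvider::try_new_from_blob(
                    std::fs::read("data/test.postcard").unwrap().into(),
                )
                .unwrap(),
            ),
            _ => Self::Json(FsDataProvider::try_new("data/json").unwrap()),
        }
    }
}

impl BufferProvider for TestDataProvider {
    fn load_buffer(
        &self,
        key: DataKey,
        req: DataRequest,
    ) -> Result<DataResponse<BufferMarker>, DataError> {
        // Pure passthrough: tests stay agnostic to the underlying format.
        match self {
            Self::Json(p) => p.load_buffer(key, req),
            Self::Postcard(p) => p.load_buffer(key, req),
        }
    }
}
```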

sffc commented 3 weeks ago

We should definitely have a test, if we don't already have one, that JSON data can be converted to Postcard and back, with PartialEq holding on the round-trip. I don't think we get too much value out of running tests on top of that? If the deserialization works and produces the same data as JSON/baked, then the tests should work.
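
Something like this, perhaps, using `HelloWorldV1` as a stand-in for real data structs (a real test would iterate every key and locale; this assumes `serde_json` and `postcard` as dev-dependencies and the relevant serde features on `icu_provider`):

```rust
use icu_provider::hello_world::HelloWorldV1;

#[test]
fn json_postcard_roundtrip() {
    let json = r#"{"message": "Hello World"}"#;
    let from_json: HelloWorldV1 = serde_json::from_str(json).expect("valid JSON");

    // JSON -> struct -> postcard bytes -> struct; PartialEq must hold.
    let bytes = postcard::to_allocvec(&from_json).expect("postcard serialization");
    let from_postcard: HelloWorldV1 =
        postcard::from_bytes(&bytes).expect("postcard deserialization");

    assert_eq!(from_json, from_postcard);
}
```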

Manishearth commented 3 weeks ago

I don't think we get too much value out of running tests on top of that?

There are a lot of different efficient ways of reading the data, especially for complex objects like Patterns. So I think there's some value in testing all those codepaths, but this isn't a strong opinion of mine.

robertbastian commented 3 weeks ago

Add a datagen format that is a code module API-compatible with compiled data but backed by an FS data provider

You've proposed this before and it doesn't work because of singletons.