Closed Manishearth closed 1 month ago
Also with https://github.com/unicode-org/icu4x/pull/5601, the format of BlobV2 will change anyway.
Should we rename it to BlobV1? Or call it BlobV3?
cc @robertbastian @zbraniecki
I prefer to make it publicly known as just "blob"
@sffc Don't we wish to continue to iterate on the format, though?
I'd like to separate out two discussions here:
Question 1: Firstly, we are invalidating the V1/V2 formats in 2.0 anyway, because of #5601. Should we get rid of the format that is currently called V1, keeping around the formats that are currently called V2/V2Bigger?
Question 2: In the new universe, what should these formats be called, in code, and in datagen arguments?
One thing to note is that https://github.com/unicode-org/icu4x/pull/5601 makes it a 2.0 blocker for us to do something here since it breaks V1 and V2. Or we have to make an explicit decision to be okay with the names meaning something different.
@sffc It currently is publicly known as "blob" in datagen, when creating a blob. However ICU4X supports consuming multiple blob formats. This issue is probably only going to be about the relatively internal (but public) enums around this.
Nope, it's blob and blob2 over datagen
Oh, also, should we provide nice diagnostics for people attempting to load V001 blobs? We could have BlobSchema still have V001, V002 enum variants that just load data slices or something.
Question 2: In the new universe, what should these formats be called, in code, and in datagen arguments?
The v1/v2 shows up in only the following two places, I think:
--format blob
or --format blob2
BlobExporter::new_with_sink
or BlobExporter::new_v2_with_sink
I don't really anticipate the blob format changing again in a breaking way post 2.0; ZeroTrie was the big thing, and that works. I've looked into alternatives and nothing beats ZeroTrie. If we need to break it, we can call a future format blob3
or something. I'm fine with saying that blob
means "the default blob format for the current major release of icu4x-datagen".
Question 1: Firstly, we are invalidating the V1/V2 formats in 2.0 anyway, because of #5601. Should we get rid of the format that is currently called V1, keeping around the formats that are currently called V2/V2Bigger? ... Oh, also, should we provide nice diagnostics for people attempting to load V001 blobs? We could have BlobSchema still have V001, V002 enum variants that just load data slices or something.
This is a good idea. We should keep V001, V002, ... in the enum, but the BlobDataProvider constructor fails when it sees them. We should also create a new enum variant for the 2.0 format, since V002
won't work in ICU4X 2.0.
We should keep V001, V002, ... in the enum, but the BlobDataProvider constructor fails when it sees them.
Note: I'm not sure if this is easy to achieve with serde-derive. Might need a finicky manual impl.
Also re: question 2: I'm happy with this answer, this means we just need V2 and V2Bigger and those are the same format at datagen time.
I'm going to try and knock this out since it also blocks Flex removal.
I think I may have a way to conveniently do this without an onerous custom serde impl.
ICU4X supports BlobV1, BlobV2, and BlobV2Bigger.
We should get rid of BlobV1 for 2.0, V2 is strictly better.
cc @sffc