Get rid of old blob formats in ICU4X 2.0

Manishearth commented 1 month ago

ICU4X supports BlobV1, BlobV2, and BlobV2Bigger.

We should get rid of BlobV1 for 2.0, V2 is strictly better.

cc @sffc

Manishearth commented 1 month ago

Also with https://github.com/unicode-org/icu4x/pull/5601, the format of BlobV2 will change anyway.

Should we rename it to BlobV1? Or call it BlobV3?

Manishearth commented 1 month ago

cc @robertbastian @zbraniecki

sffc commented 1 month ago

I prefer to make it publicly known as just "blob"

Manishearth commented 1 month ago

@sffc Don't we wish to continue to iterate on the format, though?

Manishearth commented 1 month ago

I'd like to separate out two discussions here:

Question 1: Firstly, we are invalidating the V1/V2 formats in 2.0 anyway, because of #5601. Should we get rid of the format that is currently called V1, keeping around the formats that are currently called V2/V2Bigger?

Question 2: In the new universe, what should these formats be called, in code, and in datagen arguments?

Manishearth commented 1 month ago

One thing to note is that https://github.com/unicode-org/icu4x/pull/5601 makes it a 2.0 blocker for us to do something here since it breaks V1 and V2. Or we have to make an explicit decision to be okay with the names meaning something different.

Manishearth commented 1 month ago

@sffc It currently is publicly known as "blob" in datagen, when creating a blob. However ICU4X supports consuming multiple blob formats. This issue is probably only going to be about the relatively internal (but public) enums around this.

Nope, it's blob and blob2 over datagen

Manishearth commented 1 month ago

Oh, also, should we provide nice diagnostics for people attempting to load V001 blobs? We could have BlobSchema still have V001, V002 enum variants that just load data slices or something.

sffc commented 1 month ago

Question 2: In the new universe, what should these formats be called, in code, and in datagen arguments?

The v1/v2 shows up in only the following two places, I think:

Datagen CLI: --format blob or --format blob2
Blob exporter API: BlobExporter::new_with_sink or BlobExporter::new_v2_with_sink

I don't really anticipate the blob format changing again in a breaking way post 2.0; ZeroTrie was the big thing, and that works. I've looked into alternatives and nothing beats ZeroTrie. If we need to break it, we can call a future format blob3 or something. I'm fine with saying that blob means "the default blob format for the current major release of icu4x-datagen".

Question 1: Firstly, we are invalidating the V1/V2 formats in 2.0 anyway, because of #5601. Should we get rid of the format that is currently called V1, keeping around the formats that are currently called V2/V2Bigger? ... Oh, also, should we provide nice diagnostics for people attempting to load V001 blobs? We could have BlobSchema still have V001, V002 enum variants that just load data slices or something.

This is a good idea. We should keep V001, V002, ... in the enum, but the BlobDataProvider constructor fails when it sees them. We should also create a new enum variant for the 2.0 format, since V002 won't work in ICU4X 2.0.

Manishearth commented 1 month ago

We should keep V001, V002, ... in the enum, but the BlobDataProvider constructor fails when it sees them.

Note: I'm not sure if this is easy to achieve with serde-derive. Might need a finicky manual impl.

Manishearth commented 1 month ago

Also re: question 2: I'm happy with this answer, this means we just need V2 and V2Bigger and those are the same format at datagen time.

Manishearth commented 1 month ago

I'm going to try and knock this out since it also blocks Flex removal.

I think I may have a way to conveniently do this without an onerous custom serde impl.

unicode-org / icu4x

Get rid of old blob formats in ICU4X 2.0 #5603