CodePointInversionList should be backed by ZeroVec<U24>

unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.

https://icu4x.unicode.org

Other

1.37k stars 176 forks source link

Closed sffc closed 3 weeks ago

sffc commented 1 year ago

Since we only need values up to 0x110000 in a CodePointInversionList, we should use a U24 instead of a u32 to save space.

CC @echeran

sffc commented 1 year ago

This isn't 2.0-breaking but it requires bumping the version of the data struct, so we can do it around the 2.0 timeframe.

Manishearth commented 1 month ago

@sffc so there are two ways I can do this:

use ZeroVec<char>, where the AsULE does a UTF-32 conversion
use ZeroVec<U24> where the AsULE is a cheap conversion (but this needs a new type)

Thoughts?

Manishearth commented 1 month ago

Hmm, so a couple tricky things:

It currently supports ranges over the surrogate pairs. We could drop that and use char, or keep that and use [u8; 3] with no invariants
It uses char::MAX + 1 for open ranges. Supporting this would be a tricky code change to allow for the list to have an odd number of elements.

Manishearth commented 1 month ago

Yeah, this is not a straightforward fix if we're moving to ZeroVec<char>. Some u24 type could still work.

sffc commented 3 weeks ago

Currently it is ZeroVec<'data, u32>

So, we should change it to ZeroVec<'data, U24>

Besides the challenges you mentioned with ZeroVec<'data, char>, I consider that a non-starter since it makes deserialization much more expensive.

I consider U24 to be a type that is in scope for zerovec to export.

Manishearth commented 3 weeks ago

Personally I don't see a benefit of an aligned U24: can we just use RawBytes and give it helper methods to convert to/from char?

robertbastian commented 3 weeks ago

ZeroVec<UnvalidatedChar>, and UnvalidatedChar::ULE is RawBytes<3>?

sffc commented 3 weeks ago

We actually already have potential_utf::PotentialCodePoint with this behavior, duh. Maybe use this?

Manishearth commented 3 weeks ago

Yeah, though that's a new dep on icu_collections. Probably okay?

We were trying to pare down icu_properties transitive deps.

robertbastian commented 3 weeks ago

Yeah that's the type I meant, it used to be called UnvalidatedChar.

Manishearth commented 3 weeks ago

https://github.com/unicode-org/icu4x/pull/5645 . It works, though the API needed to change in a bunch of places.

sffc commented 3 weeks ago

potential_utf is extremely small and low-level. Based on my understanding, they're more concerned about compile times than crate count.

Manishearth commented 3 weeks ago

Okay, that's good enough for me!