sffc closed 3 weeks ago
This isn't 2.0-breaking but it requires bumping the version of the data struct, so we can do it around the 2.0 timeframe.
@sffc so there are two ways I can do this:
- ZeroVec<char>, where the AsULE does a UTF-32 conversion
- ZeroVec<U24>, where the AsULE is a cheap conversion (but this needs a new type)
Thoughts?
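To make the trade-off between the two options concrete, here is a minimal std-only sketch. The `U24` type and its methods are hypothetical names invented for illustration; this is not the zerovec API and does not implement AsULE. It only shows that widening three bytes to a `u32` is infallible, while going to `char` needs a validation step.

```rust
/// Hypothetical 3-byte, alignment-1 integer (illustrative only, not a zerovec type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct U24([u8; 3]);

impl U24 {
    /// Largest value that fits in 24 bits.
    pub const MAX: u32 = 0x00FF_FFFF;

    /// Keeps the three low little-endian bytes; returns None if the value needs more.
    pub fn try_from_u32(value: u32) -> Option<Self> {
        if value > Self::MAX {
            return None;
        }
        let [a, b, c, _] = value.to_le_bytes();
        Some(U24([a, b, c]))
    }

    /// Cheap conversion: pad with a zero high byte. This cannot fail.
    pub fn to_u32(self) -> u32 {
        let [a, b, c] = self.0;
        u32::from_le_bytes([a, b, c, 0])
    }

    /// Validating conversion: surrogates and values above char::MAX yield None.
    pub fn to_char(self) -> Option<char> {
        char::from_u32(self.to_u32())
    }
}
```

The check inside `char::from_u32` is the extra per-element work that a `char`-typed vector would have to account for somewhere, either when validating the bytes or when reading elements.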
Hmm, so a couple tricky things:
- char, or keep that and use [u8; 3] with no invariants
- char::MAX + 1 for open ranges. Supporting this would be a tricky code change to allow for the list to have an odd number of elements.
Yeah, this is not a straightforward fix if we're moving to ZeroVec<char>. Some u24 type could still work.
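The open-range point can be illustrated with a toy inversion list. This is a sketch under assumptions, not the icu_collections implementation: `OPEN_END` and `contains` are hypothetical names, and the list is a plain slice of `u32` boundaries in which each pair is a start (inclusive) and an end (exclusive).

```rust
/// One past the last Unicode scalar value; used here as the exclusive end of an open range.
const OPEN_END: u32 = char::MAX as u32 + 1; // 0x110000

/// Toy membership test: a code point is in the set when an odd number of
/// boundaries are less than or equal to it.
fn contains(inv_list: &[u32], cp: u32) -> bool {
    // The list is sorted, so partition_point counts the boundaries <= cp.
    let boundaries_at_or_before = inv_list.partition_point(|&b| b <= cp);
    boundaries_at_or_before % 2 == 1
}
```

For example, `[0x41, 0x5B, 0x10000, OPEN_END]` describes A through Z plus everything from U+10000 upward. The last boundary fits in 24 bits, but `char::from_u32(OPEN_END)` returns `None`, so a `char`-typed list could not store it and would have to end on an unpaired start boundary instead.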
Currently it is ZeroVec<'data, u32>
So, we should change it to ZeroVec<'data, U24>
Besides the challenges you mentioned with ZeroVec<'data, char>, I consider that a non-starter since it makes deserialization much more expensive.
I consider U24 to be a type that is in scope for zerovec to export.
Personally I don't see a benefit of an aligned U24: can we just use RawBytes and give it helper methods to convert to/from char?
ZeroVec<UnvalidatedChar>, and UnvalidatedChar::ULE is RawBytes<3>?
We actually already have potential_utf::PotentialCodePoint with this behavior, duh. Maybe use this?
https://unicode-org.github.io/icu4x/rustdoc/potential_utf/struct.PotentialCodePoint.html
Yeah, though that's a new dep on icu_collections. Probably okay?
We were trying to pare down icu_properties transitive deps.
Yeah that's the type I meant, it used to be called UnvalidatedChar.
https://github.com/unicode-org/icu4x/pull/5645 . It works, though the API needed to change in a bunch of places.
potential_utf is extremely small and low-level. Based on my understanding, they're more concerned about compile times than crate count.
Okay, that's good enough for me!
Since we only need values up to 0x110000 in a CodePointInversionList, we should use a U24 instead of a u32 to save space.
CC @echeran
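As a rough back-of-the-envelope check on the space claim, the sketch below compares the payload size of n boundaries stored as `u32` against a 3-byte representation. `storage_bytes` is a hypothetical helper, `[u8; 3]` stands in for whatever 24-bit type is chosen, and any container or serialization overhead is ignored.

```rust
use std::mem::size_of;

/// Payload bytes needed for `n` elements of type `T` (ignores any header or padding
/// that a real container format might add).
fn storage_bytes<T>(n: usize) -> usize {
    n * size_of::<T>()
}

/// Largest boundary an inversion list needs: one past char::MAX.
const MAX_BOUNDARY: u32 = 0x110000;
```

With these definitions, 1000 boundaries take `storage_bytes::<u32>(1000)` = 4000 bytes versus `storage_bytes::<[u8; 3]>(1000)` = 3000 bytes, a 25% reduction in payload, and `MAX_BOUNDARY` is below `1 << 24`, so no value is lost.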