unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org
Other
1.37k stars 176 forks source link

Move normalization data scalar value validation from GIGO to deserialization #2458

Open hsivonen opened 2 years ago

hsivonen commented 2 years ago

(It was previously concluded in a meeting that it's not a post-1.0 breaking change to switch from bogus data causing GIGO to bogus data causing a constructor to err out.)

In the normalization data:

ZeroVec<'data, u16> should use a BmpChar type (that has the same bit representation as u16) instead of using u16. The BmpChar type should be validated at deserialization time not to be in the surrogate range. There should be a way to get a char out of BmpChar without the caller having to use an unsafe block.

ZeroVec<'data, U24> should use char instead of U24.

The u32 trie value should be replaced with some kind of special type that has the same bit representation as u32 and that has these constraints:

If the low half is 1 or 0: No further constraints. If the high half is 0: The low half is either a BMP non-surrogate value or a value between 0xD801 and 0xD8FE, inclusive. Otherwise: Both halves are BMP non-surrogate values.

sffc commented 1 year ago

Does the trie value look something like this?

enum TrieValueForNorm {
  Zero,
  One(u16),
  SingleBmp(BmpChar),
  DoubleBmp(BmpChar, BmpChar)
}
sffc commented 1 year ago

Discussion: the type can be

struct TrieValueForNorm(u32);

impl TrieValueForNorm {
  pub fn get_foo(&self) -> BmpChar { /* can contain unsafe */ }
  pub fn get_bar(&self) -> BmpChar { /* can contain unsafe */ }
}

and it has a custom ULE impl to TrieValueForNormULE.

It's fine to convert the remaining two ZeroVecs to have deserialization-time validation, so long as it is linear-time validation.

sffc commented 1 year ago

Note: This is fine to implement because it does not change the postcard binary format; it only changes the runtime types, which we have established can be changed in minor releases.