open-i18n / rust-unic

UNIC: Unicode and Internationalization Crates for Rust
https://crates.io/crates/unic

[ucd] Make FromStr follow UAX44-LM3 for char props #121

Open CAD97 opened 7 years ago

CAD97 commented 7 years ago

Rust names for aliases defined in the Unicode Character Database will be consistent with the formal long aliases under UAX44-LM3. This is an invariant and helps for API discovery and navigability.

Rust names will follow Rust naming conventions. This is an invariant and helps for API discovery and navigability.

UCD aliases for properties are given by PropertyAliases.txt. UCD aliases for property values are given by PropertyValueAliases.txt.

The question, then, is how to derive the Rust name from the long alias.

99.9% of long aliases in PropertyValueAliases.txt are of the form Long_Name. For those, it is clear that the algorithm for turning a long alias into a Rust name is just to remove the underscores, so that Long_Name becomes LongName.
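As a rough illustration, the simple mapping could be sketched as follows. This assumes the mapping is nothing more than dropping underscores from the long alias; `rust_name_from_long_alias` is a hypothetical helper name used only for this example, not part of UNIC.

```rust
/// Hypothetical helper (not a UNIC API): derive a Rust-style name from a
/// UCD long alias by removing every underscore. Case is left untouched,
/// so the result is only as CamelCase as the alias itself.
fn rust_name_from_long_alias(long_alias: &str) -> String {
    long_alias.chars().filter(|&c| c != '_').collect()
}
```

Under that assumption, `rust_name_from_long_alias("Decomposition_Type")` yields `DecompositionType`, while an alias written without underscores, such as `Nobreak`, passes through unchanged.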

The remaining 0.1% is Decomposition_Type=Nobreak (dt=Nb).

If we apply the above algorithm, we get DecompositionType::Nobreak. However, it might be more in line with Rust API guidelines to name it DecompositionType::NoBreak, which is still equivalent under UAX44-LM3 (or even the subset of just case insensitivity).
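To make the equivalence concrete, here is a minimal sketch of a `FromStr` implementation that compares names loosely in the spirit of UAX44-LM3. It is a simplification: it ignores case, whitespace, underscores, and hyphens, and it does not handle the rule's treatment of an initial "is" prefix. The two-variant `DecompositionType` enum and the `loose_key` helper are hypothetical stand-ins for illustration, not UNIC's actual definitions.

```rust
use std::str::FromStr;

/// Hypothetical, cut-down stand-in for a property-value enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DecompositionType {
    Canonical,
    NoBreak,
}

/// Simplified loose-matching key: drop whitespace, '_' and '-', then
/// lowercase. The "is" prefix part of UAX44-LM3 is deliberately omitted.
fn loose_key(name: &str) -> String {
    name.chars()
        .filter(|c| !c.is_whitespace() && *c != '_' && *c != '-')
        .flat_map(|c| c.to_lowercase())
        .collect()
}

impl FromStr for DecompositionType {
    type Err = ();

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Match on the normalized key so that spelling variants of the
        // long and short aliases all resolve to the same variant.
        match loose_key(s).as_str() {
            "canonical" | "can" => Ok(DecompositionType::Canonical),
            "nobreak" | "nb" => Ok(DecompositionType::NoBreak),
            _ => Err(()),
        }
    }
}
```

With this sketch, `"Nobreak"`, `"NoBreak"`, `"no-break"`, and `"Nb"` all parse to `DecompositionType::NoBreak`, so the casing chosen for the Rust variant name does not affect which strings are accepted.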

Do we allow this less strict transformation between the formal long alias and the Rust name, or do we stick to the simple mapping?

CAD97 commented 7 years ago

I prefer DecompositionType::NoBreak. To me, the type is No-break, as it is described in the informal description. It should therefore be broken into the two parts No and Break; ergo, NoBreak.

To add some weight to my opinion, I was first exposed to the decomposition types through UAX44 Table 14. Compatibility Formatting Tags, where it is presented as <noBreak>. I therefore looked for NoBreak rather than Nobreak.

behnam commented 7 years ago

Yeah, totally makes sense. We can update the docs based on this discussion, and even note this special case there as one example of improving casing while staying conformant to LM3.