unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org

Reconsider UTF-32 support #545

Open dpk opened 3 years ago

dpk commented 3 years ago

string_representation.md:

> The use of UTF-32 is rare enough that it's not worth supporting.

There is one significant use of UTF-32 in the real world: Python’s so-called ‘flexible string representation’. See PEP 393. The short version: Python internally stores strings as Latin-1 if they only contain characters ≤ U+00FF; as UTF-16 (guaranteed valid, fwiw) if they contain only characters in the BMP; or otherwise as UTF-32. This is intended to provide the most efficient representation for a majority of strings while retaining O(1) string indexing — it’s much like what the document says about what SpiderMonkey and V8 do, but since Python string indexing, unlike JS string indexing, returns real codepoints and not UTF-16 code units, it adds an extra upgrade to UTF-32.
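This representation is easy to observe from Python. A small demonstration (the exact byte counts include a per-object header and vary across CPython builds; the roughly 1-, 2-, and 4-bytes-per-character growth is the point):

    >>> import sys
    >>> sys.getsizeof('a' * 1000)           # all chars <= U+00FF: Latin-1 storage
    1049
    >>> sys.getsizeof('\u20ac' * 1000)      # BMP chars: 2-byte storage
    2074
    >>> sys.getsizeof('\U0001F600' * 1000)  # astral chars: 4-byte storage
    4076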

In the Scheme world, R7RS Large can reasonably be expected to require that codepoint indexing into strings (or some variant of strings — it’s possible we’ll end up with a string/text split like Haskell’s) be O(1), so I expect UTF-32 or Python-style flexible string representation to become common in that context, too.

(Also, before flexible string representation was introduced into Python, UTF-32 was used for all strings.)

sffc commented 3 years ago

CC @hsivonen

hsivonen commented 3 years ago

If there is interest in interfacing with Python on that level instead of going via UTF-8, I guess that's a use case, then.

Note: Python doesn't guarantee UTF-32 validity. The 32-bit-code-unit strings can contain lone surrogates, so if the use case is interfacing with Python without UTF-8 conversion, ICU4X would need to check each code unit for being in the range of Rust char instead of assuming validity.
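For instance, in CPython (this only demonstrates that lone surrogates are representable in str; strict codecs reject them on the way out):

    >>> s = '\ud800'        # a lone high surrogate is a legal str value
    >>> len(s)
    1
    >>> s.encode('utf-8')   # ...but it is not a Unicode scalar value
    Traceback (most recent call last):
      ...
    UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed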

dpk commented 3 years ago

I would hypothetically be interested in writing a Python binding once icu4x has a C API, as an alternative to the very un-Pythonic and under-documented PyICU. (I’m the author of the PyICU cheat sheet, which is afaik the only API documentation specific to ICU in Python — otherwise, you’re just referred to the C++ API and left to work out how it maps on to Python for yourself.)

ovalhub commented 3 years ago

Author of PyICU here: if you find PyICU very unpythonic, please provide concrete examples of how you're doing something with PyICU and how you'd suggest it be done instead in order to be more pythonic. I'm happy to either fix actual un-pythonic examples of PyICU use cases or show you how it's done. Please be very specific; there is already a lot of built-in Python iterator support, for example, that you may just not know about. It's ok to ask and suggest improvements (!)

I'm pretty sure that PyICU being very under-documented also contributes to your perceiving it as un-pythonic. I declared PyICU documentation bankruptcy over a decade ago, as the ICU API surface is huge and keeps growing. I cannot provide another set of docs; it's hard enough to keep up with ICU proper, and the ICU docs themselves are pretty good. This is open source, and the PyICU C++ wrappers around C/C++ ICU are fairly regular; I encourage you to read the code to see what is possible, what is supported, and how to use it.

ovalhub commented 3 years ago

For example, from your cheat sheet, you seem not to know that a BreakIterator is a Python iterator:

    >>> from icu import *
    >>> de_words = BreakIterator.createWordInstance(Locale('de_DE'))
    >>> de_words.setText('Bist du in der U-Bahn geboren?')
    >>> list(de_words)
    [4, 5, 7, 8, 10, 11, 14, 15, 16, 17, 21, 22, 29, 30]

Yes, I understand, you'd prefer the actual words to be returned, but that's not how ICU designed the BreakIterator: they chose to give you boundaries instead. It's not that hard, in Python, to then combine these boundaries into words, as in the sketch below. That being said, doing this consistently might become a lot of work, so adding a higher-level iterator that gives you the words would be nice too.
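A minimal sketch of that combining step, building on the iterator support shown above (skipping whitespace-only segments is one policy choice among several; ICU word breaking also reports boundaries around punctuation):

    from icu import BreakIterator, Locale

    def words(text, locale='de_DE'):
        # Yield the segments of `text` between ICU word boundaries,
        # skipping whitespace-only segments.
        bi = BreakIterator.createWordInstance(Locale(locale))
        bi.setText(text)
        start = bi.first()   # reset to the first boundary (0)
        for end in bi:       # iterating yields each following boundary
            segment = text[start:end]
            start = end
            if not segment.isspace():
                yield segment

    print(list(words('Bist du in der U-Bahn geboren?')))
    # ['Bist', 'du', 'in', 'der', 'U', '-', 'Bahn', 'geboren', '?']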

sffc commented 3 years ago

We will revisit string encodings as we approach ICU4X v1.

sffc commented 2 years ago

Discussion:

hsivonen commented 2 years ago

This should not be taken as an endorsement of UTF-32, but as a matter of how hard things would be for the collator specifically:

> Segmenter and Collator have fine-tuned code paths for UTF-8 and UTF-16, so it's not necessarily trivial to add UTF-32 support.

The collator and the decomposing normalizer consume an iterator over char internally (with errors mapped to U+FFFD), so adding UTF-32 support would be trivial. At compile time, there would be separate codegen instances for UTF-32, which would grow the binary size, but those instances should also be eligible to be thrown away by LTO when not used (except for FFI, there's currently a Rust symbol visibility issue standing in the way of cross-language LTO doing proper dead code analysis).
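In Python terms, the contract of that internal pipeline looks roughly like the generator below; this only illustrates the "iterator over char with errors mapped to U+FFFD" idea and is not ICU4X code:

    def chars_from_utf32(units):
        # Map 32-bit code units to characters, replacing surrogates and
        # out-of-range values with U+FFFD.
        for u in units:
            if u > 0x10FFFF or 0xD800 <= u <= 0xDFFF:
                yield '\ufffd'
            else:
                yield chr(u)

    print(''.join(chars_from_utf32([0x42, 0xD800, 0x1F600])))  # 'B\ufffd😀'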

hsivonen commented 1 year ago

Since most strings don't contain supplementary-plane characters, supporting UTF-32 wouldn't really help: if most Python strings were converted to UTF-32 at the ICU4X API boundary, they might as well be converted to UTF-8, unless indices are returned.

Indices are relevant to the segmenter. In that case, it might actually help Python to convert to UTF-32 and then segment that.
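The reason indices would work out: a UTF-32 code-unit index equals the Python string index, so boundaries computed on the UTF-32 form need no translation back. A small demonstration (utf-32-le is used to avoid a BOM; the emoji stands in for any supplementary-plane character):

    import struct

    s = 'U-Bahn \U0001F600'
    buf = s.encode('utf-32-le')      # 4 bytes per code point, no BOM
    assert len(buf) // 4 == len(s)   # one code unit per Python character

    i = s.index('\U0001F600')        # a Python string index...
    cp, = struct.unpack_from('<I', buf, 4 * i)
    assert chr(cp) == '\U0001F600'   # ...is also the UTF-32 code-unit index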

Other than that, the case where avoiding conversion to UTF-8 might make sense is the collator, which performs a lot of string reading without modification. However, to have the collator operate without having to create (converted) copies of the Python string data, there'd need to be 6 methods:

  1. Compare potentially-ill-formed UTF-32 and potentially-ill-formed UTF-32.
  2. Compare UCS-2 and UCS-2.
  3. Compare Latin-1 and Latin-1.
  4. Compare potentially-ill-formed UTF-32 and UCS-2.
  5. Compare potentially-ill-formed UTF-32 and Latin-1.
  6. Compare UCS-2 and Latin-1.

The remaining three of the nine cases are mirrors of the last three, so there is no point in generating code for them separately: a mirror case can swap its arguments and negate the result, as in the sketch below.
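A hypothetical Python-side dispatch over those six entry points could look like the following sketch. The icu4x module and its compare_* functions are invented names for illustration; no such bindings exist today. The storage_kind check mirrors how PEP 393 picks a representation:

    import icu4x  # hypothetical binding module; does not exist

    def storage_kind(s: str) -> int:
        # Width in bytes of CPython's internal storage for s: 1, 2, or 4.
        widest = max(map(ord, s), default=0)
        return 1 if widest <= 0xFF else 2 if widest <= 0xFFFF else 4

    COMPARE = {  # the six generated specializations (hypothetical names)
        (4, 4): icu4x.compare_utf32_utf32,
        (2, 2): icu4x.compare_ucs2_ucs2,
        (1, 1): icu4x.compare_latin1_latin1,
        (4, 2): icu4x.compare_utf32_ucs2,
        (4, 1): icu4x.compare_utf32_latin1,
        (2, 1): icu4x.compare_ucs2_latin1,
    }

    def collate(a: str, b: str) -> int:
        ka, kb = storage_kind(a), storage_kind(b)
        if ka < kb:
            return -collate(b, a)  # mirror case: swap and negate
        return COMPARE[(ka, kb)](a, b)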

Note that a surrogate pair in a Python string has the semantics of two separate surrogate code points; the pair does not acquire supplementary-plane semantics, as demonstrated below. I haven't checked whether surrogates promote to 32-bit code units or whether the 16-bit-code-unit representation can contain surrogates that don't have UTF-16 semantics. That is, it's unclear to me whether item 2 can reuse the UTF-16-to-UTF-16 comparison specialization.
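This much is easy to confirm from Python; the pair below stays two code points and never collapses into the single character it would denote in UTF-16:

    >>> pair = '\ud83d\ude00'   # high surrogate + low surrogate
    >>> real = '\U0001F600'     # the supplementary-plane character itself
    >>> len(pair), len(real)
    (2, 1)
    >>> pair == real
    False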

Note that the raw Python data representation is available via PyO3 "on non-Py_LIMITED_API and little-endian only".

If someone really cares, it would make sense to benchmark the collator with these 6 variants (out-of-repo) vs. converting to UTF-8 and then using the &str-to-&str comparison.