Status of unicode - Githubissues

privat commented 9 years ago

What level of Unicode support must Nit provide?

Various proposals exist in lib/string_experimentations/; how shippable are they?

For Unicode I see, at least, these related issues:

literal strings in Nit source files
the reading of text files (byte vs. char)
Unicode-aware Char
manipulation of code points in strings

I do not think that we need full, complex Unicode support yet. A minimal goal could be to ensure that UTF-8 is supported and that "étienne".capitalize == "Étienne".
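To make the byte vs. char distinction and the capitalize goal concrete, here is a small sketch in Python 3 (not Nit; Python is used only because its `str` type is a sequence of code points and its `bytes` type is a sequence of raw bytes). The variable names are illustrative.

```python
# Illustrative only: compare code-point-aware and byte-wise capitalization.
name = "étienne"
utf8 = name.encode("utf-8")

chars = len(name)    # number of code points
octets = len(utf8)   # number of bytes: "é" takes two bytes in UTF-8

# str.capitalize() works on code points and knows about non-ASCII letters.
unicode_cap = name.capitalize()

# bytes.capitalize() only knows ASCII letters, so the leading byte of "é"
# is passed through unchanged and nothing gets capitalized.
bytewise_cap = utf8.capitalize().decode("utf-8")

print(chars, octets)
print(unicode_cap, bytewise_cap)
```

A Char type that is really a byte behaves like the second case, which is why the capitalize test is a handy minimal check.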

lbajolet commented 9 years ago

I guess if we are not Unicode-compliant for v1, we should at least prepare for its support.

The Unicode POCs in string_experimentations are far from shippable: a lot of work needs to be done on the compiler to support Unicode, and there are performance issues. We need to assess how much slower we'll be with built-in UTF-8 support (probably not that much, but there will be a slowdown).

Besides, the POCs from string_experimentations rely heavily on the FFI, which as of today is still unsupported in the bootstrap step. We need to migrate the methods and structure of UnicodeChar to intern if we're to have it work within the compiler. This will also help from a performance point of view.

On the subject of characters, since the current definition of a Char in Nit is equivalent to a C char, I suggest we finally rename it to Byte for clarity.

If we do support Unicode or any other international encoding in the near future, the definition of a Char will have to change drastically.

Full support, as you mentioned, is a lot more work than we can handle before the release. I agree on the basic support; however, if we do not support capitalize with UTF-8 for v1, that would not really be a problem I think. Ruby seems to live just fine without it.

But we should at least support proper indexed access, iteration and modification of UTF-8 Strings; the Char operations can always be added later, as long as the base infrastructure is available.
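As a rough idea of what "iteration over a UTF-8 String" involves at the byte level, here is a minimal Python sketch of a code-point iterator. The function name `iter_code_points` is hypothetical, and the sketch assumes well-formed UTF-8: it does not reject overlong encodings, stray continuation bytes or truncated sequences, which a real implementation would have to handle.

```python
def iter_code_points(data: bytes):
    """Yield the code points of a well-formed UTF-8 byte sequence."""
    i = 0
    while i < len(data):
        lead = data[i]
        # The lead byte tells how many bytes the sequence uses.
        if lead < 0x80:
            size, cp = 1, lead
        elif lead >> 5 == 0b110:
            size, cp = 2, lead & 0x1F
        elif lead >> 4 == 0b1110:
            size, cp = 3, lead & 0x0F
        elif lead >> 3 == 0b11110:
            size, cp = 4, lead & 0x07
        else:
            raise ValueError("unexpected byte at offset %d" % i)
        # Each continuation byte contributes its 6 low bits.
        for b in data[i + 1:i + size]:
            cp = (cp << 6) | (b & 0x3F)
        yield cp
        i += size


decoded = list(iter_code_points("é日😀a".encode("utf-8")))
print([hex(cp) for cp in decoded])
```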

Normalization and proper locale support might come later as minor or major updates to the language.

From a memory-consumption POV, I'd say we drop the utf8_with_index variant: it is heavier than UTF-32, what was I thinking?!

privat commented 9 years ago

For the Char/Byte issue, I think that Char must mean Unicode character (code point), and Byte is an issue related to #1267. Text must be made of Char, and Char::ascii should be renamed code_point or something. I do not know where Char is used with a Byte meaning (and Text used as a byte sequence), but I do not think there are many such places. Maybe a grep for Char could help.

Can you (@R4PaSs) draw up a plan for basic Unicode support (elementary steps, in order)? If I remember correctly, you already reviewed basic Unicode solutions in other languages.

lbajolet commented 9 years ago

Yes, I did review other languages; however, what I found was far from perfect nearly every time.

Here's the summary:

Java/C# (I guess there's not much to spoil): it's bad. As in really, really bad. They treat characters as UTF-16 code units without any regard for surrogate pairs, which are left to the end user to handle. Most functions do not take these into account either => extreme pain when localizing apps. Once again, the example to avoid copying at all costs.
Ruby is a particular case: every String carries its own encoding, so it's not really what we want to do, is it?
Python, in ye olde tymes, was using UTF-16 (correctly, from what I've gathered), but since PEP 393 they have introduced a pre-processing step in their handling of strings, storing each string in the narrowest fixed-width format that fits the biggest character in the whole string. I.e. if we place a character outside Latin-1 in a US-ASCII string, the whole string is automatically converted to a 2-byte representation (BMP only).
Go defines its strings as not required to be UTF-8, though every source file is UTF-8; it's up to the programmer to choose whether or not to convert and normalize a string to UTF-8. Their chars are affectionately named Runes, which is pretty badass if you ask me. And the language is pretty much Unicode-clean in itself, which is a good thing.
Haskell: their characters are UTF-8; you need to convert the input strings to this format for proper support.
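The surrogate-pair complaint about UTF-16 code units can be illustrated with a short Python sketch (Python is used only as a neutral tool here; it is not how Java or C# expose the problem). A single code point outside the BMP occupies two 16-bit code units, so any API that counts or indexes code units sees two "characters".

```python
import struct

smiley = "\U0001F600"  # one code point outside the BMP
utf16 = smiley.encode("utf-16-le")

# Split the UTF-16 bytes into 16-bit code units.
code_units = struct.unpack("<%dH" % (len(utf16) // 2), utf16)

code_point_count = len(smiley)
code_unit_count = len(code_units)

print(code_point_count, code_unit_count, [hex(u) for u in code_units])
```

The two code units are a high and a low surrogate; neither is a valid character on its own, which is what makes code-unit-based indexing and slicing error-prone.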

So, what should we do? I guess going with UTF-8 strings is a good idea. UTF-16 is a bit too much of a waste in too many cases (the only case it wins is the representation of basic Asian CJK characters), and it keeps the disadvantage of UTF-8, that is, linear access to a single char. UTF-32 might be interesting to keep in mind for special cases like heavy single-char manipulation with a lot of non-local accesses (very rare, though we might offer this alternative for these cases in the future). We could keep it in mind for heavy in-place modification too (I'm looking at you, FlatBuffer), since replacing a char will not necessarily be constant-time anymore (replacing an é with an e, for instance, requires shifting the rest of the buffer 1 byte to the left with UTF-8).
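The size trade-off and the shifting cost can be checked with a few lines of Python. This is only a sketch: the helper name `sizes` and the sample strings are made up for illustration, and the little-endian codecs are used so that no byte-order mark is counted.

```python
def sizes(text):
    """Encoded size in bytes for each candidate encoding."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}


ascii_sizes = sizes("hello")     # UTF-8 is the smallest
latin_sizes = sizes("étienne")   # UTF-8 is still the smallest
cjk_sizes = sizes("日本語")       # the one case where UTF-16 wins

# Replacing "é" (2 bytes) with "e" (1 byte) in a UTF-8 buffer changes its
# length, so every byte after the replaced char has to move.
buf = bytearray("été".encode("utf-8"))
before = len(buf)
buf[0:2] = b"e"
after = len(buf)
result = buf.decode("utf-8")

print(ascii_sizes, latin_sizes, cjk_sizes)
print(before, after, result)
```

With a fixed-width encoding such as UTF-32 the same replacement would overwrite one slot in place, which is the argument for keeping it around for modification-heavy buffers.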

I kinda like the Go model, and I think we should take inspiration from it, though maybe not on the Rune side of naming which, as badass as it may sound, is still less clear than the usual Char. Indexed access is linear, but with a cache system it can be constant (amortized), since most accesses are close to each other. Maybe adding a little more fool-proofing is a good idea too, since we're definitely aiming at higher-level programmers; forcing strings to be UTF-8 is a good idea, I think. Maybe go even further in the future and force a normalization form too?
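One possible shape for the cache idea, sketched in Python. Everything here is hypothetical (the class name `CachedUtf8`, the single-entry cache, the `steps` counter) and is not the Nit implementation; it only shows that remembering the last (char index, byte offset) pair makes nearby accesses cheap. The sketch assumes the bytes are well-formed UTF-8.

```python
class CachedUtf8:
    """UTF-8 bytes with code-point indexing and a one-entry position cache."""

    def __init__(self, text):
        self.data = text.encode("utf-8")
        self.length = len(text)  # number of code points
        self._char = 0           # cached code-point index
        self._byte = 0           # byte offset of that code point
        self.steps = 0           # code points walked, for demonstration only

    @staticmethod
    def _is_continuation(b):
        return b & 0xC0 == 0x80

    def _seek(self, index):
        # Restart from the beginning if that is closer than the cached position.
        if index < abs(index - self._char):
            self._char, self._byte = 0, 0
        while self._char < index:  # walk forward one code point at a time
            self._byte += 1
            while (self._byte < len(self.data)
                   and self._is_continuation(self.data[self._byte])):
                self._byte += 1
            self._char += 1
            self.steps += 1
        while self._char > index:  # walk backward one code point at a time
            self._byte -= 1
            while self._is_continuation(self.data[self._byte]):
                self._byte -= 1
            self._char -= 1
            self.steps += 1
        return self._byte

    def __getitem__(self, index):
        if not 0 <= index < self.length:
            raise IndexError(index)
        start = self._seek(index)
        end = start + 1
        while end < len(self.data) and self._is_continuation(self.data[end]):
            end += 1
        return self.data[start:end].decode("utf-8")


text = "héllo wörld 日本語"
s = CachedUtf8(text)
chars = [s[i] for i in range(s.length)]  # sequential access: one step per char
print(chars, s.steps)
```

Sequential or nearly sequential access costs about one step per call, while a jump to a distant index still costs a linear walk, so the "constant" claim only holds for local access patterns.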

I guess we can do it in a few steps for the v1:

[x] Add the support of Unicode into stdlib and the compiler, that means having operations ready-to-use for UTF-8 strings
[x] Solve the Byte as Char issue #1267
[ ] Rename ASCII-related services and replace them with proper Unicode support
[x] Migrate the current services working on Bytes to Unicode-compliant services

In the future:

[ ] Support for codecs converting other encodings to UTF-8
[ ] Normalization forms (choosing a default one might be a good idea)
[ ] Proper handling of semantic and canonical equivalence forms
[ ] Locale support (that includes stuff like toUpper, toLower and such, and yes, it is MUCH harder to do properly than it looks)
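To show why normalization and case mapping are listed as separate, harder steps, here is a small Python sketch using the standard `unicodedata` module. It is only an illustration of the Unicode behaviour, not a proposal for the Nit API.

```python
import unicodedata

composed = "\u00e9"     # "é" as a single precomposed code point
decomposed = "e\u0301"  # "e" followed by a combining acute accent

# Both render as "é", yet they differ code point by code point.
equal_raw = composed == decomposed

# After normalizing both sides to the same form they compare equal.
equal_nfc = unicodedata.normalize("NFC", decomposed) == composed
equal_nfd = unicodedata.normalize("NFD", composed) == decomposed

# Case mapping is not one-to-one: uppercasing "ß" yields two characters.
upper_sharp_s = "ß".upper()

print(equal_raw, equal_nfc, equal_nfd, upper_sharp_s)
```

Locale rules come on top of this: Turkish, for example, expects the uppercase of "i" to be the dotted "İ", which a locale-independent mapping cannot produce.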

privat commented 9 years ago

Good job @R4PaSs . I'm ecstatic to hear more about the following development.

nitlang / nit

Status of unicode #1262