Open privat opened 9 years ago
I guess if we are not Unicode-compliant for v1, we should at least prepare for its support.
As for the Unicode POCs in `string_experimentations`, they are far from shippable: a lot of compiler work is still needed to support Unicode, and there are performance issues. We need to assess how much slower we will be with built-in UTF-8 support (probably not by much, but there will be a slowdown).
Besides, the POCs in `string_experimentations` rely heavily on FFI, which as of today is still unsupported in the bootstrap step. We need to migrate the methods and structure of UnicodeChar to intern if we want it to work within the compiler. This will also be welcome from a performance point of view.
On the subject of characters: since the current definition of a `Char` in Nit is equivalent to a C char, I suggest we finally rename it to `Byte` for clarity.
If we do support Unicode or any other international encoding in the near future, the definition of a `Char` will have to change drastically.
Full support, as you mentioned, is however a lot more work than we can handle before the release. I agree on basic support, but I do not think it would really be a problem if v1 shipped without a capitalize that works on UTF-8; Ruby seems to get by just fine with that limitation.
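To make the capitalize limitation concrete, here is a small sketch in Python (not Nit, and not a proposal for the Nit API). It assumes Python's `bytes.capitalize`, which only touches ASCII bytes, as a stand-in for a byte-based `Char`, and `str.capitalize` as a stand-in for a Unicode-aware one.

```python
# Illustration only: Python stands in for the two possible behaviours.
name = "\u00e9tienne"  # "étienne", with a precomposed é (U+00E9)

# Unicode-aware capitalization works on code points.
unicode_aware = name.capitalize()

# Byte-wise capitalization only knows ASCII: the lead byte of "é" (0xC3)
# is passed through unchanged, so the string is not capitalized at all.
byte_wise = name.encode("utf-8").capitalize().decode("utf-8")

print(unicode_aware)  # Étienne
print(byte_wise)      # étienne
```

The byte-wise result stays valid UTF-8 and is simply left uncapitalized, which is the kind of degraded but harmless behaviour a v1 could probably live with.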
But if we could at least support proper indexed access, iteration, and modification of UTF-8 Strings, the Char operations can always be added later, as long as the base infrastructure is available.
Normalization and proper locale support might come later as minor or major updates to the language.
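As a rough idea of what the base infrastructure for iteration involves, here is a minimal UTF-8 decoding loop sketched in Python. The function name `iter_code_points` is hypothetical. The sketch assumes well-formed input apart from the basic checks shown, and does not reject overlong encodings or surrogates.

```python
def iter_code_points(data: bytes):
    """Yield (byte_offset, code_point) pairs from UTF-8 encoded bytes."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:              # 0xxxxxxx: 1-byte sequence (ASCII)
            length, cp = 1, lead
        elif lead >> 5 == 0b110:     # 110xxxxx: 2-byte sequence
            length, cp = 2, lead & 0x1F
        elif lead >> 4 == 0b1110:    # 1110xxxx: 3-byte sequence
            length, cp = 3, lead & 0x0F
        elif lead >> 3 == 0b11110:   # 11110xxx: 4-byte sequence
            length, cp = 4, lead & 0x07
        else:
            raise ValueError(f"invalid UTF-8 lead byte at offset {i}")
        if i + length > len(data):
            raise ValueError(f"truncated sequence at offset {i}")
        for b in data[i + 1:i + length]:
            if b >> 6 != 0b10:       # continuation bytes are 10xxxxxx
                raise ValueError(f"invalid continuation byte after offset {i}")
            cp = (cp << 6) | (b & 0x3F)
        yield i, cp
        i += length


# "a", "é", "€" and an emoji take 1, 2, 3 and 4 bytes respectively.
sample = "a\u00e9\u20ac\U0001F600".encode("utf-8")
for offset, cp in iter_code_points(sample):
    print(offset, hex(cp))
```

Each step only needs the lead byte to know how far to jump, which is why forward iteration is cheap even though random access is not.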
From a memory-consumption POV, I'd say we drop the `utf8_with_index` variant: it is heavier than UTF-32, what was I thinking?!
For the Char/Byte issue, I think that Char must mean Unicode character (code point), and that Byte is a separate issue related to #1267.
Text must be made of Char, and Char::ascii should be renamed `code_point` or something.
I do not know all the places where Char is used with a Byte meaning (and Text used as a byte sequence), but I do not think there are many. Maybe a `grep Char` could help.
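The distinction between the two meanings can be sketched in Python, where `str` is a sequence of code points and `bytes` a sequence of bytes. This is only an analogy for the proposed Char/Byte split, not Nit code; `ord` plays the role of the suggested `code_point`.

```python
# Illustration only: str ~ Text made of Char, bytes ~ sequence of Byte.
text = "\u00e9tienne"        # "étienne"
raw = text.encode("utf-8")   # the same text viewed as a byte sequence

char_count = len(text)       # number of code points
byte_count = len(raw)        # number of bytes; "é" takes two in UTF-8

first_char = text[0]         # "é"
first_code_point = ord(first_char)  # what a `code_point` accessor would return
first_byte = raw[0]          # 0xC3: only the lead byte of "é", not a character

print(char_count, byte_count, hex(first_code_point), hex(first_byte))
```

Any place in existing code that really wants `first_byte` semantics is a Byte use; anything that wants `first_char` is a Char use.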
Can you (@R4PaSs) draw up a plan for basic Unicode support (elementary steps, in order)? If I remember correctly, you already reviewed the basic Unicode solutions used in other languages.
Yes, I did review other languages, however what I found was far from perfect nearly every time.
Here's the summary:
So, what should we do? I guess going with UTF-8 strings is a good idea. UTF-16 is wasteful in too many cases (the only case where it wins is the representation of basic Asian CJK characters), and it keeps the disadvantage of UTF-8, namely linear-time access to a single char. UTF-32 might be worth keeping in mind for special cases like heavy single-char manipulation with a lot of non-local access (very rare, though we might offer this alternative for these cases in the future). We could keep it in mind for heavy in-place modification too (I'm looking at you, FlatBuffer), since replacing a char will not necessarily be constant-time anymore (replacing an é with an e, for instance, will require shifting the rest of the buffer 1 byte to the left with UTF-8).
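The size trade-off and the shifting cost can be checked with a few lines of Python. This is only an illustration: `encoded_sizes` is a hypothetical helper, the little-endian codecs are used to avoid counting a BOM, and a `bytearray` stands in for a mutable buffer such as FlatBuffer.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of text in each encoding (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


ascii_sizes = encoded_sizes("hello")             # ASCII: UTF-8 is smallest
cjk_sizes = encoded_sizes("\u65e5\u672c\u8a9e")  # "日本語": UTF-16 is smallest
print(ascii_sizes, cjk_sizes)

# In-place replacement in a UTF-8 buffer: "é" is 2 bytes and "e" is 1 byte,
# so every byte after the replaced character moves one position to the left.
buf = bytearray("caf\u00e9!".encode("utf-8"))    # "café!", 6 bytes
pos = buf.index("\u00e9".encode("utf-8"))        # byte offset of "é"
length_before = len(buf)
buf[pos:pos + 2] = b"e"                          # resizing slice assignment
length_after = len(buf)
print(pos, length_before, length_after, bytes(buf))
```

With UTF-32 the same replacement would overwrite one fixed-width slot and leave the rest of the buffer untouched, at the price of 4 bytes per character.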
I kinda like the Go model, and I think we should take inspiration from it, though maybe not on the Rune side of naming which, as badass as it may sound, is still less clear than the usual Char. Indexed access is linear, but with a cache system it can be constant (amortized), since most accesses are close to each other. Adding a little more fool-proofing is probably a good idea too, since we're definitely aiming at higher-level programmers; forcing strings to be UTF-8 is a good idea, I think. Maybe go even further in the future and force a normalization form too?
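One possible shape for the cache idea, sketched in Python under assumptions of my own: a single cached (index, byte offset) pair, walking forward or backward from it, and falling back to the start of the string when that is closer. The class name `CachedUtf8String` and the `steps` counter are hypothetical and exist only to make the cost visible.

```python
class CachedUtf8String:
    """UTF-8 bytes with code-point indexing and a one-entry position cache."""

    def __init__(self, text: str):
        self._bytes = text.encode("utf-8")
        self._index = 0    # code-point index of the cached position
        self._offset = 0   # byte offset where that code point starts
        self.steps = 0     # instrumentation: code points walked so far

    def _char_len(self, offset: int) -> int:
        # Length of the sequence starting at offset, from its lead byte.
        lead = self._bytes[offset]
        if lead < 0x80:
            return 1
        if lead < 0xE0:
            return 2
        if lead < 0xF0:
            return 3
        return 4

    def __getitem__(self, index: int) -> str:
        if index < 0:
            raise IndexError(index)
        i, off = self._index, self._offset
        if index < i - index:          # closer to the start than to the cache
            i, off = 0, 0
        while i < index:               # walk forward one code point at a time
            if off >= len(self._bytes):
                raise IndexError(index)
            off += self._char_len(off)
            i += 1
            self.steps += 1
        while i > index:               # walk backward, skipping 10xxxxxx bytes
            off -= 1
            while self._bytes[off] & 0xC0 == 0x80:
                off -= 1
            i -= 1
            self.steps += 1
        if off >= len(self._bytes):
            raise IndexError(index)
        self._index, self._offset = i, off
        return self._bytes[off:off + self._char_len(off)].decode("utf-8")


s = CachedUtf8String("h\u00e9llo w\u00f6rld")   # "héllo wörld", 11 code points
scanned = [s[i] for i in range(11)]             # sequential scan
steps_after_scan = s.steps                      # one step per access after the first
print("".join(scanned), steps_after_scan)
```

A sequential scan costs one step per access, so the common loop-over-indices pattern stays linear overall; a truly random access pattern still degrades to a linear walk per access, which is where a UTF-32 alternative would help.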
I guess we can do it in a few steps for the v1:
In the future:
Good job @R4PaSs . I'm ecstatic to hear more about the following development.
What level of Unicode support must Nit provide?
Various proposals exist in `lib/string_experimentations/`; how shippable are they?
For Unicode I see, at least, some related issues:
I do not think that we need full, complex Unicode support yet. A minimal goal could be to ensure that UTF-8 is supported and that `"étienne".capitalize == "Étienne"`.