Wrapper of ICU's UTF-16 encoded strings and conversion

nitlang / nit

Nit language

http://nitlanguage.org

Apache License 2.0

Wrapper of ICU's UTF-16 encoded strings and conversion #2773

Closed kugelbltz closed 5 years ago

kugelbltz commented 5 years ago

`u16_string` module

This module is meant to ease the use of the complex string operations provided by the ICU library. It provides a wrapper for ICU's string structure, UChar *, as well as conversion functions to and from String and CString.

Remarks

In order to convert a String to a U16String, the string must be converted to a CString first. Since CStrings are null-terminated, U16Strings also have to be null-terminated and cannot contain embedded null characters.
I am having some issues with DocUnit blocks, so there are no tests in the comments at the moment.
I added another new operator to the CString class, which only returns a null string.
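To make the first remark concrete, here is a minimal C sketch of why a terminator-based length cannot represent an embedded U+0000, while a buffer that carries its own length can. It does not use ICU: `u16_unit` is a hypothetical stand-in for ICU's UChar, and `u16_terminated_length` and `u16_view` are names invented for this illustration, not part of the module.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for ICU's UChar: one 16-bit UTF-16 code unit. */
typedef uint16_t u16_unit;

/* Length found by scanning for the terminator, the way a
 * null-terminated API would measure the string. */
size_t u16_terminated_length(const u16_unit *s) {
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0) {
        n++;
    }
    return n;
}

/* Hypothetical view that stores the length explicitly, so a U+0000
 * code unit is ordinary data instead of an end marker. */
typedef struct {
    const u16_unit *units;
    size_t code_units;
} u16_view;
```

With the units `a b U+0000 c d`, the scan stops after two code units, whereas a `u16_view` built with `code_units = 5` still reaches `c` and `d`.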

lbajolet commented 5 years ago

Note: I have one question: do the UChar* strings from ICU need to be 0-terminated? If not, we could support strings that have a null byte in their contents (this is legal in Nit) when we start from any Text subclass.

kugelbltz commented 5 years ago

@R4PaSs No, they do not have to be 0-terminated in ICU. But since I am working with the Nit FFI, I still have to convert strings into char * at some point to use the library.

kugelbltz commented 5 years ago

First, thank you @R4PaSs @Morriar for your feedback. I have modified the module to take your suggestions into account. The U16String class is now a subclass of Text, so I redefined the chars function, which uses the char_at_offset function of UCharString. The latter returns a UTF-32 character (UChar32 in ICU), which made a separate U16Char class unnecessary. I also decided to scrap the []= function, as it is not needed for the modules to come.
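For readers unfamiliar with why a UTF-32 return value removes the need for a 16-bit character class, here is a rough C sketch of decoding the code point that starts at a given code-unit offset. It is not the module's char_at_offset nor ICU's implementation: `u16_code_point_at` is a hypothetical name, `uint16_t` and `uint32_t` stand in for UChar and UChar32, and returning U+FFFD for an unpaired surrogate is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode the code point starting at code-unit index i of a UTF-16
 * buffer of len code units. A lead/trail surrogate pair is combined
 * into one supplementary code point. An unpaired surrogate yields
 * U+FFFD (a choice made for this sketch only). */
uint32_t u16_code_point_at(const uint16_t *s, size_t len, size_t i) {
    assert(s != NULL && i < len);
    uint16_t hi = s[i];
    if (hi >= 0xD800 && hi <= 0xDBFF) {          /* lead surrogate */
        if (i + 1 < len) {
            uint16_t lo = s[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {  /* trail surrogate */
                return 0x10000u
                     + (((uint32_t)hi - 0xD800u) << 10)
                     + ((uint32_t)lo - 0xDC00u);
            }
        }
        return 0xFFFDu;                          /* lead without trail */
    }
    if (hi >= 0xDC00 && hi <= 0xDFFF) {
        return 0xFFFDu;                          /* trail without lead */
    }
    return hi;                                   /* BMP code point */
}
```

Because the result is always a full code point, callers never have to handle half of a surrogate pair as if it were a character.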

kugelbltz commented 5 years ago

I have figured out how to deal with embedded \0 characters in strings and tried to clear up some confusion between the capacity and code_units attributes. There are 3 new private classes: U16StringCharView, U16StringCharReverseIterator and U16StringCharIterator, which are meant to be used by the U16String.chars function. They are basically copies of the same classes in the flat module, as I thought that was the right way to do it.
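As an illustration of the distinction between capacity and code_units, and of what forward and reverse character iteration over UTF-16 involves, here is a small C sketch. The struct layout and the names `u16_buffer`, `u16_next` and `u16_prev` are assumptions made for this example, not the module's actual classes. Unpaired surrogates are passed through unchanged, which is also just a choice of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical buffer: capacity is what was allocated, code_units is
 * what is in use. Only code_units bounds iteration, so an embedded
 * U+0000 does not end the string. */
typedef struct {
    uint16_t *units;
    size_t code_units;  /* code units currently in use */
    size_t capacity;    /* code units allocated, >= code_units */
} u16_buffer;

static int is_lead(uint16_t u)  { return u >= 0xD800 && u <= 0xDBFF; }
static int is_trail(uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

static uint32_t combine(uint16_t hi, uint16_t lo) {
    return 0x10000u + (((uint32_t)hi - 0xD800u) << 10)
                    + ((uint32_t)lo - 0xDC00u);
}

/* Forward step: return the code point at *i and advance *i past it. */
uint32_t u16_next(const u16_buffer *b, size_t *i) {
    assert(b != NULL && i != NULL && *i < b->code_units);
    uint16_t u = b->units[(*i)++];
    if (is_lead(u) && *i < b->code_units && is_trail(b->units[*i])) {
        return combine(u, b->units[(*i)++]);
    }
    return u;
}

/* Reverse step: move *i back to the start of the previous code point
 * and return that code point. */
uint32_t u16_prev(const u16_buffer *b, size_t *i) {
    assert(b != NULL && i != NULL && *i > 0 && *i <= b->code_units);
    uint16_t u = b->units[--(*i)];
    if (is_trail(u) && *i > 0 && is_lead(b->units[*i - 1])) {
        return combine(b->units[--(*i)], u);
    }
    return u;
}
```

A forward iterator would start at index 0 and call `u16_next` until the index reaches `code_units`; a reverse iterator would start at `code_units` and call `u16_prev` until the index reaches 0.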

kugelbltz commented 5 years ago

@Morriar @R4PaSs Do you think that the latest version is okay?
