Change internal representation to UTF-8

foxik commented 9 years ago

Currently, we use UCS-2 as the internal encoding, which prevents us from using Unicode characters outside the BMP.

We should change the internal representation; the current plan is to use UTF-8:

we will use char and string datatypes
input and output will be in UTF-8 (as it is today)
tokenizer will work on input UTF-8 string and the created tokens will be pointers to the original text
lexicon will contain words in UTF-8 (and transitively language models and morphology will use UTF-8)
error model will be in UTF-8, i.e. it will contain variable-length strings instead of pairs or triples of Unicode characters
SimWordsFinder::Find will have to interpret the UTF-8 encoding and handle the fact that one Unicode character can be represented by multiple code units. The input word could be converted to UTF-32, but I do not think that will be needed, because both the lexicon and the error model will be in UTF-8
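To make the tokenizer and sequential-access points concrete, here is a minimal sketch, not Korektor's actual code. The names `Token`, `split_on_spaces` and `decode_next` are hypothetical, the input is assumed to be well-formed UTF-8, and splitting on ASCII spaces stands in for the real tokenization rules.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token type: points into the original text, copies nothing.
struct Token {
    const char* data;   // first byte of the token inside the original buffer
    std::size_t size;   // length in bytes (UTF-8 code units), not characters
};

// Hypothetical tokenizer: splits on ASCII spaces only, for illustration.
std::vector<Token> split_on_spaces(const std::string& text) {
    std::vector<Token> tokens;
    std::size_t start = 0;
    while (start < text.size()) {
        std::size_t end = text.find(' ', start);
        if (end == std::string::npos) end = text.size();
        if (end > start) tokens.push_back(Token{text.data() + start, end - start});
        start = end + 1;
    }
    return tokens;
}

// Decodes the code point starting at data[pos] and advances pos past it.
// Assumes well-formed UTF-8; no validation is performed.
char32_t decode_next(const char* data, std::size_t& pos) {
    unsigned char lead = static_cast<unsigned char>(data[pos++]);
    if (lead < 0x80) return lead;                     // single byte: 0xxxxxxx
    int extra = lead >= 0xF0 ? 3 : lead >= 0xE0 ? 2 : 1;
    char32_t cp = lead & (0x3F >> extra);             // payload bits of the lead byte
    while (extra-- > 0)
        cp = (cp << 6) | (static_cast<unsigned char>(data[pos++]) & 0x3F);
    return cp;
}

// Walks a token front to back, one code point at a time.
std::size_t count_code_points(const Token& token) {
    std::size_t count = 0;
    for (std::size_t pos = 0; pos < token.size; ++count) decode_next(token.data, pos);
    return count;
}
```

The tokens stay valid only as long as the original string is alive and unmodified, which is the usual trade-off of pointing into the input instead of copying it.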

The alternative to UTF-8 is to use UTF-32, but

using UTF-8 is a standard solution; it is used in Python/Perl (Python, for example, used UTF-16/UTF-32 at some point in the past)
the UTF-8 representation is much more compact (1 byte per ASCII character instead of 4)
even though UTF-8 does not allow constant-time random access, we only access word characters sequentially in Korektor; moreover, we can always perform UTF-8 <-> UTF-32 conversion
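The conversion mentioned in the last point is short to write by hand. The sketch below is illustrative only: `to_utf8` and `to_utf32` are hypothetical names, and both functions assume valid input (no checks for surrogates, overlong forms or truncated sequences).

```cpp
#include <cstddef>
#include <string>

// Encodes each code point as 1 to 4 bytes. Assumes valid Unicode scalar values.
std::string to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Decodes well-formed UTF-8 into one char32_t per code point.
std::u32string to_utf32(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i++]);
        if (lead < 0x80) {
            out.push_back(static_cast<char32_t>(lead));
            continue;
        }
        int extra = lead >= 0xF0 ? 3 : lead >= 0xE0 ? 2 : 1;
        char32_t cp = lead & (0x3F >> extra);          // payload bits of the lead byte
        while (extra-- > 0 && i < in.size())
            cp = (cp << 6) | (static_cast<unsigned char>(in[i++]) & 0x3F);
        out.push_back(cp);
    }
    return out;
}
```

As a size comparison, the Czech word "kůň" takes 5 bytes in UTF-8 (1 + 2 + 2) and 12 bytes in UTF-32 (3 characters of 4 bytes each).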

michalisek commented 9 years ago

UTF-8 is a good choice for the internal representation. Nevertheless, the modules responsible for finding similar words should, in my opinion, use UTF-32 internally:

Lexicon - the current TRIE-based implementation requires fixed-length characters
SimWordsFinder - this class uses direct character access by index
ErrorModel - should use the same encoding as the SimWordsFinder (since error model is queried by SimWordsFinder)

Pros of using UTF-32 internally in the above classes

faster code
simpler code
less code changes required

Cons of using UTF-32 internally in the above classes

higher memory consumption (only Lexicon matters, error models are small in comparison)

I think that the pros far outweigh the cons.
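To illustrate the "simpler code" argument, here is a small sketch of index-based editing on a UTF-32 string. The function `adjacent_swaps` is hypothetical and the swap operation is only an example of an edit that needs direct access by index; it is not taken from Korektor's code.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical example: every word obtained by swapping two adjacent characters.
// With UTF-32, word[i] is always one whole character, so indexing is direct.
std::vector<std::u32string> adjacent_swaps(const std::u32string& word) {
    std::vector<std::u32string> candidates;
    for (std::size_t i = 0; i + 1 < word.size(); ++i) {
        std::u32string candidate = word;
        std::swap(candidate[i], candidate[i + 1]);
        candidates.push_back(candidate);
    }
    return candidates;
}
```

The same operation on a UTF-8 string would first have to find the byte boundaries of both characters, since they may differ in length.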

foxik commented 9 years ago

From my point of view:

I am not sure the code will be faster with UTF-32, as the keys of the ErrorModel will be larger (4x for ASCII, ~3x for Czech)
UTF-8 will require more complicated code, but
- Lexicon will be unaffected (it will store bytes of UTF-8 encoding without understanding), except for GetSimilarWords_impl
- ErrorModel will be unaffected (it will store bytes of UTF-8 encoding without understanding)
- SimWordsFinder (which only handles casing) accesses the characters sequentially, so it will be simple to modify
The most complicated method will be Lexicon::GetSimilarWords_impl, because
- when adding/replacing a character, it may have to add multiple bytes from the Lexicon trie
- when deleting/replacing a character from the input string, it may have to remove multiple bytes (from the end of the string)
the language models will eventually be in UTF-8 (either when we use a library like kenlm, or when we rewrite them to use hashes)
eventually I want to rewrite Lexicon structure (it currently takes more time to find the suggestions than to query the language models), and UTF-8 will be much more suited for the new representation I have in mind
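The byte-level bookkeeping that GetSimilarWords_impl would need can be sketched as follows. This is an assumption-laden illustration, not the planned implementation: `sequence_length`, `pop_back_code_point` and `replace_last_code_point` are hypothetical helpers, and the strings are assumed to hold well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Number of bytes in the sequence introduced by this lead byte.
// Useful when descending a byte-level trie: one character may need
// several consecutive byte edges. Assumes a valid lead byte.
std::size_t sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;
    if (lead < 0xE0) return 2;
    if (lead < 0xF0) return 3;
    return 4;
}

// Continuation bytes have the bit pattern 10xxxxxx.
bool is_continuation(unsigned char byte) {
    return (byte & 0xC0) == 0x80;
}

// Removes the last code point from the end of the string and returns
// how many bytes were removed (0 for an empty string).
std::size_t pop_back_code_point(std::string& word) {
    std::size_t removed = 0;
    while (!word.empty()) {
        unsigned char last = static_cast<unsigned char>(word.back());
        word.pop_back();
        ++removed;
        if (!is_continuation(last)) break;   // reached the lead byte
    }
    return removed;
}

// Replaces the last code point with another one given as UTF-8 bytes.
void replace_last_code_point(std::string& word, const std::string& replacement) {
    pop_back_code_point(word);
    word += replacement;
}
```

Because removal always happens at the end of the string, the lead byte can be found by scanning backwards over continuation bytes, without decoding the whole word.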

ufal / korektor

Change internal representation to UTF-8 #9