Open foxik opened 9 years ago
UTF-8 is a good choice for the internal representation; nevertheless, the modules responsible for finding similar words should, in my opinion, use UTF-32 internally:
Pros of using UTF-32 internally in the above classes
Cons of using UTF-32 internally in the above classes
I think that the pros far outweigh the cons.
From my point of view:
UTF-8 will require more complicated code, but
The most complicated method will be Lexicon::GetSimilarWords_impl, because it will have to deal with
Currently, we are using UCS-2 as the internal encoding, which prevents us from using Unicode characters outside the BMP.
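To make the BMP limitation concrete, here is a minimal sketch of the check involved. The names `kMaxBmp` and `FitsInUcs2` are hypothetical and only for illustration; they are not taken from the project.

```cpp
// Largest code point in the Basic Multilingual Plane (BMP).
constexpr char32_t kMaxBmp = 0xFFFF;

// Hypothetical helper: returns true when the code point can be stored in a
// single 16-bit UCS-2 code unit. Surrogate values (U+D800..U+DFFF) are not
// characters, so they are rejected as well.
constexpr bool FitsInUcs2(char32_t cp) {
  return cp <= kMaxBmp && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// U+010D (LATIN SMALL LETTER C WITH CARON) lies inside the BMP.
static_assert(FitsInUcs2(0x010D), "BMP character fits in UCS-2");
// U+1F600 (GRINNING FACE) lies outside the BMP and cannot be stored.
static_assert(!FitsInUcs2(0x1F600), "non-BMP character does not fit");
```

Anything for which this check fails simply cannot be represented with the current encoding, whereas both UTF-8 and UTF-32 can encode it.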
We should change the internal representation; the current plan is to use UTF-8:
char and string datatypes
SimWordsFinder::Find will have to interpret the UTF-8 encoding and understand that one Unicode character can be represented as multiple code units. Maybe the input word will be converted to UTF-32, but I do not think so, because both lexicon and error model will be in UTF-8
The alternative to UTF-8 is to use UTF-32, but
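To illustrate what "interpreting the UTF-8 encoding" involves, here is a rough sketch of decoding a UTF-8 string into UTF-32 code points. `DecodeUtf8` is a hypothetical helper, not existing project code, and it makes simplifying assumptions: malformed sequences are replaced with U+FFFD, and overlong encodings and encoded surrogates are not rejected.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the project): decodes UTF-8 code units
// into UTF-32 code points. Simplified: malformed input yields U+FFFD, and
// overlong forms or encoded surrogates are NOT detected.
std::u32string DecodeUtf8(const std::string& in) {
  std::u32string out;
  std::size_t i = 0;
  while (i < in.size()) {
    const unsigned char lead = static_cast<unsigned char>(in[i]);
    char32_t cp = 0;
    std::size_t len = 0;
    // The lead byte determines how many code units form this character.
    if (lead < 0x80)              { cp = lead;        len = 1; }
    else if ((lead >> 5) == 0x06) { cp = lead & 0x1F; len = 2; }
    else if ((lead >> 4) == 0x0E) { cp = lead & 0x0F; len = 3; }
    else if ((lead >> 3) == 0x1E) { cp = lead & 0x07; len = 4; }
    else { out.push_back(0xFFFD); ++i; continue; }  // invalid lead byte

    if (i + len > in.size()) { out.push_back(0xFFFD); break; }  // truncated

    bool ok = true;
    for (std::size_t k = 1; k < len; ++k) {
      const unsigned char c = static_cast<unsigned char>(in[i + k]);
      if ((c & 0xC0) != 0x80) { ok = false; break; }  // not a continuation
      cp = (cp << 6) | (c & 0x3F);  // append the 6 payload bits
    }
    if (!ok) { out.push_back(0xFFFD); ++i; continue; }

    out.push_back(cp);
    i += len;
  }
  return out;
}
```

For example, the 7-byte input `"a\xC4\x8D\xF0\x9F\x98\x80"` decodes to three code points (U+0061, U+010D, U+1F600). With UTF-32 every character is one element and can be indexed directly, while code working on UTF-8 has to repeat this kind of lead-byte inspection whenever it steps from one character to the next.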