ufal / korektor

Statistical spell- and (occasional) grammar-checker.
lindat.mff.cuni.cz/services/korektor
BSD 2-Clause "Simplified" License

Change internal representation to UTF-8 #9

Open foxik opened 9 years ago

foxik commented 9 years ago

Currently, we use UCS-2 as the internal encoding, which prevents us from using Unicode characters outside the BMP.
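To illustrate the limitation: UCS-2 stores each character in a single 16-bit unit, so any code point above U+FFFF has no representation. Below is a minimal sketch; the helper names `fits_in_ucs2` and `utf8_length` are hypothetical and not taken from the korektor sources.

```cpp
#include <cassert>

// Hypothetical helper (not from korektor): true if the code point fits in a
// single 16-bit UCS-2 unit, i.e. it lies in the Basic Multilingual Plane.
inline bool fits_in_ucs2(char32_t cp) {
    return cp <= 0xFFFF;
}

// Hypothetical helper: number of bytes UTF-8 needs to encode a code point.
inline int utf8_length(char32_t cp) {
    assert(cp <= 0x10FFFF);  // Unicode code points end at U+10FFFF
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}
```

For example, U+010D (č) fits in UCS-2 and takes two bytes in UTF-8, while U+1F600 does not fit in UCS-2 at all and takes four bytes in UTF-8.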

We should change the internal representation; the current plan is to use UTF-8:

The alternative to UTF-8 is to use UTF-32, but

michalisek commented 9 years ago

UTF-8 is a good choice for the internal representation. Nevertheless, the modules responsible for finding similar words should, in my opinion, use UTF-32 internally:

Pros of using UTF-32 internally in the above classes

Cons of using UTF-32 internally in the above classes

I think that the pros far outweigh the cons.
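A rough sketch of what working in UTF-32 inside such a module could look like: decode the UTF-8 word once, after which every character is one fixed-width element that can be indexed and replaced directly. The function `utf8_to_utf32` is a hypothetical helper, not part of korektor, and it assumes well-formed UTF-8 input (it performs no validation).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 into UTF-32 code points.
// Assumption: the input is well-formed UTF-8; malformed input is not detected.
std::u32string utf8_to_utf32(const std::string& s) {
    std::u32string out;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        // The lead byte determines the sequence length.
        std::size_t len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
        assert(i + len <= s.size());  // guard against a truncated sequence
        // Keep only the payload bits of the lead byte.
        char32_t cp = len == 1 ? b : b & (0xFF >> (len + 1));
        // Each continuation byte contributes six more bits.
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

With this, the Czech word "kůň" (5 bytes in UTF-8) becomes a 3-element `std::u32string`, so substituting the middle character is a single indexed assignment rather than a variable-length byte splice.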

foxik commented 9 years ago

From my point of view: