sszuev / fastText_java

Java port of the C++ version of Facebook's fastText

Fix computeSubwords method #8

Closed y-yammt closed 5 years ago

y-yammt commented 5 years ago

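According to the link, the Unicode check (c & 0xC0) == 0x80 is only valid when strings are encoded in UTF-8, where it identifies continuation bytes. Java strings are UTF-16, so the check misidentifies character boundaries. This PR fixes the character extraction for UTF-16 strings.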
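For illustration, here is a minimal sketch of walking a Java string by code point rather than by UTF-8 continuation byte. It is not the code from this PR: the class name SubwordSketch and the method charNgrams are hypothetical, and the sketch leaves out the word-boundary handling and hashing that the real computeSubwords performs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of fastText_java.
class SubwordSketch {

    // Returns every substring of `word` that spans between minn and maxn
    // Unicode code points, stepping over surrogate pairs as single units.
    static List<String> charNgrams(String word, int minn, int maxn) {
        List<String> ngrams = new ArrayList<>();
        int i = 0;
        while (i < word.length()) {
            int j = i;
            // Extend the n-gram one code point at a time.
            for (int n = 1; j < word.length() && n <= maxn; n++) {
                // charCount is 2 for a supplementary code point (surrogate pair), else 1.
                j += Character.charCount(word.codePointAt(j));
                if (n >= minn) {
                    ngrams.add(word.substring(i, j));
                }
            }
            // Advance the start index by a whole code point, never into the middle of a pair.
            i += Character.charCount(word.codePointAt(i));
        }
        return ngrams;
    }
}
```

With the input "a\uD83D\uDE00b" (the letter a, one emoji stored as a surrogate pair, the letter b) and minn = 1, maxn = 2, the emoji is treated as a single character and no n-gram splits the pair.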

sszuev commented 5 years ago

Thanks for the PR. Good catch. Although it seems the code could be slightly optimized (e.g. codePointAt is called twice when i = j), it works, has a test, and seems to be better than it was.