unicode-org / unilex

Lexical data at Unicode

Outdated encoding for Malayalam sample #3

Open asmusf opened 6 years ago

asmusf commented 6 years ago

https://github.com/unicode-org/unilex/blob/master/data/frequency/ml.txt

This file uses the Unicode 5.0 and earlier encoding for chillu characters, i.e. consonant + virama + ZWJ sequences rather than the atomic chillu code points added in Unicode 5.1. (See Chapter 12 of Unicode 10.0.)

behnam commented 6 years ago

IIUC, a normalization table can be used, as shown in Table 12-37, "Atomic Encoding of Malayalam Chillus": https://www.unicode.org/versions/Unicode10.0.0/ch12.pdf#page=65.
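
A minimal sketch of what such a normalization could look like in Python. The mapping below is my reading of Table 12-37 (legacy consonant + VIRAMA + ZWJ sequence on the left, atomic chillu on the right) and should be checked against the standard before it is applied to the data file; the names `LEGACY_TO_ATOMIC` and `to_atomic_chillus` are made up for illustration and are not part of unilex.

```python
# Sketch only: the mapping is an assumption based on Table 12-37 of
# Unicode 10.0, Chapter 12. Verify it against the standard before use.
VIRAMA = "\u0D4D"  # MALAYALAM SIGN VIRAMA
ZWJ = "\u200D"     # ZERO WIDTH JOINER

# Legacy sequence (Unicode 5.0 and earlier) -> atomic chillu (Unicode 5.1+)
LEGACY_TO_ATOMIC = {
    "\u0D23" + VIRAMA + ZWJ: "\u0D7A",  # NNA -> CHILLU NN
    "\u0D28" + VIRAMA + ZWJ: "\u0D7B",  # NA  -> CHILLU N
    "\u0D30" + VIRAMA + ZWJ: "\u0D7C",  # RA  -> CHILLU RR
    "\u0D32" + VIRAMA + ZWJ: "\u0D7D",  # LA  -> CHILLU L
    "\u0D33" + VIRAMA + ZWJ: "\u0D7E",  # LLA -> CHILLU LL
}


def to_atomic_chillus(text: str) -> str:
    """Replace each legacy chillu sequence with its atomic code point."""
    for legacy, atomic in LEGACY_TO_ATOMIC.items():
        text = text.replace(legacy, atomic)
    return text
```

Sequences without the trailing ZWJ are left alone, since a bare consonant + virama is not a chillu.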

asmusf commented 6 years ago

I think the file should be replaced with one that uses the atomic encoding. Another issue is the use of U+0D4C, which I understand is considered outdated. (Other corpora I've encountered recently do not have the latter issue.)
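
To gauge how much of the file is affected by each issue, something along these lines could be run first. It is only a rough sketch: the function name `audit_legacy_malayalam` is hypothetical, the list of base consonants is an assumption about which legacy chillu sequences occur, and it only counts occurrences without changing anything.

```python
# Sketch only: counts the two patterns discussed in this issue.
VIRAMA_ZWJ = "\u0D4D\u200D"  # MALAYALAM SIGN VIRAMA + ZERO WIDTH JOINER

# Assumed base consonants of the legacy chillu sequences: NNA, NA, RA, LA, LLA
LEGACY_CHILLU_BASES = "\u0D23\u0D28\u0D30\u0D32\u0D33"


def audit_legacy_malayalam(text: str) -> dict:
    """Count legacy chillu sequences and occurrences of U+0D4C in text."""
    chillu_count = sum(
        text.count(base + VIRAMA_ZWJ) for base in LEGACY_CHILLU_BASES
    )
    return {
        "legacy_chillu_sequences": chillu_count,
        "U+0D4C": text.count("\u0D4C"),
    }
```

Reading `data/frequency/ml.txt` as UTF-8 and passing its contents to this function would give a quick count of each pattern before deciding how to replace the file.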

brawer commented 6 years ago

If you don’t mind, could you send a pull request to fix the problem?