Closed Meng-Heng closed 4 months ago
This pull request is from an external repo and will not automatically be built. The build must still be passed before it can be merged. Ask one of the team members to make a manual build of this PR.
This lexical model is for Latin (Latn) characters. You may need to consult with the community about the following non-Latin characters. I think they're Cyrillic (Cyrl) and Georgian (Geor) characters, and should be removed from this wordlist.
Count | Unicode Value | Character |
---|---|---|
6 | 0x000430 | а |
1 | 0x000431 | б |
2 | 0x000432 | в |
1 | 0x000434 | д |
2 | 0x000435 | е |
1 | 0x000437 | з |
4 | 0x000438 | и |
1 | 0x000439 | й |
4 | 0x00043A | к |
3 | 0x00043C | м |
2 | 0x00043D | н |
3 | 0x00043E | о |
5 | 0x000440 | р |
1 | 0x000441 | с |
3 | 0x000442 | т |
2 | 0x000443 | у |
1 | 0x000444 | ф |
1 | 0x000447 | ч |
1 | 0x000448 | ш |
2 | 0x00044B | ы |
169 | 0x0010F1 | ჱ |
1 | 0x0010F9 | ჹ |
10 | 0x0010FA | ჺ |
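For anyone wanting to reproduce a tally like the one above, here is a minimal sketch in Python. It is not the tool that produced the table; `count_non_latin` and `format_rows` are hypothetical helper names, and "non-Latin" is approximated as any alphabetic character whose Unicode name does not start with `LATIN`.

```python
from collections import Counter
import unicodedata


def count_non_latin(words):
    """Count alphabetic characters whose Unicode name does not start with 'LATIN'.

    Assumption: this name-based check is a rough stand-in for a real
    script (Latn) test; it is not how the table above was generated.
    """
    counts = Counter()
    for word in words:
        for ch in word:
            if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN"):
                counts[ch] += 1
    return counts


def format_rows(counts):
    """Render rows as 'Count | Unicode Value | Character |', sorted by code point."""
    ordered = sorted(counts.items(), key=lambda item: ord(item[0]))
    return [f"{n} | 0x{ord(ch):06X} | {ch} |" for ch, n in ordered]
```

Passing the words of the wordlist to `count_non_latin` and printing `format_rows(...)` should give rows in the same shape as the table, which makes it easy to re-check after a cleanup.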
@darcywong00 is correct:

0400-04FF = Cyrillic block
10A0-10FF = Georgian block

Entries with these characters should be corrected (or dropped from the .tsv file).
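The block ranges above can be turned into a quick filter. This is only a sketch under stated assumptions: the word is taken to be the first tab-separated column of each line, and `offending_block` and `split_entries` are hypothetical names, not part of any existing tooling.

```python
# Block ranges as quoted in the comment above (inclusive code point bounds).
BLOCKS = {
    "Cyrillic": (0x0400, 0x04FF),
    "Georgian": (0x10A0, 0x10FF),
}


def offending_block(ch):
    """Return the block name if ch falls in one of BLOCKS, else None."""
    cp = ord(ch)
    for name, (lo, hi) in BLOCKS.items():
        if lo <= cp <= hi:
            return name
    return None


def split_entries(lines):
    """Split TSV lines into (kept, dropped).

    Assumption: the word is the first tab-separated column; only that
    column is checked.
    """
    kept, dropped = [], []
    for line in lines:
        word = line.split("\t", 1)[0]
        if any(offending_block(ch) for ch in word):
            dropped.append(line)
        else:
            kept.append(line)
    return kept, dropped
```

Keeping the dropped lines in a separate list rather than discarding them makes it possible to review them with the community and correct entries by hand where that is preferable to removal.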
I have removed the specified characters and parentheses from the wordlist. Thanks, @darcywong00 and @DavidLRowe!
Please approve if everything looks good. Thank you!