Open omarandlorraine opened 2 years ago
Afrikaans, Azeri and Breton seem okay
Belarusian is affected
Bulgarian, Catalan, Czech, Welsh, German, Greek, Spanish seem okay
Estonian is a weird one: some words are correct (for example, lõketki), but others, like "belut¨ki", have come out wrong
Faroese, Basque, Frisian, Indonesian, Hebrew, Hungarian, Icelandic, Kazakh, Korean, Italian seem okay
Latin and Latvian seem okay
Catalan and Luxembourgish both have some kind of grammatical tags after the word; is this intentional?
Dutch and both Norwegians are okay
Malay seems okay
I can do more checking later if this kind of information is useful?
Lithuanian is also affected; the file has words like "abromiðkë" (neither ð nor ë is a Lithuanian character)
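Seeing ð and ë where Lithuanian would have š and ė is the pattern you'd get if Windows-1257 (Baltic) bytes were read as Latin-1. That is only a guess about this file, but it is cheap to test with a round trip. A minimal sketch; the function name and the choice of codecs are assumptions, not something taken from the repo:

```python
def redecode(text: str, wrong: str = "latin-1", right: str = "cp1257") -> str:
    """Undo a mis-decode: recover the original bytes, then decode them properly.

    Assumption: `text` was produced by reading `right`-encoded bytes as `wrong`.
    """
    return text.encode(wrong).decode(right)


if __name__ == "__main__":
    # The sample word is the one quoted in this thread.
    print(redecode("abromi\u00f0k\u00eb"))
```

If the guess is right, the ð and ë in the sample come back as š and ė. If the file was in some other encoding, the result will still look wrong (or raise a UnicodeError), which is also useful information.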
Thank you for all the detective work! I'll start to address this soon. I really appreciate it, as it's hard for me to check this sort of thing (except for the obvious ones which I just missed :sweat_smile:)
> Catalan and Luxembourgish both have some kind of grammatical tags after the word; is this intentional?
No, it's not intentional. All of them had it before and I removed it with some regex but I must have missed those two files
Just to consolidate, I'll list the languages that you've found to be affected:
By the grammatical tags, do you mean things like this: "al·lolàlia"? (That's for Catalan; the Luxembourgish ones were pretty obvious.)
It turns out my regex didn't catch them because both files have non-standard formatting
If the Catalan assumption is correct, then I've addressed this in #4
After some digging, it turns out that "al·lolàlia" is the correct spelling. That character is the interpunct (·). It always appears between two l's, and it needs to be treated as a letter
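For what it's worth, here is one way a cleanup regex could keep the interpunct inside a word. This is only a sketch: the pattern and the helper name are made up for illustration, and it is not the regex used in the repo.

```python
import re
import unicodedata
from typing import Optional

# One or more letters (no digits, no underscore), optionally joined by the
# interpunct U+00B7. Illustrative pattern only, not the repo's regex.
WORD_RE = re.compile(r"[^\W\d_]+(?:\u00b7[^\W\d_]+)*")


def first_word(line: str) -> Optional[str]:
    """Return the first word on a line, keeping sequences like l·l intact."""
    # NFC makes sure accented letters are single code points before matching.
    match = WORD_RE.search(unicodedata.normalize("NFC", line))
    return match.group(0) if match else None
```

A plain `\w+` does not match U+00B7, so it would cut the word in two at the interpunct; the extra group above is what keeps it together.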
Oops! I'll go ahead and revert the change for Catalan then... Thanks for the catch!
I also have a branch with corrections for the file encoding, which I'll start a PR for soon.
I'll go ahead and merge them with your approval @omarandlorraine. Thank you for all the hard work! I really appreciate it, and I wouldn't be able to add language support like this without awesome community members like you! :)
Thanks for your kind words @hisbaan! It's a pleasure to be able to help.
I've done some more checks:
Occitan: the encoding is good, but the file has tags in it, like "apadoïta di:* id:4412". It looks like the tags are separated by whitespace and contain a colon.
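If that description of the tags holds for the whole file (I have only looked at a sample), stripping them could be as simple as dropping every whitespace-separated token that contains a colon. A rough sketch, with a made-up function name:

```python
def strip_tags(line: str) -> str:
    """Drop whitespace-separated tokens that contain a colon.

    Assumption: real words in the list never contain ':' themselves.
    """
    tokens = line.split()
    return " ".join(token for token in tokens if ":" not in token)


if __name__ == "__main__":
    # Sample line quoted in this thread.
    print(strip_tags("apadoïta di:* id:4412"))
```

This avoids a regex entirely, so it should not be sensitive to how many spaces or tabs separate the tags.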
Polish has encoding problems
Russian has encoding problems
Slovene has encoding problems
The Venda wordlist seems to contain some non-Venda words like "am-serverwithnoidentities"; they seem to be software-related jargon that's somehow gotten mixed in
The same goes for Xhosa; it's got entries like "formwizard", "gb-2312-80", "getlasterror"
As for the Zulu one, there are lines starting with "pfx". I believe these are codes used by some spell checker, maybe to add the correct prefixes to a lemma.
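If those lines really are spell-checker affix rules (a guess on my part), they could be filtered out by looking at the first field of each line. A small sketch; only the "pfx" marker comes from the file, while the function name and the sample lines are invented:

```python
from typing import Iterable, Iterator, Tuple


def drop_affix_lines(
    lines: Iterable[str], markers: Tuple[str, ...] = ("pfx",)
) -> Iterator[str]:
    """Yield only the lines whose first field is not an affix marker."""
    for line in lines:
        fields = line.split()
        # Compare the whole first field, so a word that merely starts
        # with the letters "pfx" is not dropped by accident.
        if fields and fields[0].lower() in markers:
            continue
        yield line


if __name__ == "__main__":
    sample = ["pfx a b c", "umuntu", "pfxword"]  # made-up sample lines
    print(list(drop_affix_lines(sample)))
```

Other markers can be added to `markers` if the file turns out to contain more kinds of rule lines.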
I think that's all the problems
For Xhosa and Venda, there are a LOT of these mixed-in words. Every single LaTeX command seems to be here (like varphi, vartheta, etc.), along with lots of menu options (config, about, etc.). Whoever generated the original list did some sort of automated collecting, I suppose. I'll look for new sources, since these lists contain so many of them
I have added a few more fixes to the PR now
Russian seems to have the wrong encoding. The same problem appears to also affect Bosnian and possibly Estonian (grep for the standalone diaeresis "¨", which is not an Estonian character).
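The grep idea generalises: list every word that contains a character outside the alphabet you expect for that language. A minimal sketch for Estonian; the letter set below is my assumption (Estonian letters plus c, q, w, x, y for loanwords) and the function name is made up:

```python
import unicodedata

# Assumed alphabet, for illustration only.
ESTONIAN_LETTERS = set(
    unicodedata.normalize("NFC", "abcdefghijklmnopqrsšzžtuvwõäöüxy")
)


def suspect_words(words, allowed=ESTONIAN_LETTERS):
    """Return the words containing at least one character outside `allowed`."""
    suspects = []
    for word in words:
        # Normalise so that accented letters compare as single code points.
        normalized = unicodedata.normalize("NFC", word).lower()
        if any(ch not in allowed for ch in normalized):
            suspects.append(word)
    return suspects
```

Hyphenated or apostrophised entries would be flagged too, so the output is a list of candidates to eyeball rather than a list of confirmed errors.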