Wrong encoding for some files

hisbaan / wordlists

A collection of word lists in different languages, compiled from around the web, for use in various word-related projects.

MIT License


Wrong encoding for some files #1

Open omarandlorraine opened 2 years ago

omarandlorraine commented 2 years ago

Russian seems to have the wrong encoding. The same problem also appears to affect Bosnian and possibly Estonian (grep for the standalone diaeresis ¨, which is not an Estonian character).
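A minimal sketch of that kind of check in Python, for anyone who wants to repeat it: count every character that falls outside the alphabet a language is expected to use. The function name `suspicious_characters` and the `ESTONIAN_LETTERS` set are assumptions made up for this example, not something from the repository.

```python
from collections import Counter

# Assumed alphabet for illustration: basic Latin letters plus the
# Estonian additions; adjust it for the language being checked.
ESTONIAN_LETTERS = set("abcdefghijklmnopqrsšzžtuvwõäöüxy")


def suspicious_characters(words, allowed):
    """Count every character that is neither allowed nor a hyphen."""
    counts = Counter()
    for word in words:
        for ch in word.lower():
            if ch not in allowed and ch != "-":
                counts[ch] += 1
    return counts


# Two sample entries; "\u00a8" is the standalone diaeresis.
sample = ["lõketki", "belut\u00a8ki"]
report = suspicious_characters(sample, ESTONIAN_LETTERS)
```

A non-empty `report` does not prove the file is mis-encoded, since loan words can carry foreign letters, but it gives a short list of characters worth looking at by hand.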

omarandlorraine commented 2 years ago

Afrikaans, Azeri and Breton seem okay

Belarusian is affected

Bulgarian, Catalan, Czech, Welsh, German, Greek, Spanish seem okay

Estonian is a weird one: some words are correct (for example, lõketki), but others, like "belut¨ki", have come out wrong

Faroese, Basque, Frisian, Indonesian, Hebrew, Hungarian, Icelandic, Kazakh, Korean, Italian seem okay

Latin and Latvian seem okay

Catalan and Luxembourgish both have some kind of grammatical tags after the word; is this intentional?

Dutch and both Norwegians are okay

Malay seems okay

I can do more checking later if this kind of information is useful?

Lithuanian is also affected; the file has words like abromiðkë (neither ð nor ë is a Lithuanian character)
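Symptoms like these usually mean the bytes were written in one single-byte encoding and read back as another. A hedged sketch of the usual repair follows: re-encode the text with the codec that was wrongly used for reading, then decode with the one that was probably intended. The thread does not say which encodings the source files actually use, so Windows-1257 for Lithuanian and ISO-8859-15 for Estonian are guesses that merely fit the two examples above; `redecode` is a hypothetical helper name.

```python
def redecode(text, wrong="latin-1", right="cp1257"):
    """Undo a mis-decoding: recover the raw bytes, then decode them again."""
    return text.encode(wrong).decode(right)


# "abromiðkë": ð is byte 0xF0 and ë is byte 0xEB in Latin-1.
# Assumption: the file was really Windows-1257 (Baltic).
lithuanian = redecode("abromi\u00f0k\u00eb", right="cp1257")

# "belut¨ki": the diaeresis is byte 0xA8 in Latin-1.
# Assumption: the file was really ISO-8859-15.
estonian = redecode("belut\u00a8ki", right="iso8859-15")

print(lithuanian)  # abromiškė
print(estonian)    # belutški
```

Both results use only letters that belong to the respective alphabets, which is a good sign, but a native speaker or a second source should confirm the guess before a whole file is converted.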

hisbaan commented 2 years ago

Thank you for all the detective work! I'll start to address this soon. I really appreciate it, as it's hard for me to check this sort of thing (except for the obvious ones which I just missed :sweat_smile:)

> Catalan and Luxembourgish both have some kind of grammatical tags after the word; is this intentional?

No, it's not intentional. All of them had it before and I removed it with some regex, but I must have missed those two files.

Just to consolidate, I'll list the languages that you've found to be affected:

Belarusian (edit, fixed)
Estonian (edit, fixed)
Catalan (edit, fixed)
Luxembourgish (edit, fixed)
Lithuanian (edit, fixed)
Russian (edit, fixed)
Bosnian (edit, fixed)

hisbaan commented 2 years ago

By the grammatical tags, do you mean the things like this: al·lolàlia? (for Catalan, the Luxembourgish ones were pretty obvious)

It turns out my regex didn't catch them because both files have non-standard formatting.

If the Catalan assumption is correct, then I've addressed this in #4

omarandlorraine commented 2 years ago

After some digging, it turns out that al·lolàlia is the correct spelling. That character is the interpunct. It always appears between l's and needs to be treated as a letter.
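For any cleanup script that tokenises the Catalan list, one way to treat the interpunct as a letter is sketched below. It is an illustration only, not the regex used in the repository: the pattern accepts runs of Unicode letters and additionally allows "·" (U+00B7) when it sits between two l's. The names `CATALAN_WORD` and `catalan_words` are made up for this example.

```python
import re

# Letters are matched with [^\W\d_]; the interpunct is accepted only
# when the previous and the next character are both "l" or "L".
CATALAN_WORD = re.compile(r"(?:[^\W\d_]|(?<=[lL])\u00b7(?=[lL]))+")


def catalan_words(line):
    """Return the words on a line, keeping l·l sequences in one piece."""
    return CATALAN_WORD.findall(line)


tokens = catalan_words("al·lolàlia col·legi")
print(tokens)  # ['al·lolàlia', 'col·legi']
```

Restricting the interpunct to the l·l context means a stray middle dot elsewhere in a line still acts as a separator.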

hisbaan commented 2 years ago

Oops! I'll go ahead and revert the change for Catalan then... Thanks for the catch!

I also have a branch with corrections for the file encoding, which I'll start a PR for soon.

hisbaan commented 2 years ago

I'll go ahead and merge them with your approval @omarandlorraine. Thank you for all the hard work! I really appreciate it, and I wouldn't be able to add language support like this without awesome community members like you! :)

omarandlorraine commented 2 years ago

Thanks for your kind words @hisbaan! It's a pleasure to be able to help.

I've done some more checks:

Occitan: the encoding is good, but the file has tags in it, like "apadoïta di:* id:4412". It looks like the tags are separated by whitespace and contain a colon.
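If that description of the tags holds for the whole file, stripping them could look roughly like this. The sketch assumes that no real Occitan entry contains a colon, which has not been checked here; `strip_tags` is a hypothetical helper name.

```python
def strip_tags(line):
    """Drop whitespace-separated fields that contain a colon."""
    fields = line.split()
    return " ".join(field for field in fields if ":" not in field)


cleaned = strip_tags("apadoïta di:* id:4412")
print(cleaned)  # apadoïta
```

Filtering on the colon rather than keeping only the first field leaves multi-word entries intact, should the list contain any.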

Polish has encoding problems

Russian has encoding problems

Slovene has encoding problems

The Venda wordlist seems to contain some non-Venda words like "am-serverwithnoidentities". They look like software-related jargon that has somehow gotten mixed in.

The same goes for Xhosa: it has entries like "formwizard", "gb-2312-80" and "getlasterror".

As for the Zulu one, there are lines starting with pfx. I believe these are codes used by some spell checker, maybe to add the correct prefixes to a lemma.
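Assuming those lines really are spell-checker affix rules, they could be filtered out with something like the sketch below. Hunspell-style affix files mark such rules with PFX and SFX, so "sfx" is included as a guess even though only pfx was seen in the file. The sample lines are invented for illustration and are not taken from the Zulu list.

```python
def drop_affix_lines(lines, markers=("pfx", "sfx")):
    """Keep lines whose first field is not an affix marker (case-insensitive)."""
    kept = []
    for line in lines:
        fields = line.split()
        if fields and fields[0].lower() in markers:
            continue  # looks like an affix rule, not a word
        kept.append(line)
    return kept


# Invented sample: two rule-like lines mixed with two ordinary entries.
sample = ["pfx A Y 1", "umuntu", "PFX A 0 aba .", "abantu"]
words = drop_affix_lines(sample)
print(words)  # ['umuntu', 'abantu']
```

Matching on the first whitespace-separated field rather than on a plain prefix avoids dropping a real word that merely begins with the letters "pfx" or "sfx".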

I think that's all the problems

hisbaan commented 2 years ago

For Xhosa and Venda, there are a LOT of these mixed-in words. Every single LaTeX command seems to be here (like varphi, vartheta, etc.) along with lots of menu options (config, about, etc.). Whoever generated the original list did some sort of automated collecting, I suppose. I'll look for new sources, since these lists have so many of these mixed-in words.

I have added a few more fixes to the PR now