tatuylonen / wiktextract

Wiktionary dump file parser and multilingual data extractor

Gender markers in translations #63

Closed ChristianSi closed 3 years ago

ChristianSi commented 3 years ago

In today's dump, the romanization of Russian translations in the English dump has totally broken down. In nearly all cases, the Latin form is listed in parentheses after the Cyrillic word in the "word" field, while the "roman" field is missing.

For me this is a showstopper: I cannot use the dump without these romanizations and have to revert to an older version. So I hope this will be fixed quickly!

ChristianSi commented 3 years ago

The original issue has fixed itself in the latest dump. However, now gender markers (m/f/n) are treated as part of the word. This seems to affect translations into all or at least many languages that have genders. Some examples:

Again this makes the dump unusable for me, so I hope for a quick solution.

tatuylonen commented 3 years ago

This should now be fixed (also on the web site). It was broken by a recent change in classify_desc() in form_descriptions.py.

I need to implement more tests to make bugs like this less likely...

tatuylonen commented 3 years ago

I just implemented tests for classify_desc(). This should make this particular issue much less likely in the future.

ChristianSi commented 3 years ago

That's good to hear!

The latest dump is still from last night, so I guess it'll be a few hours until this shows up on the website?

tatuylonen commented 3 years ago

I just started a website regeneration run, but the romanization issue was already fixed last night and should already be on the current website (I fixed the underlying problem yesterday before seeing the bug report).

The extraction/website regeneration run now also runs tests before starting the extraction and aborts if any test fails. I'll be adding more tests to reduce the risk of major problems; however, some parts of Wiktextract are quite difficult/slow to test without the full run or at least creating a cache file, as a lot of operations depend on having access to all the templates defined in the dump. Thus at least for now the tests will mostly be on some of the lower-level functions.

tatuylonen commented 3 years ago

Sorry, I just realized you actually reported two different problems. So the romanization issue should be fixed. Regarding the gender marker issue, could you please clarify which word these examples were from? (I don't see the issue in the current version on the website though, looking at a few random words.)

ChristianSi commented 3 years ago

Oh, sorry for the confusion. I'd changed the issue because, when I re-downloaded the dump today, the romanization was fixed and the gender marker problem had appeared. That's still the case: the dump is from 2021-07-10 01:19:22 (don't ask me about the timezone, please). As far as I can tell, nearly all words with translations into languages such as Spanish, French, German, Russian etc. have this issue, but one word where I see it in many translations is "abolitionism".

tatuylonen commented 3 years ago

It is still not fixed. I'm looking into it now.

tatuylonen commented 3 years ago

I found the problem. I had carelessly added a .strip() call in parse_head_final_tags() and it caused the gender tags in translations and linkages to be ignored if the last character of the page title was the same as the gender tag.
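To illustrate how a stray `.strip()` can produce exactly this symptom, here is a minimal sketch. It is not the actual `parse_head_final_tags()` code: the regular expression, the function `split_gender_tag`, and the guard against titles that end in the tag are all assumptions made for the example.

```python
import re

# Hypothetical pattern: a space followed by a gender tag at the end of a form.
HEAD_FINAL_RE = re.compile(r"( (?:m|f|n))$")


def split_gender_tag(title: str, form: str, strip_tag: bool):
    """Return (form, tag); tag is None when nothing is split off.

    The guard skips splitting when the page title itself ends with the
    matched text, so a title that really ends in " m" keeps it.
    """
    m = HEAD_FINAL_RE.search(form)
    if m is None:
        return form, None
    matched = m.group(1)           # e.g. " m", including the space
    if strip_tag:
        matched = matched.strip()  # the careless variant: now just "m"
    if title.endswith(matched):
        return form, None          # treated as part of the word
    return form[: m.start()], m.group(1).strip()
```

With the title "abolitionism" and the form "Abolitionismus m", the unstripped variant returns `("Abolitionismus", "m")`, while the stripped variant returns `("Abolitionismus m", None)` because the title happens to end in the letter "m".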

I just restarted generating the web site. The data should update within five hours.

ChristianSi commented 3 years ago

This looks pretty good now! Most of the spurious gender markers are now gone, so the dump is usable again. Thanks a lot!

However, I still noticed a few spurious gender markers which weren't there in the past. These include:

German translation:

Portuguese translation:

Spanish translation:

French translation:

Other words are affected as well; as usual, these are just examples.

ChristianSi commented 3 years ago

Since this issue has somehow come to involve two problems, I would also like to report that, while most Russian transliterations are fine again, a few are still missing. These include:

Some Japanese transliterations have got lost. Affected words include:

For "conjugation" (fusion of organisms), one Japanese translation is "結合 (けつごう, ketsugō)", mixing both spellings and the transliteration.

Some Hindi transliterations were also lost, e.g. for:

And Arabic:

All these examples refer to transliterations that were correctly detected until a few days ago.

tatuylonen commented 3 years ago

Your first list (f or m in some translation) was all due to coding errors in Wiktionary:

I fixed all these cases in Wiktionary, but as you note, they are probably only the tip of the iceberg and I'll probably need to implement a more generic solution for the missing space case too.

Your second batch was due to romanization being classified as english. I fixed these by the following changes to the classification code:

I also added a few more test cases that should catch these if they ever resurface.

The "conjugation" one was due to an extra comma at the end (I also fixed it in Wiktionary).

I started a database update, but I'm sure there will be more cases of a missing space in Wiktionary, and these will leave further stray m/f markers. I'll look into translations with a remaining comma in more detail tomorrow (starting by extracting all such cases from the extracted json file).
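Pulling out translations that still contain a comma could look roughly like the sketch below. It assumes the extracted file is JSON Lines (one entry per line) and that each entry carries a top-level `translations` list whose items have `word` and `lang` keys; that layout and the function name `translations_with_comma` are assumptions for illustration.

```python
import json


def translations_with_comma(lines):
    """Yield (entry word, language, translation) for translations containing a comma."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        for tr in entry.get("translations", []):
            word = tr.get("word", "")
            if "," in word:
                yield entry.get("word"), tr.get("lang"), word
```

Passing an open file object as `lines` streams the data without loading the whole file into memory.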

tatuylonen commented 3 years ago

There are a couple of thousand cases that are still not handled that I can find. There are probably more cases where the separating comma is missing that my quick search didn't capture. I will clearly need to re-relax the rules for splitting translations and deal with the problem cases somehow separately. It may take a few days for this to stabilize.

ChristianSi commented 3 years ago

I'm very much impressed by how patiently you massage the unruly Wiktionary data into a machine-readable form!

Personally I would consider it acceptable to close this issue now, since all the problems I had noticed are fixed and the number of translations that still have issues seems to be quite limited. But since you've indicated that you're still working on this, I'll leave it to you to close this once you're satisfied with the results.

yolpsoftware commented 3 years ago

Indeed, Tatu, I cannot stress enough how much your work here is appreciated! I always thought Wiktionary is the one Wiki that is too chaotic to be parsable, and I'm not the only one to have thought that.

ChristianSi commented 3 years ago

You may be aware of it already, but in the latest dump, spurious gender markers are back, this time in the form "x or" added at the end of translations. Most common is "m or", occurring in thousands of words. One example:

But there are also others, e.g.:

tatuylonen commented 3 years ago

I fixed some issues in linkage parsing on Saturday that likely caused this (it calls the same function to parse tags from the end of the form). The old method was more general but didn't always work and sometimes ate parts of the actual word form. Some of the patterns are clearly missing from the new method but should be easy to add. I'm hoping to fix this today.

I implemented the first 130 tests for linkage parsing over the weekend. I'm planning to implement tests for translation parsing later this week.

tatuylonen commented 3 years ago

I found over 19000 instances of translations now ending in " or" (earlier it was a few dozen). The vast majority of these are errors. I've implemented fixes for most of them (probably 99%), but there are probably some that I missed on this round. I'll check again tomorrow to fix the remaining ones. A few are valid translations ending in "or" (e.g., Greek translation of "logical or") and a few are due to errors in Wiktionary (e.g., "m or ma" or "u or u" suffixes for certain Swahili translations that I believe to be typos in Wiktionary).

The website update is running; it should complete in about five hours (unless some error occurs).

ChristianSi commented 3 years ago

Looks much better now after the latest update! But I still see more than a hundred translations ending in stuff such as "f or", "f sg or", "m or", "n or", "pl or", "11 or" that almost certainly should not be there. German, Swahili, Russian, French and Italian seem particularly affected. Perfection is certainly unachievable, but I hope you'll be able to fix some more of them.
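A rough way to spot these leftovers is a pattern that requires one or more tag-like tokens directly before a final "or". The sketch below is only a heuristic: the token list comes from the examples in this thread, and the names `SUSPICIOUS_OR_RE` and `has_dangling_or` are made up for illustration.

```python
import re

# Tag-like tokens taken from the examples in this thread; a real list would be longer.
TAG = r"(?:m|f|n|sg|pl|\d+)"
# One or more " <tag>" groups followed by a final " or".
SUSPICIOUS_OR_RE = re.compile(rf"(?: {TAG})+ or$")


def has_dangling_or(word: str) -> bool:
    """True when the translation ends in tag tokens followed by 'or'."""
    return SUSPICIOUS_OR_RE.search(word) is not None
```

Because a tag token must come before the "or", a legitimate form such as "logical or" is not flagged, while something like "Haus n or" or "casa f sg or" is.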

tatuylonen commented 3 years ago

I found about 900 of them today. I again fixed most of the cases, but some probably remain for tomorrow. There is now also a debug message about them, and the diagnostic page on the web site has the errors structured better, so it will hopefully make debugging these much easier as well. And the current tests would have detected this issue before it ever got out. (Dedicated tests for translation extraction are still mostly missing though.)

tatuylonen commented 3 years ago

I found 277 today. I again fixed a lot of cases and will check again tomorrow. An increasing fraction of them look suspicious in Wiktionary, but I don't know the various languages well enough to fix them there. I'm adding recognition of even the weirder combinations but with a debug message so many of the questionable ones can be found. There are also ones that can be found by inspecting the data; for example I've seen several references to "c" (for common gender as in Swedish, Norwegian, Danish etc.) in Spanish, but I'm not aware of Spanish having such a gender.
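The "inspect the data" idea could be sketched as a tally of (language, tag) pairs that is then compared against a hand-written table of plausible tags per language. Everything here is an assumption for illustration: the `tags` and `lang` keys, the tag name `common-gender`, and the helper names are not taken from Wiktextract.

```python
from collections import Counter


def tag_counts(translations):
    """Count how often each (language, tag) pair occurs."""
    counts = Counter()
    for tr in translations:
        for tag in tr.get("tags", []):
            counts[(tr.get("lang"), tag)] += 1
    return counts


def unexpected_tags(counts, expected):
    """List (language, tag) pairs not covered by `expected`.

    `expected` maps a language name to the set of tags considered
    plausible for it; languages missing from the map are not checked.
    """
    return sorted(
        (lang, tag)
        for (lang, tag) in counts
        if lang in expected and tag not in expected[lang]
    )
```

Given `expected = {"Spanish": {"masculine", "feminine"}}`, a Spanish translation tagged `common-gender` would be reported, while the same tag on a Swedish translation would be ignored.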

ChristianSi commented 3 years ago

This looks really good now! I'm closing this, as only very few cases remain now.

For completeness's sake, here are the words where I still spot problems – don't know whether it's due to input errors or some very special cases (some of them have two ors in a row):