tatuylonen / wiktextract

Wiktionary dump file parser and multilingual data extractor

Text at the beginning of Translations section gets merged with first sense #45

Closed tatuylonen closed 3 years ago

tatuylonen commented 3 years ago

Reported by Christian Siefkes:

Hello Tatu,

I noticed a few oddities or errors while working with the English dump (kaikki.org-dictionary-English.json.bz2).

Regarding the information from https://en.wiktionary.org/wiki/aunt#Translations , there is stuff there right after the "Translation" section in the wikitext ("Several languages distinguish between blood aunts..."). In your parse, this text is mixed into the first word sense ("a parent's sister or sister-in-law").

Moreover, the Chinese / Mandarin translations are missing from your parse. In the raw text, these are nested two levels deep, which seems to confuse your parser.

Regarding https://en.wiktionary.org/wiki/flower#Translations: your translations start with Luhya; all earlier translations are missing. Moreover, the translations have no sense information, though two different word senses are listed in Wiktionary.

Not sure if you can fix this, but thought I would let you know.

Thanks and best regards Christian

tatuylonen commented 3 years ago

The first issue (text from beginning of the "Translation" section getting into sense) should now be fixed (web site will probably be updated by tomorrow).

The second issue (Mandarin translations missing) has also now been fixed.

I've not yet looked into the third issue (re "flower").

tatuylonen commented 3 years ago

The issue with "flower" should now also be fixed. This probably affected many other words as well. The fix should appear on the web site over the weekend.

ChristianSi commented 3 years ago

Thanks, I've just downloaded the latest dump, and there the word sense issue of "aunt" as well as the issue with "flower" are indeed fixed.

However, the Chinese translations of "aunt" still seem to be missing.

tatuylonen commented 3 years ago

To me it looks like the Chinese translations of "aunt" would be there (under Mandarin and Cantonese, as Wiktionary has dedicated language codes for these). See https://kaikki.org/dictionary/English/meaning/a/au/aunt.html

However, the handling of the English text (related to word sense) is inconsistent, because entries annotate it differently. I'm working on code that would be able to handle both (for linkages such as synonyms next, but I expect to merge it for translations as well in a few days). It's based on heuristically classifying parenthesized expressions as tags, romanizations, english, or other (and I'll probably add taxonomic species names as an additional category). So far the classification seems to work fairly well (not yet committed to the repository).
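To illustrate the kind of heuristic classification described above, here is a minimal sketch. It is not the actual wiktextract code (which, per the comment, was not yet committed); the function name `classify_parenthesized`, the tiny `KNOWN_TAGS` vocabulary, and the specific rules are all assumptions made for illustration.

```python
import unicodedata

# Hypothetical tag vocabulary, for illustration only.
# A real list would be far larger.
KNOWN_TAGS = {"masculine", "feminine", "neuter", "plural", "formal", "informal"}


def classify_parenthesized(text):
    """Classify the contents of one parenthesized expression.

    Returns "tags", "romanization", "english", or "other".
    The rules below are deliberately crude assumptions.
    """
    text = text.strip()
    if not text:
        return "other"

    # Rule 1: a comma-separated list made up only of known tags.
    parts = [p.strip().lower() for p in text.split(",")]
    if all(p in KNOWN_TAGS for p in parts):
        return "tags"

    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return "other"

    # Rule 2: Latin-script text with at least one non-ASCII letter
    # (for example a vowel with a macron) is guessed to be a romanization.
    all_latin = all(unicodedata.name(ch, "").startswith("LATIN") for ch in letters)
    has_non_ascii = any(ord(ch) > 127 for ch in letters)
    if all_latin and has_non_ascii:
        return "romanization"

    # Rule 3: plain ASCII text with two or more words is guessed to be English.
    if text.isascii() and len(text.split()) >= 2:
        return "english"

    return "other"
```

Under these assumed rules, `classify_parenthesized("feminine, plural")` yields `"tags"` and `classify_parenthesized("father's elder brother's wife")` yields `"english"`. A single unaccented Latin word stays ambiguous and falls through to `"other"`, which is one reason a real classifier would need much richer word lists.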

ChristianSi commented 3 years ago

You're right, the Chinese translations are there. The Mandarin translations are listed under more specific word senses ("a parent's sister or sister-in-law (father's elder brother's wife)" etc.), which confuses my parser, but it corresponds to the info in Wiktionary.

tatuylonen commented 3 years ago

I'm thinking of changing the extractor so that the text from the translation entry itself would go in the "english" field rather than the sense field. The main reason for this is that the sense in the translation list identifies the meaning of the source word, whereas the English text in the translation identifies the sense of the target word (though this is not entirely consistent). Thus I think it makes more sense to treat them separately.
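For someone consuming the data as it looked at the time of this thread, the proposed separation can be approximated on the consumer side. The sketch below is an assumption-laden illustration, not part of wiktextract: the helper name `split_sense_and_english` is hypothetical, and it assumes the entry-specific text appears as a single trailing parenthesized expression appended to the sense string, as in the "aunt" example quoted earlier.

```python
import re

# Matches "<sense> (<english>)" where the parenthesized part is at the very end
# and contains no nested parentheses. This pattern is an illustrative assumption.
_TRAILING_PAREN = re.compile(r"^(?P<sense>.*?)\s*\((?P<english>[^()]*)\)\s*$")


def split_sense_and_english(gloss):
    """Split a merged sense string into "sense" and "english" parts.

    If there is no trailing parenthesized expression, only "sense" is returned.
    """
    match = _TRAILING_PAREN.match(gloss)
    if match and match.group("sense"):
        return {
            "sense": match.group("sense"),
            "english": match.group("english").strip(),
        }
    return {"sense": gloss.strip()}


merged = "a parent's sister or sister-in-law (father's elder brother's wife)"
record = split_sense_and_english(merged)
# record["sense"] identifies the meaning of the source word,
# record["english"] narrows the meaning of the target word.
```

This only handles the simplest case. Senses that legitimately end in parentheses would be split incorrectly, which is part of why doing the separation inside the extractor, where the list heading and the entry text are still distinct, is the cleaner design.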

Overall, I'm planning a bigger overhaul of translation extraction next week, primarily to handle parenthesized strings more intelligently (now they often go in the wrong field). I'm planning to use the same classification approach that I now use for linkages.

ChristianSi commented 3 years ago

There is an "english" field in translations? Can't remember having seen that before.

Looking forward to the overhauled extraction approach, it sounds great!

tatuylonen commented 3 years ago

I think I'm mixing it up with linkages - there was one in linkages, but I recently merged it with sense, which I now think I should revert. However, I think there is a need for a similar field in translations - some translations have clarifying text, generally restricting the meaning of the translation or when it can be used.