karlb / wikdict-gen

Generation of bilingual dictionaries from Wiktionary/dbnary data for the WikDict project
http://www.wikdict.com
MIT License

Adding inflections #14

Open Vuizur opened 2 years ago

Vuizur commented 2 years ago

Hello,

thank you very much for developing this cool project! I have been working on something similar, based not on DBnary but on the Wiktextract project. Compared to your project, I only have [Language]-English dictionaries, but I got the idea that you could improve your dictionaries with very little code by adding the inflection data from kaikki.org. In my project I perform some post-processing that is still very much a work in progress, so in theory you could also take my inflection data from the published TSVs (in some cases, like Spanish, they are a clear improvement; in others they are likely still a bit buggy).

Have a great day!

karlb commented 2 years ago

That's interesting! I hadn't come across Wiktextract before. I wonder what the Wiktextract and DBnary guys think of each other's work, since it overlaps a lot.

WikDict does have inflection data, but only for those languages where DBnary extracts it (English, German, French and Swedish, IIRC). Obviously, I would prefer to get all data from a single source rather than merging different sources, which usually causes problems when joining, as well as other inconsistencies.

I won't do anything with this right now, but I will keep an eye on Wiktextract/kaikki.org, as well as your project.

Vuizur commented 2 years ago

The Wiktextract author wrote a paper where he details the differences. I am no expert either, but I think the difference is that Wiktextract only processes the English Wiktionary, but in turn extracts more detailed information. I think its secret is that it expands the Lua code in the Wiktionary XML dump using the original Wiktionary template code (so that it gets the original inflection tables). He suspected that DBnary only reimplemented some of the Lua code, leading to some bugs even in the English inflections.

(I don't know how hard it would be to integrate Wiktextract data into DBnary; that's a pretty interesting question.)

I would probably add the data at the last step, when creating the dictionary, and not bother inserting it into the RDF database. The easiest way might be to put the file from https://kaikki.org/dictionary/rawdata.html into something like an SQLite database with an index on "word", if 13 GB is too large to load into RAM, and then simply fetch the inflections on demand when generating the dictionary.
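A minimal sketch of that idea, in case it helps. It assumes the raw data is one JSON object per line with a "word" key, a "pos" key and a "forms" list whose items carry a "form" string; that is how I understand the kaikki.org dump, but please check it against the actual file. The table name and the helper names `iter_forms`, `build_index` and `inflections` are made up for illustration and are not part of WikDict or Wiktextract.

```python
import json
import sqlite3


def iter_forms(jsonl_lines):
    """Yield (word, pos, form) triples from Wiktextract-style JSON lines.

    Assumed layout: {"word": ..., "pos": ..., "forms": [{"form": ...}, ...]}
    """
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        word = entry.get("word")
        if not word:
            continue
        for item in entry.get("forms", []):
            form = item.get("form")
            # Skip empty forms and forms identical to the headword.
            if form and form != word:
                yield (word, entry.get("pos"), form)


def build_index(jsonl_lines, db_path=":memory:"):
    """Stream all triples into SQLite and index them by headword."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS form (word TEXT, pos TEXT, form TEXT)")
    # executemany consumes the generator lazily, so the dump never
    # has to fit into RAM as a whole.
    con.executemany("INSERT INTO form VALUES (?, ?, ?)", iter_forms(jsonl_lines))
    con.execute("CREATE INDEX IF NOT EXISTS form_word ON form (word)")
    con.commit()
    return con


def inflections(con, word, pos=None):
    """Return the distinct inflected forms of a headword, optionally per POS."""
    sql = "SELECT DISTINCT form FROM form WHERE word = ?"
    params = [word]
    if pos is not None:
        sql += " AND pos = ?"
        params.append(pos)
    sql += " ORDER BY form"
    return [row[0] for row in con.execute(sql, params)]
```

For the real dump you would pass an open file object and an on-disk path, e.g. `build_index(open(path, encoding="utf-8"), "forms.sqlite")`. Keeping the part of speech in the table costs little and lets the lookup be narrowed to one part of speech when the dictionary entry knows it.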

I think the only bug with this approach is that if inflections apply to only one part of speech, you might add them unnecessarily to all words with the same string. Another problem is removing things like pronouns from the inflections: sometimes you have strings like "erholte sich" in the inflections for "erholen", for example, where you have to remove "sich" to make the form findable by ebook reader lookups. This is something I am currently looking into as well.
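For the pronoun problem, a rough sketch of the kind of cleanup I mean. The pronoun set below is a hand-picked German-only assumption, not something taken from Wiktextract or DBnary, and `strip_pronouns` is a hypothetical helper name; other languages would need their own lists, and a plain token filter like this can also remove tokens that legitimately belong to a form.

```python
# Illustrative, German-only list of reflexive pronouns (an assumption,
# not an exhaustive or authoritative set).
REFLEXIVE_PRONOUNS = frozenset({"mich", "dich", "sich", "uns", "euch"})


def strip_pronouns(form, drop=REFLEXIVE_PRONOUNS):
    """Remove standalone pronoun tokens from an inflected form.

    If nothing would be left (the form is itself a pronoun),
    the original string is returned unchanged.
    """
    kept = [token for token in form.split() if token.lower() not in drop]
    return " ".join(kept) if kept else form


print(strip_pronouns("erholte sich"))
```

With this, "erholte sich" becomes "erholte", which is the string an ebook reader would actually look up.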

Maybe in the future I will also feel motivated to start a pull request 😁.