Vuizur opened 2 years ago
Hi,
I did a lot of work in that direction with the Latin portion of en.wiktionary, using the data on kaikki.org. I created a dictionary from that which also knows all inflected forms of a word and their relation to the lemma, e.g. for "gönnerhaften" it would show that the base lemma is "gönnerhaft" and that the form could be, among others, the accusative singular or the nominative plural.
In the end, I'm building a basic lemmatizer for Latin texts with this: input a Latin text, and it shows you the possible base words and forms for each word in the text.
My approach to processing the data from kaikki.org starts with a preprocessing step. This preprocessing leaves me with a short entry for each lemma, for example:
{"id":"fabula/1","lemma":"fabula","partOfSpeech":"Noun","heads":["{{la-noun|fābula<1>}}"],"inflections":["{{la-ndecl|fābula<1>}}"],"senses":["discourse, narrative","a fable, tale, story","a poem, play","concern, matter","romance"]}
Initially, I then took the original Lua scripts that power Wiktionary's inflection engine for Latin and ran them in a little sandbox so that they don't depend on all the other Wiktionary scripts. I created that sandbox using Fengari so that I can run the whole thing both in the browser and in a backend. That worked really well: I was able to derive all the inflections with those scripts, based only on the inflection template string.
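The basic Fengari plumbing looks roughly like this (a trimmed sketch only; the real sandbox also loads the Wiktionary inflection modules and stubs out the MediaWiki environment they expect, and `runLua` is just an illustrative name):

```typescript
// Fengari mirrors the C Lua API; it ships no TypeScript types, so we require() it loosely.
// eslint-disable-next-line @typescript-eslint/no-var-requires
const { lua, lauxlib, lualib, to_luastring, to_jsstring } = require("fengari");

function runLua(chunk: string, templateArg: string): string {
  const L = lauxlib.luaL_newstate();   // fresh Lua state
  lualib.luaL_openlibs(L);             // load the Lua standard libraries

  // Expose the inflection template string to the chunk as a global named "template".
  lua.lua_pushstring(L, to_luastring(templateArg));
  lua.lua_setglobal(L, to_luastring("template"));

  // Compile and run the chunk; it is expected to return a string.
  if (
    lauxlib.luaL_loadstring(L, to_luastring(chunk)) !== lua.LUA_OK ||
    lua.lua_pcall(L, 0, 1, 0) !== lua.LUA_OK
  ) {
    throw new Error(to_jsstring(lua.lua_tostring(L, -1)));
  }
  return to_jsstring(lua.lua_tostring(L, -1));
}

// Trivial usage; the real chunk would call the Latin inflection module instead.
console.log(runLua(`return "expanded: " .. template`, "{{la-ndecl|fābula<1>}}"));
```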
But that turned out to be too slow: processing the entire Latin part of Wiktionary (10 MB of JSON output from the preprocessor, ~60k words) took about 45 minutes that way.
So in the end, I re-implemented the inflection engine's Lua scripts in TypeScript here, using the output of the original scripts as test vectors to make sure my engine produces exactly the same output. That brought the processing time down to a minute, leaving me with a system for reading a Latin text in the browser, where possible base words, form information and translations are provided for every word to support the reader.
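Conceptually, the test-vector check is just this (a sketch; `expandInflectionTemplate`, `./inflection-engine` and `test-vectors.json` are placeholder names for my own code and data, not public APIs):

```typescript
// Compare the TypeScript engine's output against inflections captured from the
// original Lua scripts, stored as test vectors.
import { readFileSync } from "node:fs";
import { expandInflectionTemplate } from "./inflection-engine"; // placeholder name

interface TestVector {
  template: string;                         // e.g. "{{la-ndecl|fābula<1>}}"
  expectedForms: Record<string, string[]>;  // tag -> forms, as produced by the Lua scripts
}

const vectors: TestVector[] = JSON.parse(readFileSync("test-vectors.json", "utf8"));

for (const v of vectors) {
  const actual = expandInflectionTemplate(v.template);
  if (JSON.stringify(actual) !== JSON.stringify(v.expectedForms)) {
    console.error(`Mismatch for ${v.template}`);
    process.exitCode = 1;
  }
}
```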
I explained the process in detail to support the following claim: I don't think a general solution could work. The languages are very different, and the entire Wiktionary infrastructure is completely different for every language. They all have their own inflection scripts with different parameters, etc., and even their output doesn't use a common format.
I think the most viable approach would be creating preprocessors for each language that convert the Wiktionary dump into a format that removes the differences through abstraction, and then going from there using the original Lua scripts as a base.
Hope that helps!
I think the wiktextract maintainers put a lot of work into the inflection table extraction, and the current state of the data is really good, especially if you consider that it is the first project properly extracting inflections for all languages. But running it probably takes quite a lot of resources.
My main idea for fixing entries like "die gönnerhaften" was to have a function that looks at all inflections consisting of more words than the base lemma and heuristically drops one of these words based on length and Levenshtein distance to the original word (and then to look at how this approach works or doesn't work for each language).
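Roughly like this (just a sketch of the idea; for brevity it only uses the edit distance and ignores the length criterion):

```typescript
// Classic single-row Levenshtein edit distance.
function levenshtein(a: string, b: string): number {
  const dp: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let prev = dp[0]; // dp[i-1][0]
    dp[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const tmp = dp[j]; // dp[i-1][j]
      dp[j] = Math.min(
        dp[j] + 1,                                 // skip a character of a
        dp[j - 1] + 1,                             // insert a character of b
        prev + (a[i - 1] === b[j - 1] ? 0 : 1)     // match or substitute
      );
      prev = tmp;
    }
  }
  return dp[b.length];
}

// Reduce a multiword form to the single token most similar to the lemma.
function stripExtraWords(form: string, lemma: string): string {
  const tokens = form.split(/\s+/);
  if (tokens.length <= 1) return form;
  let best = tokens[0];
  let bestDist = Infinity;
  for (const t of tokens) {
    const d = levenshtein(t.toLowerCase(), lemma.toLowerCase());
    if (d < bestDist) {
      bestDist = d;
      best = t;
    }
  }
  return best;
}

console.log(stripExtraWords("die gönnerhaften", "gönnerhaft")); // -> "gönnerhaften"
```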
I would like to further improve the quality of the extracted inflections. However, the extraction of certain inflections as "-" is intentional: it gives potentially useful information, indicating that the form does not exist or is not attested for that word or in the language. As for the multiword constructions, including them is certainly controversial. The goal is to mark them with the "multiword-construction" tag so that they can easily be removed in applications that don't need them. However, they could be useful in, for example, applications that try to automatically learn aspects of the grammar of the language. That is why I've chosen to include them.
Constructions that differ only by the article are more of a question mark. They should now be marked with "includes-article", and applications can easily skip such forms. However, having so many fairly trivial forms for thousands or tens of thousands of words increases the data size, which makes overall use of it more cumbersome.
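For example, an application could skip them with a small predicate like this (a sketch assuming the usual wiktextract `forms` objects with an optional `tags` array):

```typescript
// Shape of a single form object as emitted in a wiktextract entry's "forms" array.
interface WiktextractForm {
  form: string;
  tags?: string[];
}

const SKIP_TAGS = new Set(["multiword-construction", "includes-article"]);

// Keep a form only if it is a real attested form and carries none of the tags above.
function keepForm(f: WiktextractForm): boolean {
  return f.form !== "-" && !(f.tags ?? []).some((t) => SKIP_TAGS.has(t));
}

// Usage: entry.forms.filter(keepForm)
```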
For many applications, for example the creation of lemmatization lists or dictionaries, it would be super useful to have a post-processed version of the inflections, or something like a function that can be called to do the post-processing. I was thinking of the following features:
The program might be out of scope for Wiktextract, but I think this is at least something that should be solved "centrally" rather than by everyone reimplementing something similar separately. So maybe someone has an idea of how and where to best solve this.
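For illustration, one such post-processing step, building a form → lemma lookup for lemmatization from the JSONL output, could look roughly like this (my own sketch, not an existing wiktextract feature; the input path and the set of skipped tags are just examples):

```typescript
// Build a form -> lemmas lookup table from a kaikki.org/wiktextract JSONL dump.
// Entries are one JSON object per line with a "word" field and a "forms" array.
import { createInterface } from "node:readline";
import { createReadStream } from "node:fs";

const SKIP = new Set(["multiword-construction", "includes-article"]);

async function buildLemmaLookup(path: string): Promise<Map<string, Set<string>>> {
  const lookup = new Map<string, Set<string>>();
  const rl = createInterface({ input: createReadStream(path) });
  for await (const line of rl) {
    if (!line.trim()) continue;
    const entry = JSON.parse(line);
    for (const f of entry.forms ?? []) {
      if (!f.form || f.form === "-") continue;                       // non-existent forms
      if ((f.tags ?? []).some((t: string) => SKIP.has(t))) continue; // trivial variants
      if (!lookup.has(f.form)) lookup.set(f.form, new Set());
      lookup.get(f.form)!.add(entry.word);
    }
  }
  return lookup;
}

// "de-extract.jsonl" is a placeholder path for a language-specific dump.
buildLemmaLookup("de-extract.jsonl").then((m) => console.log(m.get("gönnerhaften")));
```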