Closed by donnerpeter 3 months ago
Solved by referenced pull requests and fixed on kaikki.org.
I see that this is present in JSON as a raw string. Is this compound information available in a more structured format?
These etymology texts are plain text with some links; I don't think they could be converted to any structured data.
I see, too bad :( Thanks for the explanation and your work!
Unless the text is standardized into a format that rarely changes (preferably never), it's really hard for us to manually parse it. We do this sometimes for the English edition for very common boilerplate stuff, but it's a ton of work, it's never perfect, and the target can move. In this case, the text would have to be very consistent, and then someone who knows German would need to code (or at least write a perfect spec for someone to use) a bespoke mini-parser with a ton of regexing and if-then.
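To give a feel for what such a bespoke mini-parser looks like, here is a minimal sketch in Python. It only handles the single phrasing quoted in this thread ("Determinativkompositum aus ... und ..."). The pattern, the function name `parse_compound`, and the list of word-class phrases are all assumptions made up for illustration, not code from the extractor.

```python
import re

# Hypothetical pattern for ONE German phrasing seen in this thread.
# Real etymology sections vary far more than this, which is the problem.
PART = r"(?:dem Stamm des Verbs|dem Substantiv|dem Adjektiv)"
COMPOUND_RE = re.compile(
    r"Determinativkompositum aus "
    + PART + r" (?P<first>\S+)"
    + r" und "
    + PART + r" (?P<second>\S+)"
)


def parse_compound(etymology_text):
    """Return the two compound parts, or None if the text does not match."""
    match = COMPOUND_RE.search(etymology_text)
    if match is None:
        return None
    # Strip trailing punctuation that \S+ may have swallowed.
    return [
        match.group("first").rstrip(".,;"),
        match.group("second").rstrip(".,;"),
    ]


text = (
    "Determinativkompositum aus dem Stamm des Verbs beizen "
    "und dem Substantiv Jagd"
)
print(parse_compound(text))
```

Any entry that words the same fact differently falls through to `None`, so every new phrasing needs another pattern or another if-then branch.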
The easiest place to try to get structured data is from (stable) templates and tables. However, even with tables, people tend to do things so willy-nilly that creating a table extractor is a real pain. The English edition even has some tables that have to be read right-to-left.
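By contrast, a template call already carries its structure in its arguments. The sketch below is a deliberately naive illustration in Python: the sample wikitext, the template name `compound`, and the helper `extract_templates` are hypothetical, and it only handles flat calls with no nested templates or piped links inside the arguments.

```python
import re

# Matches a flat template call like {{name|arg1|arg2}}.
# Nested templates and piped links inside arguments are NOT handled.
TEMPLATE_RE = re.compile(r"\{\{([^{}|]+)((?:\|[^{}|]*)*)\}\}")


def extract_templates(wikitext):
    """Return a list of (template_name, [args]) for flat template calls."""
    results = []
    for match in TEMPLATE_RE.finditer(wikitext):
        name = match.group(1).strip()
        raw_args = match.group(2)
        # raw_args starts with "|", so the first split element is empty.
        args = raw_args.split("|")[1:] if raw_args else []
        results.append((name, [arg.strip() for arg in args]))
    return results


# Hypothetical sample, not taken from any particular edition.
sample = "From {{compound|de|beizen|Jagd}}."
print(extract_templates(sample))
```

This only stays simple as long as editors use the template consistently, which is why stability matters so much.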
Using wikitext markup as the basis for a dictionary makes it accessible for humans to input stuff, but...
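https://de.wiktionary.org/wiki/Beizjagd has "Determinativkompositum aus dem Stamm des Verbs beizen und dem Substantiv Jagd" (roughly: "determinative compound of the stem of the verb beizen and the noun Jagd"), but the JSON doesn't seem to:
https://ru.wiktionary.org/wiki/Krankenversicherung has "От Kranker, элемента -en и Versicherung." (roughly: "From Kranker, the element -en, and Versicherung."), but the JSON doesn't seem to: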