kristian-clausal opened this issue 8 months ago
I totally agree with this sentiment.
However, I want to note that when I initially started working on the other extractors, the consensus was that we shouldn't worry too much about enforcing a common output schema, since we didn't yet know the requirements of the different Wiktionary editions. So I took some liberties introducing new fields and might not have been overly attentive to keeping everything aligned (though I tried).
Perhaps now with the json schemas and outputs of different extractors, it is a good time to revisit the issue and align the data structure as much as possible.
I still want to advocate for allowing extractors to define new fields, since some editions contain unique information or allow for a cleaner and more granular extraction than others.
For the extractors using pydantic, it could make sense to define base models that all extractors should/must inherit from.
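A minimal sketch of what such a shared base could look like; the field names beyond `word`/`lang`/`lang_code` and the subclass are illustrative, not the actual wiktextract schema:

```python
from pydantic import BaseModel


class BaseWordEntry(BaseModel):
    # Fields every edition's extractor would be expected to emit.
    word: str
    lang: str
    lang_code: str
    pos: str = ""
    senses: list = []


class ZhWordEntry(BaseWordEntry):
    # Edition-specific extras live only in the subclass (hypothetical field).
    simplified_form: str = ""
```

Editions would then share the common fields while still being free to add their own.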
Good luck with your endeavors!
I actually got our html generation code to generate a page from a sample of the Chinese data, and the only change needed to make it work was renaming `lang_name` to `lang`. I first tried `jq` (which took a while to figure out), but when generation broke again because translation items needed `lang` too, I just used `sed` to blast the whole file, so there might have been a couple of other `lang_name` != `lang` conflicts somewhere else.

Funnily enough, only one language in the Zh data sample I took (using `shuf` and then tailing that to get 10k entries) had the 500 entries needed to end up on the front page of the dictionary... 世界語, Esperanto! Does Chinese Wiktionary have a really big Esperanto fanbase?
I have some concerns about inheriting models: if some fields in the parent are not used, they will still be included in the schema. And models that have another model in one of their fields have to be defined again, like `WordEntry`.
> there was only one language in the Zh data sample I took 世界語, Esperanto! Does Chinese Wiktionary have a really big Esperanto fanbase?
There are only 314 Esperanto pages in Chinese Wiktionary, it can't be a coincidence...
The number of "senses" in the output dictionary html is apparently 551, but I can't make sense of how that could have happened. The sample is shuffled with `shuf` and then a tail is taken from the sample... I must have made a mistake somewhere.
I get 118675 word-entries in zh-extract.json for `grep "世界語" | wc -l`, out of about 2019041 for the whole thing. This includes a lot of form-of stuff, which multiplies the number of entries.
The number 314 comes from the category https://zh.wiktionary.org/wiki/Category:%E4%B8%96%E7%95%8C%E8%AA%9E, but our output also includes https://zh.wiktionary.org/wiki/Category:%E4%B8%96%E7%95%8C%E8%AA%9E%E9%9D%9E%E8%A9%9E%E5%85%83%E5%BD%A2%E5%BC%8F, which has 113k pages.
Issue with ru-wiktionary data: `{"word": "pljusak", "lang_code": "sr-l", "lang": "", "sense": "сильный дождь"}` is a translation item with `""` as `lang` (a substitution for `lang_name`), which breaks the HTML-generator; but this is because "sr-l" is a nonstandard language code. This could be considered a bug in the html-generation code, so I'll add ways for this to fail gracefully, substituting in the language code instead...
ru-wiktionary extractor: fields like "synonyms", "holonyms", "hyponyms" etc. should not be lists of plain strings, as in `"synonyms": ["not", "like", "this", "example"]`; the list should contain dicts with at minimum a "word" field, because the linked term itself can have its own tags. Examples of how complex these word-dicts can become are at https://kaikki.org/dictionary/errors/mapping/index.html. So the example should be `"synonyms": [{"word": "not"}, {"word": "like"}, {"word": "this"}, {"word": "example"}]`.
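A quick normalization pass for this could look like the following sketch; the function name and the set of fields it touches are mine, not part of any extractor:

```python
def normalize_linkages(entry, fields=("synonyms", "antonyms", "hyponyms", "holonyms")):
    """Wrap bare strings in {"word": ...} dicts so tags can be attached later."""
    for field in fields:
        items = entry.get(field)
        if items:
            entry[field] = [{"word": w} if isinstance(w, str) else w for w in items]
    return entry


entry = {"word": "example", "synonyms": ["not", "like", "this"]}
normalize_linkages(entry)
# entry["synonyms"] == [{"word": "not"}, {"word": "like"}, {"word": "this"}]
```

Items that are already dicts pass through untouched, so the pass is safe to run on mixed data.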
This isn't something I can fix with a quick sed (well I could try, but I bet it would be more trouble than it's worth), so testing ru-wiktionary will have to wait. On to others.
EDIT:
fr-wiktionary worked out perfectly after sedding `lang_name`!
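For reference, the sedding amounts to something like this one-liner; the filenames are hypothetical, and this blunt approach assumes the string `"lang_name"` never appears inside a value:

```shell
# Rename the field across the whole JSON Lines dump.
sed 's/"lang_name":/"lang":/g' fr-extract.json > fr-extract-renamed.json
```

A `jq` rewrite would be safer (it only touches actual keys), but for these dumps the sed blast is much faster.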
de.wiktionary data: similarly to the Russian data (and probably the same in ru.wiktionary too), `derived` should also contain dict objects (minimum `{"word": "xxxx"}`) instead of strings.
EDIT:
es.wiktionary: senseids should be strings, not integers.
EDIT:
es.wiktionary: "ipa" fields in "sounds" items (dicts in a list) should be a string, not a list of strings.
EDIT:
I also came across translation data that didn't have a lang_name/lang field at all, but that's a bit more open to interpretation. As long as one of "lang" or "lang_code" is present, I guess it is minimally sufficient... but for readability, it would be good to have a "lang" field. I don't know what the best thing to do is in cases with "broken" or unusual/nonstandard language codes that could not be mapped properly (at the time the code ran), but this should be standardised: `"lang": "UNKNOWN_LANG_CODE"`, or perhaps `"lang": "lc"`, `"lang": ""`, or leaving "lang" out are all possible.
I think both the French and Chinese extractors always have the "lang" field. The Chinese extractor might only be missing "lang_code" when `name_to_code` returns an empty string, and the French extractor should always have both "lang" and "lang_code".
es.wiktionary: etymology_templates items should have an `args` field, although I'm not sure if that `args` field can be an empty dict. At least in this case there definitely is a `leng` parameter that would appear in `args`.
EDIT:
Weirdly, this is now fixed..?
> es.wiktionary: "ipa" fields in "sounds" items (dicts in a list) should be a string, not a list of strings.

> I also came across translation data that didn't have a lang_name/lang field at all, but that's a bit more open to interpretation. As long as one of lang or lang_code is present, I guess it is minimally sufficient... But for readability, it would be good to have a "lang" field. I don't know what's the best thing to do in cases with "broken" or unusual/nonstandard language codes that were not possible to map properly (at the time of the code running), but this should be either standardised: "lang": "UNKNOWN_LANG_CODE", or perhaps "lang": "lc", "lang": "" or leaving "lang" out are all possible.
Please feel free to take a closer look at these yourselves. I won't address them right now since I suspect they are more than a quick fix. (At the very least it requires checking whether the way the Spanish Wiktionary provides pronunciation data can sensibly be separated into sounds with unique ipa. It might just be an oversight on my part, but without closer examination it's hard to tell.)
Tatu thinks that for `lang` the option is to have an explicit `lang` field with a human-readable message like "Unknown ([language code])".
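That fallback is easy to apply at generation time; a sketch (the function name is mine):

```python
def ensure_lang(item):
    """Fill in a human-readable fallback when no language name was resolved."""
    if not item.get("lang"):
        code = item.get("lang_code", "")
        item["lang"] = f"Unknown ({code})" if code else "Unknown"
    return item


ensure_lang({"word": "pljusak", "lang_code": "sr-l", "lang": ""})
# -> {"word": "pljusak", "lang_code": "sr-l", "lang": "Unknown (sr-l)"}
```

This keeps the HTML generation from choking on empty or missing `lang` values while still surfacing the raw code for debugging.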
The final issue that prevented es-data from running on (almost) vanilla html-generation code is the structure of sound data:
```python
{'word': 'brown', 'pos': 'adj', 'pos_title': 'adjetivo', 'lang_code': 'en', 'lang': 'Inglés',
 'sounds': [{'phonetic_transcription': ['bɹaʊn'],
             'audio': ['En-uk-brown.ogg', 'en-us-brown.ogg'],
             'ogg_url': ['https://commons.wikimedia.org/wiki/Special:FilePath/En-uk-brown.ogg',
                         'https://commons.wikimedia.org/wiki/Special:FilePath/en-us-brown.ogg'],
             'mp3_url': ['https://upload.wikimedia.org/wikipedia/commons/transcoded/7/7b/En-uk-brown.ogg/En-uk-brown.ogg.mp3',
                         'https://upload.wikimedia.org/wikipedia/commons/transcoded/2/29/En-us-brown.ogg/En-us-brown.ogg.mp3']}],
 'etymology_text': 'Del inglés antiguo brūn.',
 'etymology_templates': [{'name': 'etimología', 'args': {'leng': 'en', '1': 'ang', '2': 'brūn'}, 'expansion': 'Del inglés antiguo brūn'},
                         {'name': 'Oxford-EN', 'expansion': '«brown», Lexico. Dictionary.com; Oxford University Press.'}]}
```
from https://es.wiktionary.org/wiki/brown
The html-generation code assumes "audio" and "mp3_url" are strings (which breaks when joining them with `+`, because you can't mix lists and strs). Though this seems like a minor thing, these should probably be done some other way: "audio" and "mp3_url" are singular, and fields with singular names are generally single objects (usually strings). If these were lists, they would be "audios" and "mp3_urls". In this case the 'solution' (not a good one, but the 'correctest' one) is to have two dicts in `sounds` that replicate some of the data. So it should be like this:
```python
{'sounds':
 [{'phonetic_transcription': ['bɹaʊn'], 'audio': 'En-uk-brown.ogg', 'ogg_url': 'https://commons.wikimedia.org/wiki/Special:FilePath/En-uk-brown.ogg', 'mp3_url': 'https://upload.wikimedia.org/wikipedia/commons/transcoded/7/7b/En-uk-brown.ogg/En-uk-brown.ogg.mp3'},
  {'phonetic_transcription': ['bɹaʊn'], 'audio': 'en-us-brown.ogg', 'ogg_url': 'https://commons.wikimedia.org/wiki/Special:FilePath/en-us-brown.ogg', 'mp3_url': 'https://upload.wikimedia.org/wikipedia/commons/transcoded/2/29/En-us-brown.ogg/En-us-brown.ogg.mp3'}
 ]
}
```
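Splitting such an entry can be done mechanically; a sketch of the transformation, assuming `audio`, `ogg_url` and `mp3_url` are the parallel lists (which is all I've seen so far):

```python
def split_sounds(sound):
    """Turn one sound dict with parallel lists into one dict per audio file."""
    parallel = ("audio", "ogg_url", "mp3_url")
    lists = {k: sound[k] for k in parallel if isinstance(sound.get(k), list)}
    if not lists:
        return [sound]  # already a single-audio dict, nothing to split
    # Fields like phonetic_transcription are copied into every resulting dict.
    common = {k: v for k, v in sound.items() if k not in lists}
    n = max(len(v) for v in lists.values())
    return [
        {**common, **{k: v[i] for k, v in lists.items() if i < len(v)}}
        for i in range(n)
    ]
```

If the parallel lists have unequal lengths, the shorter ones simply omit the field in the trailing dicts rather than raising.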
Having consistent field-name pluralization rules is really handy. There are a couple of nagging exceptions, like "derived", which doesn't have a plural form and should have been "derived_terms" instead, and there are possibly some mistakes or exceptions I don't know about or have forgotten.
EDIT:
ru-wiktionary: in sense data (items in `senses` of the word), gloss data should be in the form `'glosses': ['higher level string', 'more specific string']`. If you have a gloss like:
1. An example gloss:
1.2. A more specific gloss that is the actual entry
the list is needed to show the 'hierarchy'.
EDIT:
Besides that, I got ru-wiktionary to generate a site, except the glosses were missing because of the above. Good thing it was so visible; otherwise I wouldn't have noticed at all.
The last of the changes needed to make the kaikki HTML generation work with each json output is now done, and I've tested out all of the outputs (well, 10k json object samples of them, which should be enough).
Next week, I'll start to work on actually implementing the different websites... The HTML-generating code needs to be made more edition-agnostic (i.e. links to wiktionary should point to "xx.wiktionary.org", not en.wiktionary.org, that sort of thing), and after that I need to tackle some bash scripting. :cry: Tatu said he'd hold my hand with that, so we'll see how it goes.
If all goes well, we might soon see individual online dictionaries for each extractor, including error data and the json output mapping stuff.
Have a good weekend, I have a hot, hot date with a bowl of soup. The weather has been jumping between +2 celsius and -26, and now it's back to a balmy -10.
I think it is important for downstream usability of the data that the editions be as consistent as possible: same fields, same parts-of-speech, same tags. Yes, there are things in some editions that are not present in others. In those cases we can define additional fields, tags, or even parts-of-speech, but this should only be done when the data cannot be reasonably described using existing mechanisms.
I noticed that the English translations have the key "code" for the items of the translation array, whereas other languages such as Spanish have the key "lang_code". (I think lang_code would maybe be more consistent, as it is also used for the language code of the entry.)
I didn't notice that the en edition uses "code" for translation data when I was writing the new extractor code, so I used the same `lang_code` field name as the root entry field. This is probably the last inconsistent field between the en and non-en editions.
Tatu says: "Because the output from the original extractor is being used with other projects that rely on it being "stable", changes to it need to be minimized, while all the outputs from all the projects need to be as close as possible."
I've started on trying to use our html-generation code to create websites from the data extracted with the other extractors, so I will be just posting here issues as they come along:
The first breaking difference is simply that "lang_name" in the word base data differs from "lang". Because "lang" is used in the original output and there is nothing wrong with it, the direction of change here is "lang_name" -> "lang". This change should be pretty simple.
I will continue with trying to make html generation work with the other extracted data and will post here stuff as things come along.