tatuylonen / wiktextract

Wiktionary dump file parser and multilingual data extractor

Output from subextractors need to be a bit more closely aligned with original output #448

Open kristian-clausal opened 8 months ago

kristian-clausal commented 8 months ago

Tatu says: "Because the output from the original extractor is being used with other projects that rely on it being "stable", changes to it need to be minimized, while all the outputs from all the projects need to be as close as possible."

I've started on trying to use our html-generation code to create websites from the data extracted with the other extractors, so I will be just posting here issues as they come along:

The first breaking difference is just that "lang_name" in word base data is different from "lang". Because "lang" is used in the original output and there is nothing wrong with it (not really, it is perfectly fine as is) the direction of change here is "lang_name" -> "lang". This change should be pretty simple.
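A rough sketch of that rename, assuming the extractor output is JSON Lines with one word entry per line and that only the top-level key needs to change. The helper name `rename_lang_name` and the sample entry are made up for illustration and are not part of wiktextract:

```python
import json


def rename_lang_name(entry: dict) -> dict:
    """Return a copy of a word entry with "lang_name" renamed to "lang".

    Hypothetical helper: an existing "lang" value is never overwritten.
    """
    fixed = dict(entry)
    if "lang_name" in fixed and "lang" not in fixed:
        fixed["lang"] = fixed.pop("lang_name")
    return fixed


# Made-up sample line in the shape of a word entry.
line = '{"word": "chien", "lang_code": "fr", "lang_name": "Français"}'
entry = rename_lang_name(json.loads(line))
print(json.dumps(entry, ensure_ascii=False))
```

Doing the rename on parsed JSON rather than on raw text avoids touching any gloss or example that happens to contain the string "lang_name".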

I will continue with trying to make html generation work with the other extracted data and will post here stuff as things come along.

empiriker commented 8 months ago

I totally agree with this sentiment.

However, I want to comment that initially when I started working on other extractors the consensus was that we shouldn't worry too much about enforcing a common output schema initially since we didn't know the requirements of the different Wiktionary editions. So I took some liberties introducing new fields and might not have been overly attentive to keeping everything aligned (though I tried).

Perhaps now with the json schemas and outputs of different extractors, it is a good time to revisit the issue and align the data structure as much as possible.

I still want to advocate for allowing extractors to define new fields, since some editions contain unique information or allow for a cleaner and more granular extraction than others.

For the extractors using pydantic, it could make sense to define base models that all extractors should/must inherit from.

Good luck with your endeavors!

kristian-clausal commented 8 months ago

I actually got our html generation code to generate a page from a sample of the Chinese data, and the only changes needed to make it work were:

Screenshot at 2024-01-04 11-14-28

Funnily enough, there was only one language in the Zh data sample I took (using shuf and then tailing that to get 10k entries) that had the 500 entries needed to end up on the front page of the dictionary... 世界語, Esperanto! Does Chinese Wiktionary have a really big Esperanto fanbase?

xxyzz commented 8 months ago

I have some concerns about inheriting from a base model: if some fields in the parent are not used, they will still be included in the schema. Also, models that have another model in one of their fields, like WordEntry, have to be defined again.

xxyzz commented 8 months ago

> there was only one language in the Zh data sample I took 世界語, Esperanto! Does Chinese Wiktionary have a really big Esperanto fanbase?

There are only 314 Esperanto pages in Chinese Wiktionary, it can't be a coincidence...

kristian-clausal commented 8 months ago

> there was only one language in the Zh data sample I took 世界語, Esperanto! Does Chinese Wiktionary have a really big Esperanto fanbase?
>
> There are only 314 Esperanto pages in Chinese Wiktionary, it can't be a coincidence...

The number of "senses" in the output dictionary html is apparently 551, but I can't make sense of how this could have happened. The sample is shuffled with shuf and then a tail is taken from it... I must have made a mistake somewhere.

kristian-clausal commented 8 months ago

I get 118675 word-entries in zh-extract.json for grep "世界語" | wc -l, out of about 2019041 for the whole thing. This includes a lot of form-of stuff, which multiplies the number of entries.

kristian-clausal commented 8 months ago

The number 314 comes from the category https://zh.wiktionary.org/wiki/Category:%E4%B8%96%E7%95%8C%E8%AA%9E, but our data also includes the category https://zh.wiktionary.org/wiki/Category:%E4%B8%96%E7%95%8C%E8%AA%9E%E9%9D%9E%E8%A9%9E%E5%85%83%E5%BD%A2%E5%BC%8F, which has 113k pages.

kristian-clausal commented 8 months ago

Issue with ru-wiktionary data: {"word": "pljusak", "lang_code": "sr-l", "lang": "", "sense": "сильный дождь"} is a translation item with "" as "lang" (substitution for lang_name), which breaks the HTML-generator, but this is because "sr-l" is a nonstandard language code. This could be considered a bug in the html-generation code, so I'll add ways for this to fail gracefully, substituting in the language code instead...

kristian-clausal commented 8 months ago

ru-wiktionary extractor: fields like "synonyms", "holonyms", "hyponyms" etc. should not be lists of plain strings: "synonyms": ["not", "like", "this", "example"]. Instead, the list should contain dicts with at least a "word" field, because the alternative term itself can have its own tags. Examples of how complex these word-dicts can become are at https://kaikki.org/dictionary/errors/mapping/index.html

So the example should be "synonyms": [{"word": "not"}, {"word": "like"}, {"word": "this"}]
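A minimal sketch of that conversion, assuming the data has already been parsed into a dict and that the affected fields hold flat lists. `normalize_linkages` and `LINKAGE_FIELDS` are illustrative names, not existing wiktextract code, and the field tuple is only the subset named above:

```python
# Illustrative subset of the fields mentioned above, not an exhaustive list.
LINKAGE_FIELDS = ("synonyms", "holonyms", "hyponyms")


def normalize_linkages(data: dict) -> dict:
    """Return a copy where plain-string linkage items become {"word": ...} dicts.

    Items that are already dicts are passed through unchanged.
    """
    fixed = dict(data)
    for field in LINKAGE_FIELDS:
        if field in fixed:
            fixed[field] = [
                {"word": item} if isinstance(item, str) else item
                for item in fixed[field]
            ]
    return fixed


sample = {"word": "example", "synonyms": ["not", "like", "this"]}
normalized = normalize_linkages(sample)
print(normalized["synonyms"])
```

Wrapping each string in a dict leaves room for per-term fields such as tags to be added later without changing the type of the list items.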

This isn't something I can fix with a quick sed (well I could try, but I bet it would be more trouble than it's worth), so testing ru-wiktionary will have to wait. On to others.

kristian-clausal commented 8 months ago

fr-wiktionary worked out perfectly after sedding lang_name!

Screenshot at 2024-01-04 14-11-01

kristian-clausal commented 8 months ago

de.wiktionary data: similarly to the Russian one (and probably the same in ru.wiktionary data too), derived should also have dict objects (minimum {"word": "xxxx"}) instead of strings.

kristian-clausal commented 8 months ago

es.wiktionary: senseids should be strings, not integers.

kristian-clausal commented 8 months ago

es.wiktionary: "ipa" fields in "sounds" items (dicts in a list) should be a string, not a list of strings.

EDIT:

I also came across translation data that didn't have a lang_name/lang field at all, but that's a bit more open to interpretation. As long as one of lang or lang_code is present, I guess it is minimally sufficient... But for readability, it would be good to have a "lang" field. I don't know what the best thing to do is in cases with "broken" or unusual/nonstandard language codes that could not be mapped properly (at the time of the code running), but whatever we choose should be standardised. "lang": "UNKNOWN_LANG_CODE", "lang": "lc" (the code itself), "lang": "" or leaving "lang" out are all possible.

xxyzz commented 8 months ago

I think both the French and Chinese extractors always have a "lang" field. The Chinese extractor might have only "lang" if name_to_code returns an empty str, and the French extractor should always have both "lang" and "lang_code".

kristian-clausal commented 8 months ago

es.wiktionary: etymology_templates items should have an args field, although I'm not sure if that args field can be an empty dict. At least in this case there definitely is a leng parameter that would appear in args.

EDIT:

Weirdly, this is now fixed..?

empiriker commented 8 months ago

> es.wiktionary: "ipa" fields in "sounds" items (dicts in a list) should be a string, not a list of strings.
>
> I also came across translation data that didn't have a lang_name/lang field at all, but that's a bit more open to interpretation. As long as one of lang or lang_code is present, I guess it is minimally sufficient... But for readability, it would be good to have a "lang" field. I don't know what's the best thing to do in cases with "broken" or unusual/nonstandard language codes that were not possible to map properly (at the time of the code running), but this should be either standardised: "lang": "UNKNOWN_LANG_CODE", or perhaps "lang": "lc", "lang": "" or leaving "lang" out are all possible.

Please feel free to take a closer look at these yourselves. I won't address these right now since I suspect they are more than a quick fix. (At the very least, it requires checking whether the way the Spanish Wiktionary provides pronunciation data can be sensibly separated into sounds with a unique ipa. It might just as well be an oversight on my part, but without closer examination it's hard to tell.)

kristian-clausal commented 8 months ago

Tatu thinks that for lang the option is to have an explicit lang field with a human-readable message like "Unknown ([language code])".
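One way the fallback could look on the consumer side, as a sketch only. `display_lang` is a hypothetical helper, and returning a bare "Unknown" when there is no code at all is my own assumption rather than anything decided in this thread:

```python
def display_lang(item: dict) -> str:
    """Pick a human-readable language label for a translation item.

    Uses "lang" when it is non-empty, otherwise falls back to
    "Unknown (<lang_code>)". The bare "Unknown" case is an assumption.
    """
    lang = item.get("lang") or ""
    if lang.strip():
        return lang
    code = item.get("lang_code") or ""
    return f"Unknown ({code})" if code else "Unknown"


# The translation item from the ru-wiktionary report earlier in the thread.
item = {"word": "pljusak", "lang_code": "sr-l", "lang": "", "sense": "сильный дождь"}
label = display_lang(item)
print(label)
```

The same string could equally be written into "lang" by the extractor itself, so that downstream code never sees an empty value.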

kristian-clausal commented 8 months ago

The final issue that prevented es-data from running on (almost) vanilla html-generation code is the structure of sound data:

{'word': 'brown', 'pos': 'adj', 'pos_title': 'adjetivo', 'lang_code': 'en', 'lang': 'Inglés', 'sounds': [{'phonetic_transcription': ['bɹaʊn'], 'audio': ['En-uk-brown.ogg', 'en-us-brown.ogg'], 'ogg_url': ['https://commons.wikimedia.org/wiki/Special:FilePath/En-uk-brown.ogg', 'https://commons.wikimedia.org/wiki/Special:FilePath/en-us-brown.ogg'], 'mp3_url': ['https://upload.wikimedia.org/wikipedia/commons/transcoded/7/7b/En-uk-brown.ogg/En-uk-brown.ogg.mp3', 'https://upload.wikimedia.org/wikipedia/commons/transcoded/2/29/En-us-brown.ogg/En-us-brown.ogg.mp3']}], 'etymology_text': 'Del inglés antiguo brūn.', 'etymology_templates': [{'name': 'etimología', 'args': {'leng': 'en', '1': 'ang', '2': 'brūn'}, 'expansion': 'Del inglés antiguo brūn'}, {'name': 'Oxford-EN', 'expansion': '«brown», Lexico. Dictionary.com; Oxford University Press.'}]}

from https://es.wiktionary.org/wiki/brown

The html-generation code assumes "audio" and "mp3_url" are strings (which breaks when joining them with + because you can't mix lists and strs), and though this seems like a minor thing, these should probably be done some other way: "audio" and "mp3_url" are in the singular, and generally fields with singular field names are single objects (usually strings). If these were lists, they would be "audios" and "mp3_urls", but in this case the 'solution' (not a good one, but the 'correctest' one) is to have two dicts in sounds that replicate some of the data. So it should be like this:

{'sounds':
[{'phonetic_transcription': ['bɹaʊn'], 'audio': 'En-uk-brown.ogg', 'ogg_url': 'https://commons.wikimedia.org/wiki/Special:FilePath/En-uk-brown.ogg', 'mp3_url': 'https://upload.wikimedia.org/wikipedia/commons/transcoded/7/7b/En-uk-brown.ogg/En-uk-brown.ogg.mp3'}, 
{'phonetic_transcription': ['bɹaʊn'], 'audio': 'en-us-brown.ogg', 'ogg_url': 'https://commons.wikimedia.org/wiki/Special:FilePath/en-us-brown.ogg', 'mp3_url': 'https://upload.wikimedia.org/wikipedia/commons/transcoded/2/29/En-us-brown.ogg/En-us-brown.ogg.mp3'}
]
}
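A sketch of how such a split could be done after the fact, assuming the list-valued fields are parallel (index 0 of "audio", "ogg_url" and "mp3_url" all refer to the same file). `split_sound` and `PER_FILE_KEYS` are hypothetical names, not part of the es extractor:

```python
# Fields assumed to hold one value per audio file (parallel lists).
PER_FILE_KEYS = ("audio", "ogg_url", "mp3_url")


def split_sound(sound: dict) -> list:
    """Split a sound dict with list-valued audio fields into one dict per file.

    Fields that are not per-file (e.g. "phonetic_transcription") are
    replicated into every resulting dict.
    """
    lists = {k: sound[k] for k in PER_FILE_KEYS if isinstance(sound.get(k), list)}
    if not lists:
        return [sound]  # already in the single-value shape
    shared = {k: v for k, v in sound.items() if k not in lists}
    count = max(len(values) for values in lists.values())
    if count == 0:
        return [shared]  # only empty lists: keep the shared data
    result = []
    for i in range(count):
        item = dict(shared)
        for key, values in lists.items():
            if i < len(values):
                item[key] = values[i]
        result.append(item)
    return result


sound = {
    "phonetic_transcription": ["bɹaʊn"],
    "audio": ["En-uk-brown.ogg", "en-us-brown.ogg"],
    "ogg_url": [
        "https://commons.wikimedia.org/wiki/Special:FilePath/En-uk-brown.ogg",
        "https://commons.wikimedia.org/wiki/Special:FilePath/en-us-brown.ogg",
    ],
}
sounds = split_sound(sound)
print(len(sounds), [s["audio"] for s in sounds])
```

If the lists ever turn out not to be parallel, pairing by index would silently mismatch files and URLs, so doing the split inside the extractor, where the association is known, is probably the safer place.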

Having consistent field name pluralization rules is really handy. There are a couple of nagging exceptions, like "derived", which doesn't have a plural form and should have been "derived_terms" instead, and there are possibly some mistakes or exceptions I don't know or forgot about.

EDIT:

Screenshot at 2024-01-05 13-04-19

kristian-clausal commented 8 months ago

ru-wiktionary: in sense data (item in senses of the word) gloss data should be in the form of 'glosses': ['higher level string', 'more specific string']. If you have a gloss like:

1. An example gloss:
  1.2. A more specific gloss that is the actual entry

the list is needed to show the 'hierarchy'.
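To illustrate the shape, here is a small sketch that turns a nested gloss outline into sense dicts with hierarchical 'glosses' lists. The input format ("gloss" plus "children") and the helper `flatten_glosses` are invented for this example, and emitting a sense only for the innermost items is an assumption, not necessarily what any extractor does:

```python
def flatten_glosses(node: dict, parents: tuple = ()) -> list:
    """Return one sense dict per leaf, with 'glosses' ordered from general to specific."""
    path = parents + (node["gloss"],)
    children = node.get("children", [])
    if not children:
        return [{"glosses": list(path)}]
    senses = []
    for child in children:
        senses.extend(flatten_glosses(child, path))
    return senses


# Hypothetical outline matching the example above.
tree = {
    "gloss": "An example gloss:",
    "children": [
        {"gloss": "A more specific gloss that is the actual entry"},
    ],
}
senses = flatten_glosses(tree)
print(senses)
```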

EDIT:

Besides that, I got ru-wiktionary to generate a site, except the glosses were missing because of the above. Good thing it was noticeable, otherwise I wouldn't have even noticed.

Screenshot at 2024-01-05 13-56-19

kristian-clausal commented 8 months ago

The last of the changes needed to make the kaikki HTML generation work with each json output is now done, and I've tested out all of the outputs (well, 10k json object samples of them, which should be enough).

Next week, I'll start to work on actually implementing the different websites... The HTML-generating code needs to be made more edition-agnostic (i.e. links to wiktionary should be to "xx.wiktionary.org", not en.wiktionary.org, that sort of thing), and after that I need to tackle some bash scripting. :cry: Tatu said he'd hold my hand with that, so we'll see how it goes.

If all goes well, we might soon see individual online dictionaries for each extractor, including error data and the json output mapping stuff.

Have a good weekend, I have a hot, hot date with a bowl of soup. The weather has been jumping between +2 celsius and -26, and now it's back to a balmy -10.

tatuylonen commented 7 months ago

I think it is important for downstream usability of the data that the editions be as consistent as possible - same fields, same parts-of-speech, same tags - as much as possible. Yes there are things in some editions that are not present in others. In these cases, we can define additional fields, tags, or even parts-of-speech - but this should only be done when the data cannot be reasonably described using existing mechanisms.

Vuizur commented 7 months ago

I noticed that the English translations have the key "code" for the items of the translation array, whereas other languages such as Spanish have the key "lang_code". (I think lang_code would maybe be more consistent, as it is also used for the language code of the entry.)
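Until the key is unified, downstream code can read either one. A minimal sketch, where `translation_lang_code` is a hypothetical helper and the sample items are made up:

```python
from typing import Optional


def translation_lang_code(translation: dict) -> Optional[str]:
    """Return the language code of a translation item.

    Prefers "lang_code" and falls back to "code"; returns None if
    neither key carries a non-empty value.
    """
    return translation.get("lang_code") or translation.get("code") or None


en_style = {"word": "perro", "code": "es", "lang": "Spanish"}
es_style = {"word": "dog", "lang_code": "en", "lang": "Inglés"}
codes = [translation_lang_code(en_style), translation_lang_code(es_style)]
print(codes)
```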

xxyzz commented 2 months ago

I didn't notice that the en edition uses "code" for translation data when I was writing the new extractor code, so I used the same lang_code field name as the root entry field. This is probably the last inconsistent field between the en and non-en editions.