tatuylonen / wiktextract

Wiktionary dump file parser and multilingual data extractor

Zhexamples #628

Closed kristian-clausal closed 1 month ago

kristian-clausal commented 2 months ago

I've added one more branch to the extract_example multi-line if-tree to account for how Template:zh-x formats its examples. Because it's a special case, it's simpler than usual (just using classify_desc to put each line into a box).

For example in: https://en.wiktionary.org/wiki/%E6%9C%AA%E9%9B%A8%E7%B6%A2%E7%B9%86

Note that each line is split on square brackets because [ interferes with classify_desc, most probably on purpose (it classifies it as 'other' instead of 'romanization'). zh-x uses a lot of "text [Specific Chinese language, trad. or simp.]" style qualifiers, which we're going to ignore and add as part of the text itself. Examples inside senses don't have a tags field, and I don't want to add them unless there's a lot more need for it.
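A rough sketch of the bracket handling described above, for illustration only. `toy_classify` is a hypothetical stand-in, not wiktextract's `classify_desc`, and the sample strings are made up: the line is split on square brackets, only the non-bracket part is classified, and the qualifier stays in the stored text.

```python
import re


def split_on_brackets(line: str) -> list[str]:
    """Split 'text [qualifier]' into ['text', '[qualifier]']."""
    parts = re.split(r"(\[[^\]]*\])", line)
    return [p.strip() for p in parts if p.strip()]


def toy_classify(piece: str) -> str:
    """Hypothetical stand-in for a real classifier: CJK text vs. everything else."""
    if any("\u4e00" <= ch <= "\u9fff" for ch in piece):
        return "cjk"
    return "latin"


def bucket_line(line: str) -> tuple[str, str]:
    """Classify using only the non-bracket part, but keep the full line as text."""
    pieces = split_on_brackets(line)
    plain = [p for p in pieces if not p.startswith("[")]
    kind = toy_classify(plain[0]) if plain else "other"
    return kind, line.strip()


if __name__ == "__main__":
    for sample in ["未雨綢繆 [MSC, trad.]", "wèiyǔchóumóu [Pinyin]"]:
        print(bucket_line(sample))
```

The qualifier never reaches the classifier, so the `[` cannot skew the result, and nothing is lost from the example text.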

The [ also broke the heuristics for the original code, so I also added a negative condition regarding zh-x up in the first if condition to counteract that. If there are more Chinese templates like this, or maybe if all Chinese examples have this exact format, then we can expand the condition or make it more general (with "lang_code == 'zh'", for example).

It took me too long messing with this code to realize it would be a royal mess to integrate this stuff into the 'general' branches, and adding a special case isn't going to be that expensive. Just makes the code even longer.

xxyzz commented 2 months ago

The "etymology_text" field is still missing from the Chinese section JSON data for the page "作"; there might be a bug in the code that adds the extracted etymology data to the final dictionary variable.

kristian-clausal commented 2 months ago

I tested this with a stripped-down version of the article, but didn't try the whole thing. Using the whole Chinese section, there's no etymology data.

kristian-clausal commented 2 months ago

The issue with 作 seems to be that Glyph origin, Etymology and Pronunciation are all on the same level. My minimal zuo.txt didn't have the Pronunciation titles + pron subsections, which is why the etymologies got associated with its definitions.

Chinese articles look like this:

===Glyph Origin===
...
===Etymology===
...
===Pronunciation 1===
...
====Definitions====
...
===Pronunciation 2===
...
====Definitions====

Currently it seems etymology information for POS sections is not duplicated and the first POS section on the page receives it.
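To make the layout problem concrete, here is a minimal sketch of one way the sibling Etymology could be carried over to every later Definitions section. This is not what wiktextract does; the function names and the output dictionary keys are invented for the illustration.

```python
import re

HEADING_RE = re.compile(r"^(=+)\s*(.*?)\s*(=+)$")


def parse_headings(wikitext: str) -> list[tuple[int, str]]:
    """Return (level, title) for every balanced ==Heading== line."""
    headings = []
    for line in wikitext.splitlines():
        m = HEADING_RE.match(line.strip())
        if m and len(m.group(1)) == len(m.group(3)):
            headings.append((len(m.group(1)), m.group(2)))
    return headings


def attach_etymology(headings: list[tuple[int, str]]) -> list[dict]:
    """Pair each Definitions section with the latest Etymology and Pronunciation."""
    current_etym = None
    current_pron = None
    result = []
    for _level, title in headings:
        if title.startswith("Etymology"):
            current_etym = title
            current_pron = None  # a new etymology starts a fresh group
        elif title.startswith("Pronunciation"):
            current_pron = title
        elif title == "Definitions":
            result.append(
                {"pos": title, "etymology": current_etym, "pronunciation": current_pron}
            )
    return result


SAMPLE = """===Glyph Origin===
...
===Etymology===
...
===Pronunciation 1===
...
====Definitions====
...
===Pronunciation 2===
...
====Definitions====
"""

if __name__ == "__main__":
    for entry in attach_etymology(parse_headings(SAMPLE)):
        print(entry)
```

Tracking "last seen" sections instead of relying on parent/child nesting means both Definitions sections get the same etymology, and each keeps its own pronunciation.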

Ok, I found a Chinese article where the Pronunciation is a level 4 ====Pronunciation====.

A hack might be to make the Pronunciation section a level 4 when it's right after an Etymology template. The other level 4 sections after that (under the Pronunciation template previously) will now be children of the Etymology node, and other Pronunciation sections will be left alone.

EDIT:

Well, that didn't work.

xxyzz commented 2 months ago

The etymology section data are added now (this also restores the etymology data in the Japanese section), but the first pronunciation section's data are added to the second "definitions" POS section, and the second pronunciation section's data are added to the first POS section.

xxyzz commented 1 month ago

All issues seem to be fixed. Should we merge this, or do you want to test the code on more pages?

kristian-clausal commented 1 month ago

I don't trust this at all; I had so much trouble getting it to work. So I will try to diff (or whatever is appropriate for JSON) the output of the main branch against the output of this branch. Thanks for taking a look!

kristian-clausal commented 1 month ago

I FINALLY got the jsondiff thing to work. The script is slow as molasses, and I lost time figuring out why things didn't work because several minor bugs compounded each other (for example: a continue at the wrong indentation level, and naming the file 'jsondiff', which broke the import of jsondiff...). Then I had to re-extract everything because of unrelated changes to the code (ignored etymology templates causing diffs, of course)... But looking at the diff right now, and after adding some new section terms in a couple of places, it seems this is where I want it to be!!!
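For anyone wanting to repeat this kind of check, here is a minimal sketch of a branch-to-branch comparison using only the standard library. It is not the script used above. It assumes one JSON object per line with "word", "lang_code" and "pos" fields, and the function names are made up. If you use the jsondiff package instead, avoid saving your script as jsondiff.py, since Python would then import the script itself rather than the package.

```python
import json


def load_entries(lines):
    """Group JSONL entries by an assumed (word, lang_code, pos) key."""
    entries = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        data = json.loads(line)
        key = (data.get("word"), data.get("lang_code"), data.get("pos"))
        entries.setdefault(key, []).append(data)
    return entries


def changed_fields(old_list, new_list):
    """Names of top-level fields that differ between paired entries."""
    fields = set()
    for old, new in zip(old_list, new_list):
        for name in set(old) | set(new):
            if old.get(name) != new.get(name):
                fields.add(name)
    if len(old_list) != len(new_list):
        fields.add("<entry count>")
    return sorted(fields)


def diff_entries(old, new):
    """Map each differing key to 'added', 'removed', or a list of changed fields."""
    changes = {}
    for key in sorted(set(old) | set(new), key=repr):
        if key not in old:
            changes[key] = "added"
        elif key not in new:
            changes[key] = "removed"
        elif old[key] != new[key]:
            changes[key] = changed_fields(old[key], new[key])
    return changes


if __name__ == "__main__":
    main_out = ['{"word": "作", "lang_code": "zh", "pos": "verb"}']
    branch_out = [
        '{"word": "作", "lang_code": "zh", "pos": "verb", "etymology_text": "..."}'
    ]
    print(diff_entries(load_entries(main_out), load_entries(branch_out)))
```

Reporting only the names of the changed fields keeps the output short enough to skim, which helps when unrelated changes also show up in the diff.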