tatuylonen / wiktextract

Wiktionary dump file parser and multilingual data extractor

Use example template args to determine example #615

Closed kristian-clausal closed 2 months ago

kristian-clausal commented 2 months ago

Previously, we would expand everything with clean_node and use heuristics to check whether it was an example or not.

However, since we have access to the template arguments, we can bypass some of these heuristics early: if the text being checked contains a single template, that template is in the set of example templates, and its arguments match part of the clean_node output, we can simply assume the text is an example.
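A minimal sketch of the check described above, not the actual wiktextract code. `TemplateCall`, `looks_like_example`, the contents of `EXAMPLE_TEMPLATES`, and the `min_len` cutoff are all hypothetical names and values chosen for illustration; the sketch also assumes that "arguments match" means at least one sufficiently long argument value appears verbatim in the cleaned text.

```python
from dataclasses import dataclass, field

# Illustrative set only; the project's real set of example templates may differ.
EXAMPLE_TEMPLATES = {"ux", "uxi", "usex"}


@dataclass
class TemplateCall:
    """Hypothetical stand-in for a parsed template node and its arguments."""
    name: str
    args: dict = field(default_factory=dict)


def looks_like_example(templates, cleaned_text,
                       example_templates=EXAMPLE_TEMPLATES, min_len=3):
    """Return True when the text can be accepted as an example early.

    Conditions (assumed reading of the description):
      1. the text contains exactly one template,
      2. that template is in the example templates set,
      3. at least one argument value of min_len or more characters
         appears verbatim in the cleaned (expanded) text.
    """
    if len(templates) != 1:
        return False
    tmpl = templates[0]
    if tmpl.name not in example_templates:
        return False
    # Skip very short values such as language codes ("en"), which could
    # match by accident.
    values = [
        v.strip() for v in tmpl.args.values()
        if isinstance(v, str) and len(v.strip()) >= min_len
    ]
    return any(v in cleaned_text for v in values)
```

If this returns False, the caller would fall through to the existing heuristics rather than rejecting the text outright.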

kristian-clausal commented 2 months ago

Meant to address #604

xxyzz commented 2 months ago

I noticed that the en edition code doesn't handle the zh-x template properly. This template displays the example text in both Traditional Chinese and Simplified Chinese forms, and it can also be used in etymology sections. Example page: 未雨綢繆

The zh edition code parses the HTML tags expanded from the template. The zh edition's template generates the same HTML tags as the en edition's, but because the zh code uses pydantic, it would need some changes to work in the en code; please see #613.

kristian-clausal commented 2 months ago

I introduced some exception-triggering bugs with this, so I want to fix those first and then leave it until next week (I'm taking Friday off and Thursday is a holiday); I'll take a look at the Chinese template then. EDIT: With the new timetable for generating kaikki (it's not a cron job and seems to be dynamic; I'll have to ask Tatu) and the new sites for the other editions, it takes longer than before for kaikki to update.