tatuylonen / wiktextract

Wiktionary dump file parser and multilingual data extractor

[en] extract forms data and literal meaning from "zh-forms" template #677

Closed xxyzz closed 2 weeks ago

xxyzz commented 2 weeks ago

The code is adapted from the zh edition code. The new function adds two tags ("Simplified Chinese", "Traditional Chinese") and a new "literal_meaning" JSON field. Literal meaning data is commonly provided on Chengyu pages.

The data is added to base_data because "zh-forms" appears above the POS sections, so it should be included in the data of every POS section that follows.
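To illustrate the base_data idea, here is a minimal sketch, not the actual wiktextract code. The `Form` and `WordEntry` classes, the `new_pos_entry` helper, and the placeholder strings are all hypothetical names made up for this example; only the two tags and the "literal_meaning" field come from this PR. It assumes each POS section starts from a deep copy of the shared base data:

```python
from copy import deepcopy
from dataclasses import dataclass, field


@dataclass
class Form:
    # Hypothetical stand-in for one extracted form and its tags.
    form: str
    tags: list[str] = field(default_factory=list)


@dataclass
class WordEntry:
    # Hypothetical stand-in for one word entry in the output.
    word: str
    pos: str = ""
    forms: list[Form] = field(default_factory=list)
    literal_meaning: str = ""


def new_pos_entry(base_data: WordEntry, pos: str) -> WordEntry:
    """Start a POS section entry from a deep copy of the shared base data."""
    entry = deepcopy(base_data)
    entry.pos = pos
    return entry


# Data gathered from "zh-forms" before any POS section is reached
# (placeholder strings, not real page data).
base_data = WordEntry(word="example")
base_data.forms.append(Form("trad-form", ["Traditional Chinese"]))
base_data.forms.append(Form("simp-form", ["Simplified Chinese"]))
base_data.literal_meaning = "placeholder literal meaning"

# Every following POS section inherits the forms and the literal meaning.
noun = new_pos_entry(base_data, "noun")
verb = new_pos_entry(base_data, "verb")
```

The deep copy is a design choice in this sketch: it keeps later changes to one POS entry from leaking into the others or back into the base data.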

Since the code is mostly the same as the zh edition code, no tests were added.

Example pages:

GitHub issue #676

xxyzz commented 2 weeks ago

"zh-forms" is usually used directly under the language title. https://en.wiktionary.org/wiki/機車 seems to be an outlier.