tatuylonen / wiktextract

Wiktionary dump file parser and multilingual data extractor

Hyphenation data missing in non-English editions #853

Open platinorum opened 1 month ago

platinorum commented 1 month ago

There is no data for hyphenation in the output file. This is possibly related to an old issue (#159).

Example: "apple" on Wiktionary vs. "apple" on kaikki

kristian-clausal commented 1 month ago

It's there. Open raw data and search for "hyphenation": it's present for the noun and the verb.
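The search described above can be scripted instead of done by hand. The following is a minimal sketch, assuming the raw data is a JSON Lines file (one JSON object per line) and that each record carries a "word" key. The helper names `find_key` and `scan_jsonl` and the sample records are made up for illustration and are not part of wiktextract. Because the search walks every nesting level, it makes no assumption about where an edition stores the field.

```python
import json
from typing import Any, Iterable, Iterator, List, Tuple


def find_key(node: Any, key: str) -> Iterator[Any]:
    """Yield every value stored under `key` at any nesting depth."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from find_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


def scan_jsonl(lines: Iterable[str], key: str) -> List[Tuple[Any, List[Any]]]:
    """Return (word, values) pairs for every record that contains `key`."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        values = list(find_key(record, key))
        if values:
            hits.append((record.get("word"), values))
    return hits


# Made-up records for illustration only; not copied from a real dump.
sample = [
    '{"word": "apple", "pos": "noun", "sounds": [{"hyphenation": "ap-ple"}]}',
    '{"word": "pomme", "pos": "noun", "sounds": [{"ipa": "/x/"}]}',
]
print(scan_jsonl(sample, "hyphenation"))
```

For a real file, pass an open file handle in place of `sample`, for example `scan_jsonl(open(path, encoding="utf-8"), "hyphenation")`, where `path` is whatever dump you downloaded. The same function works for other key names, such as an edition-specific field.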

platinorum commented 1 month ago

> It's there. Open raw data and search for "hyphenation": it's present for the noun and the verb.

You are right, it is in the English edition. Sorry for not testing properly. I checked the French, German, Spanish and Polish editions, and it does not seem to work in any of them.

xxyzz commented 1 month ago

The es edition has a "syllabic" field in its "sounds" lists. The de edition's "Worttrennung" section is currently not extracted. The fr and pl editions don't seem to have this kind of data.

Hyphenation data has been added to the es and de editions: #863, #864

kristian-clausal commented 1 month ago

Yeah, because the editions are each so different, data like hyphenation needs to be specially programmed into their respective extractors. In some languages, a separate hyphenation field makes no sense because word division follows predictable rules or syllable boundaries. Please keep in mind that English hyphenation data applies only to writing, specifically to how you are supposed to divide words at line boundaries. It is not actual phonetic or 'real' language data.
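To make the "line boundaries" point concrete, here is a small sketch of what hyphenation data is used for: given a word already split into its hyphenation parts, it lists every place the word could be broken at the end of a line. The function name `break_options` and the example split are hypothetical and not taken from any Wiktionary edition or from wiktextract output.

```python
from typing import List, Tuple


def break_options(parts: List[str]) -> List[Tuple[str, str]]:
    """List (head, tail) pairs for every allowed break between parts.

    `head` ends the current line (with a trailing hyphen) and `tail`
    starts the next line.
    """
    options = []
    for i in range(1, len(parts)):
        head = "".join(parts[:i]) + "-"
        tail = "".join(parts[i:])
        options.append((head, tail))
    return options


# Illustrative split only; real hyphenation points depend on the dictionary.
parts = ["hy", "phen", "a", "tion"]
for head, tail in break_options(parts):
    print(head, tail)
```

Nothing here says anything about pronunciation: the parts only mark where a typesetter may insert a line break.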