ssb22 / CedPane

Chinese-English Dictionary Public-domain Additions for Names Etc (CedPane)
http://ssb22.user.srcf.net/cedpane/
The Unlicense

Separator between two consecutive capital #10

Closed chinese-words-separator closed 2 years ago

chinese-words-separator commented 2 years ago
北京理工大學 北京理工大学 [Bei3 jing1_Li3 Gong1_Da4 xue2] /Beijing Institute of Technology/Institute of Technology, Beijing/

Should it be Bei3 jing1_Li3_Gong1_Da4 xue2? Otherwise it will be rendered as Běijīng LǐGōng Dàxué instead of Běijīng Lǐ Gōng Dàxué.

Same with:

古巴比倫 古巴比伦 [Gu3 Ba1 bi3 lun2] /ancient Babylon/Babylon, ancient/

That will be rendered as: GǔBābǐlún
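
To make the rendering concrete, here is a minimal Python sketch of the rule the examples in this thread imply: a space joins syllables within a word, an underscore becomes a space between words, and a hyphen is kept as written. This is not ChinaScribe's actual code. The names `mark_syllable` and `render` are hypothetical, and the tone-mark placement (a or e first, then the o of "ou", otherwise the last vowel) and the `u:` spelling for ü are assumptions based on common pinyin conventions.

```python
# Tone-marked forms for each vowel, indexed by tone number minus one.
TONE_MARKS = {
    "a": "āáǎà", "e": "ēéěè", "i": "īíǐì",
    "o": "ōóǒò", "u": "ūúǔù", "ü": "ǖǘǚǜ",
}

def mark_syllable(syl):
    """Convert one numbered syllable such as 'jing1' to 'jīng'."""
    if not syl or not syl[-1].isdigit():
        return syl  # nothing to convert (e.g. empty piece next to a hyphen)
    body, tone = syl[:-1], int(syl[-1])
    body = body.replace("u:", "ü").replace("U:", "Ü")  # assumed CEDICT-style ü
    if tone not in (1, 2, 3, 4):
        return body  # neutral tone: no mark
    lower = body.lower()
    if "a" in lower:
        idx = lower.index("a")
    elif "e" in lower:
        idx = lower.index("e")
    elif "ou" in lower:
        idx = lower.index("o")
    else:
        vowels = [i for i, c in enumerate(lower) if c in TONE_MARKS]
        if not vowels:
            return body
        idx = vowels[-1]  # otherwise mark the last vowel
    marked = TONE_MARKS[lower[idx]][tone - 1]
    if body[idx].isupper():
        marked = marked.upper()
    return body[:idx] + marked + body[idx + 1:]

def render(numbered):
    """Render a bracketed pinyin field: '_' -> space, ' ' -> join, '-' kept."""
    words = []
    for word in numbered.split("_"):
        tokens = word.split(" ")
        words.append("".join(
            "-".join(mark_syllable(s) for s in tok.split("-"))
            for tok in tokens
        ))
    return " ".join(words)

print(render("Bei3 jing1_Li3 Gong1_Da4 xue2"))   # Běijīng LǐGōng Dàxué
print(render("Bei3 jing1_Li3-Gong1_Da4 xue2"))   # Běijīng Lǐ-Gōng Dàxué
print(render("Gu3 Ba1 bi3 lun2"))                # GǔBābǐlún
```

Under these assumptions, the missing separator between two capitalised syllables is what produces the run-together output.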

ssb22 commented 2 years ago

Ah, that's a conversion issue. The original is Běijīng Lǐ-Gōng Dàxué, which should probably convert to Bei3 jing1_Li3-Gong1_Da4 xue2 in the ChinaScribe format, but I need to double-check if ChinaScribe can handle the hyphen.

ssb22 commented 2 years ago

Confirmed ChinaScribe 1.63 can cope with hyphens in its CEDICT-like format.

(ChinaScribe also now has a newer format of its own, which is closer to the first 5 columns of the main cedpane.txt if labelled Definitions Simplified Traditional Mandarin_1 Yale_1, except that the definitions need to be /-separated and alternate orders provided. Using this new format for ChinaScribe imports might help if CedPane improves on ChinaScribe's default Cantonese conversion. It also has a way to specify segmentation priority, although the level numbers are limited and you'd have to sync it with the rest of the dictionary; I often prefer to add "phrase" entries to solve common segmentation misses. But I'm not so sure it's a good idea to drop the old ChinaScribe format from CedPane now that a few other projects are using it. Fixing the hyphen conversion probably wouldn't do any harm though.)

chinese-words-separator commented 2 years ago

Confusing delimiters -_; I think they are meant to be _

e.g.,

安城鄉 安城乡 [An1 cheng2-_Xiang1] /Ancheng- Township/Township, Ancheng-/
北隍城鄉 北隍城乡 [Bei3 huang2 cheng2-_Xiang1] /Beihuangcheng- Township/Township, Beihuangcheng-/

ssb22 commented 2 years ago

Yes, there are incorrect extra hyphens in 50 township entries. They were there before in cedpane.txt; the fix to the ChinaScribe conversion merely exposed the existing problem in that format too. I'm not quite sure how this got in; probably some mistake in a one-off conversion script I used to add some townships to the rest of the data. Will fix in the next update.
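
One way to list such entries is sketched below in Python. It assumes the CEDICT-like layout shown in the examples in this thread (traditional, simplified, bracketed pinyin, then /-separated definitions); the function name `find_doubled_separators` and the regular expression are hypothetical and are not part of CedPane's tooling.

```python
import re

# Assumed line layout: TRAD SIMP [pinyin] /definition/definition/
ENTRY = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /(.*)/$")

def find_doubled_separators(lines):
    """Return entries whose pinyin field has a hyphen next to an underscore."""
    hits = []
    for line in lines:
        line = line.strip()
        m = ENTRY.match(line)
        if m and re.search(r"-_|_-", m.group(3)):
            hits.append(line)
    return hits

sample = [
    "安城鄉 安城乡 [An1 cheng2-_Xiang1] /Ancheng- Township/Township, Ancheng-/",
    "北隍城鄉 北隍城乡 [Bei3 huang2 cheng2-_Xiang1] /Beihuangcheng- Township/Township, Beihuangcheng-/",
    "古巴比倫 古巴比伦 [Gu3 Ba1 bi3 lun2] /ancient Babylon/Babylon, ancient/",
]
for entry in find_doubled_separators(sample):
    print(entry)
```

The sketch only reports matching lines and does not rewrite them, since the stray hyphen also appears in the English definitions of these entries.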

chinese-words-separator commented 2 years ago

I think the following..

八一鎮 八一镇 [Ba1-yi1 zhen4] /Ba-yi (town)/

..should be Ba1-yi1_zhen4, since zhen4 means town.

If there's no underscore, the pinyin will be compressed to Bā-yīzhèn, and there's no word yīzhèn.

ssb22 commented 2 years ago

No, I think this one is correct, because a town name plus 镇 doesn't usually have a space before the 镇 (rule 2.4). Yes, this does seem inconsistent with cities, but we should probably honour it anyway :)

ssb22 commented 2 years ago

Ah but the hyphen should not be there, it should be Bāyīzhèn

chinese-words-separator commented 2 years ago

> Ah but the hyphen should not be there, it should be Bāyīzhèn

Hm.. that seems to run counter to the examples given in 2.4. When I saw these examples, the zhen4 seemed to need to be separated:

Běijīng Shì (Beijing City), Dòngtíng Hú (Lake Dongting)

Then, on further reading, I saw particular guidelines for towns (zhen4), villages (cun1), etc.: the zhen4 and cun1 are attached to the names :)

So it's Bāyīzhèn 👍