ssb22 / CedPane

Chinese-English Dictionary Public-domain Additions for Names Etc (CedPane)
http://ssb22.user.srcf.net/cedpane/
The Unlicense

Words that are written in three different ways #53

Open chinese-words-separator opened 1 year ago

chinese-words-separator commented 1 year ago

CC-CEDICT:
乾淨 干净 [gan1 jing4] /clean/neat/
怎麼 怎么 [zen3 me5] /how?/what?/why?/
怎麽 怎么 [zen3 me5] /variant of 怎麼|怎么[zen3 me5]/

CedPane:
乾淨 干淨 [gan1 jing4] /clean/neat and tidy (variant)/
怎麼 怎麽 [zen3 me5] /how (variant)/

gānjìng is written as 乾淨, 干净, 干淨

Also, it looks like the CC-CEDICT convention is to put the variant (怎麽) in the left-hand (traditional) column, paired with the simplified form (怎么):
怎麽 怎么 [zen3 me5] /variant of 怎麼|怎么[zen3 me5]/

CedPane, on the other hand, puts the variant (怎麽) in the right-hand column, next to the traditional form (怎麼):
怎麼 怎麽 [zen3 me5] /how (variant)/

I'm not sure which dictionary has the traditional and simplified fields correct for words that are written in three different ways. I'm also not sure which convention for placing variants is better or more correct; maybe neither is, and it just needs to be consistent. If I'm not mistaken, though, the variants in CC-CEDICT are all in the left-hand column.

Sharing these findings.

Here are the others:

乾淨~干净~干淨    gānjìng
事實勝於雄辯~事实胜于雄辩~事實勝于雄辯    shìshí shèngyú xióngbiàn
亙古不變~亘古不变~亘古不變  gèngǔ-bùbiàn
先天與後天~先天与后天~先天與后天   xiāntiān yǔ hòutiān
千層麵~千层面~千層面 qiāncéngmiàn
商舖~商铺~商鋪    shāngpù
對摺~对折~对摺    duìzhé
小範圍~小范围~小范圍 xiǎofànwéi
市場佔有率~市场占有率~市場占有率   shìchǎng zhànyǒulǜ
復盤~复盘~複盤    fùpán
徵稅~征税~征稅    zhēngshuì
怎麼~怎么~怎麽    zěnme
採石場~采石场~采石場 cǎishíchǎng
揚穀~扬谷~揚谷    yánggǔ
旅遊團~旅游团~旅游團 lǚyóutuán
旅遊景點~旅游景点~旅游景點  lǚyóu jǐngdiǎn
書籤~书签~书籤    shūqiān
極端份子~极端分子~極端分子  jíduān fènzǐ
標準尺寸~标准尺寸~標准尺寸  biāozhǔn chǐcùn
橫槓~横杠~橫杠    hénggàng
武裝份子~武装分子~武裝分子  wǔzhuāng fènzǐ
準繩~准绳~准繩    zhǔnshéng
燈臺~灯台~燈台    dēngtái
生命體徵~生命体征~生命體征  shēngmìng tǐzhēng
穫獎者~获奖者~獲獎者 huòjiǎngzhě
空氣淨化器~空气净化器~空氣凈化器   kōngqì jìnghuàqì
純淨~纯净~纯淨    chúnjìng
網誌~网志~網志    wǎngzhì
縫製~缝制~縫制    féngzhì
縱慾~纵欲~纵慾    zòngyù
聯繫方式~联系方式~聯系方式  liánxì fāngshì
腦性麻痺~脑性麻痹~脑性麻痺  nǎoxìng mábì
蘇打餅乾~苏打饼干~蘇打餅干  sūdá bǐnggān
衝嚮~冲向~衝向    chōngxiàng
製氧機~制氧机~制氧機 zhìyǎngjī
複選框~复选框~復選框 fùxuǎnkuàng
規範性~规范性~規范性 guīfànxìng
註冊商標~注册商标~注冊商標  zhùcè shāngbiāo
資金槓桿~资金杠杆~资金槓杆  zījīn gànggǎn
迴旋處~回旋处~回旋處~迴旋处 huíxuánchù
那麼~那么~那麽    nàme
重覆性~重复性~重複性 chóngfùxìng
鋼製~钢制~鋼制    gāngzhì
除淨~除净~除凈    chújìng
電臺~电台~電台    diàntái
非標準~非标准~非標准 fēibiāozhǔn
鬆脫~松脱~鬆脱    sōngtuō
鹹豬手~咸猪手~咸豬手 xiánzhūshǒu
麵包機~面包机~面包機 miànbāojī
雕鴞~雕鸮~鵰鴞    diāoxiāo
丹稜縣~丹棱县~丹棱縣 Dānléng Xiàn
云城區~云城区~雲城區 Yúnchéng Qū
仙遊縣~仙游县~仙游縣 Xiānyóu Xiàn
佈蘭森~布兰森~布蘭森 Bùlánsēn
信豐縣~信丰县~信丰縣 Xìnfēng Xiàn
南嶽區~南岳区~南岳區 Nányuè Qū
南豐縣~南丰县~南丰縣 Nánfēng Xiàn
印臺區~印台区~印台區 Yìntái Qū
叢臺區~丛台区~叢台區 Cóngtái Qū
台東市~台东市~臺東市 Táidōng Shì
周杰倫~周杰伦~周傑倫 Zhōu Jiélún
咸豐縣~咸丰县~咸丰縣 Xiánfēng Xiàn
哈囉德~哈罗德~哈羅德 Hāluódé
國臺辦~国台办~國台辦 Guó-Tái-Bàn
大餘縣~大余县~大余縣 Dàyú Xiàn
宋干節~宋干节~宋乾節 Sònggānjié
宜豐縣~宜丰县~宜丰縣 Yífēng Xiàn
寶豐縣~宝丰县~寶丰縣 Bǎofēng Xiàn
岱嶽區~岱岳区~岱岳區 Dàiyuè Qū
崑士蘭~昆士兰~昆士蘭 Kūnshìlán
嶽塘區~岳塘区~岳塘區 Yuètáng Qū
嶽麓區~岳麓区~岳麓區 Yuèlù Qū
廣豐縣~广丰县~廣丰縣 Guǎngfēng Xiàn
張勳~张勋~張勛    Zhāng Xūn
後龍鎮~后龙镇~后龍鎮 Hòulóngzhèn
新豐縣~新丰县~新丰縣 Xīnfēng Xiàn
新豐鄉~新丰乡~新丰鄉 Xīnfēng Xiāng
於田縣~于田县~于田縣 Yútián Xiàn
於都縣~于都县~于都縣 Yúdū Xiàn
東豐縣~东丰县~東丰縣 Dōngfēng Xiàn
檯安縣~台安县~台安縣 Tái'ān Xiàn
民豐縣~民丰县~民丰縣 Mínfēng Xiàn
永豐縣~永丰县~永丰縣 Yǒngfēng Xiàn
沃羅湼日~沃罗涅日~沃羅涅日  Wòluónièrì
海澱區~海淀区~海淀區 Hǎidiàn Qū
海豐縣~海丰县~海丰縣 Hǎifēng Xiàn
清豐縣~清丰县~清丰縣 Qīngfēng Xiàn
漢臺區~汉台区~漢台區 Hàntái Qū
瀋陽~沈阳~沈陽    Shěnyáng
瀋陽市~沈阳市~沈陽市 Shěnyáng Shì
灣裡區~湾里区~灣里區 Wānlǐ Qū
甕安縣~瓮安县~瓮安縣 Wèng'ān Xiàn
甦仙區~苏仙区~蘇仙區 Sūxiān Qū
甦家屯區~苏家屯区~蘇家屯區  Sūjiātún Qū
當涂縣~当涂县~當塗縣 Dāngtú Xiàn
石臺縣~石台县~石台縣 Shítái Xiàn
礄口區~硚口区~硚口區 Qiáokǒu Qū
祿豐縣~禄丰县~祿丰縣 Lùfēng Xiàn
範縣~范县~范縣    Fàn Xiàn
綏稜縣~绥棱县~綏棱縣 Suíléng Xiàn
臺灣省~台湾省~台灣省 Táiwānshěng
興隆臺區~兴隆台区~興隆台區  Xīnglóngtái Qū
華東師範大學~华东师范大学~華東師范大學    Huádōng Shīfàn Dàxué
西豐縣~西丰县~西丰縣 Xīfēng Xiàn
豐南區~丰南区~丰南區 Fēngnán Qū
豐台區~丰台区~丰台區 Fēngtái Qū
豐寧~丰宁~丰寧    Fēngníng
豐寧縣~丰宁县~丰寧縣 Fēngníng Xiàn
豐滿區~丰满区~丰滿區 Fēngmǎn Qū
豐潤區~丰润区~丰潤區 Fēngrùn Qū
豐澤區~丰泽区~丰澤區 Fēngzé Qū
豐縣~丰县~丰縣    Fēng Xiàn
豐都縣~丰都县~丰都縣 Fēngdū Xiàn
豐鎮~丰镇~丰鎮    Fēngzhèn
豐鎮市~丰镇市~丰鎮市 Fēngzhèn Shì
豐順縣~丰顺县~丰順縣 Fēngshùn Xiàn
貞豐縣~贞丰县~貞丰縣 Zhēnfēng Xiàn
達坂城區~达坂城区~達阪城區  Dábǎnchéng Qū
金臺區~金台区~金台區 Jīntái Qū
錢鐘書~钱钟书~錢鍾書 Qián Zhōngshū
鍾祥~钟祥~鐘祥    Zhōngxiáng
鍾祥市~钟祥市~鐘祥市 Zhōngxiáng Shì
長豐縣~长丰县~長丰縣 Chángfēng Xiàn
陸豐~陆丰~陸丰    Lùfēng
陸豐市~陆丰市~陸丰市 Lùfēng Shì
雨花臺區~雨花台区~雨花台區  Yǔhuātái Qū
霑益縣~沾益县~沾益縣 Zhānyì Xiàn
餘干縣~余干县~余干縣 Yúgān Xiàn
餘慶縣~余庆县~余慶縣 Yúqìng Xiàn
餘杭區~余杭区~余杭區 Yúháng Qū
餘江縣~余江县~余江縣 Yújiāng Xiàn
鬱南縣~郁南县~郁南縣 Yùnán Xiàn
麟遊縣~麟游县~麟游縣 Línyóu Xiàn
龍遊縣~龙游县~龍游縣 Lóngyóu Xiàn
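
A list like the one above could be surfaced mechanically by linking every written form that shares a dictionary entry and a reading, then reporting the groups with three or more distinct forms. The following is only a minimal sketch under that assumption, not how the list above was actually produced; the function names are hypothetical and the sample lines are the ones quoted at the top of this comment.

```python
import re
from collections import defaultdict

# CEDICT-style line: TRADITIONAL SIMPLIFIED [pin1 yin1] /gloss/gloss/
ENTRY = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /(.*)/$")

def parse(line):
    """Return (traditional, simplified, pinyin, glosses) or None."""
    m = ENTRY.match(line.strip())
    if m is None:
        return None
    trad, simp, pinyin, glosses = m.groups()
    return trad, simp, pinyin, glosses.split("/")

def find_multi_form_words(lines, minimum=3):
    """Group forms linked by a shared entry and reading (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for line in lines:
        entry = parse(line)
        if entry is None:
            continue  # skip comments and malformed lines
        trad, simp, pinyin, _ = entry
        parent[find((trad, pinyin))] = find((simp, pinyin))

    groups = defaultdict(set)
    for form, pinyin in list(parent):
        groups[find((form, pinyin))].add(form)
    return [sorted(g) for g in groups.values() if len(g) >= minimum]

CC_CEDICT = [
    "乾淨 干净 [gan1 jing4] /clean/neat/",
    "怎麼 怎么 [zen3 me5] /how?/what?/why?/",
    "怎麽 怎么 [zen3 me5] /variant of 怎麼|怎么[zen3 me5]/",
]
CEDPANE = [
    "乾淨 干淨 [gan1 jing4] /clean/neat and tidy (variant)/",
    "怎麼 怎麽 [zen3 me5] /how (variant)/",
]

three_way = find_multi_form_words(CC_CEDICT + CEDPANE)
```

With the five sample lines, `three_way` holds two groups: the three spellings of gānjìng and the three spellings of zěnme. Keying on the reading as well as the form keeps unrelated homographs with different pinyin from being merged.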

ssb22 commented 1 year ago

OK so the topic of variants is more complicated than it should be to say the least😊

The first thing to note is that “traditional” and “simplified” do not really apply to a word, but to a character. But we can't just have a character normalisation table that says, for example, “every time you see 里 in Simplified, it’s 裡 in Traditional”, because it isn’t: it’s 裡 in Traditional only in the specific instances where it means “inside”, which is not always. Similarly, there are Traditional characters that might or might not map to specific Simplified characters depending on their context.

There are plenty of character mappings that really do work 100% of the time, which is why I have the --post-normalise option on my Annotator Generator. That reduces the download size of my Android apps, because I can eliminate alternate forms of most words just by feeding all input characters through a normalisation table before starting to split. But this does not work on all characters, so I still need to handle the others.

Thankfully, most Chinese text does not mix up Simplified and Traditional characters at random in the same word: it’ll be 为什么 (Simplified) or 為什麼 (Traditional), but not usually 为什麼 (Simplified first character, Traditional last character). That specific example can be handled by my normalisation table, but even if it couldn’t, it's usually safe to pretend that there exist Simplified Chinese words and Traditional Chinese words, and hence to have "Simplified" and "Traditional" fields at a word level in dictionaries. This also means software doesn't have to have a normalisation table: if you don't, you'll just have to ship a bit more data, and you might not cope so well with that one-in-a-million page where someone mixes Traditional and Simplified in the same word, usually due to some very old Traditional-to-Simplified conversion software having gone wrong.
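
To make the idea of a normalisation table concrete, here is a minimal sketch. It is not Annogen's code and does not show what --post-normalise actually does internally; the four table entries are illustrative assumptions, chosen because each is a one-to-one mapping. Context-dependent characters such as 乾 are deliberately left out, since 乾 stays 乾 in words like 乾坤.

```python
# Hypothetical table: only mappings assumed to be safe in every context.
SAFE_TABLE = str.maketrans({
    "為": "为",
    "學": "学",
    "國": "国",
    "們": "们",
})

def normalise(text):
    """Feed every character through the table; unknown characters pass through."""
    return text.translate(SAFE_TABLE)

def dedupe_after_normalising(words):
    """Collapse words that become identical once normalised (order preserved)."""
    seen = {}
    for word in words:
        seen.setdefault(normalise(word), word)
    return list(seen)

# Simplified, Traditional, and a mixed spelling all collapse to one form.
collapsed = dedupe_after_normalising(["学生们", "學生們", "學生们"])

# A word with no table entries is left for other rules to handle.
untouched = normalise("乾淨")
```

Here `collapsed` contains a single spelling, which is the data saving described above, while `untouched` is returned unchanged because neither of its characters is in the table.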

Now, enter variant characters. Unlike traditional and simplified, the variant characters really do get mixed in with non-variant ones almost at random (but usually on Traditional pages though: it's rarer to get variants on Simplified pages). So if we have a 5-character entry and 2 variants exist for each of 3 of those characters, then in theory there are 27 possible variations of that word—it's that bad. Thankfully though, the normalisation table does a much better job of ironing out the variants—and I haven't really found I needed to add so many "variant" entries since I implemented the table. The variant entries I did add were based on words I found in real texts (rather than "theoretically this is possible")—the process goes "try to read a text, oops it's getting split wrongly, can I add an entry to fix that". It's possible that some of those entries can now be removed if they'll be handled properly by the table anyway, but I didn't go out on a "crusade" to remove them all: after all, not all software has a normalisation table, and those entries are versions that I've actually seen in real text so we might as well keep them. (If you give Annogen a table, it automatically applies it to all the input text and de-duplicates the resulting rules, so it's not bothered by too many variants.) But on the other hand I've not been collecting so many new variants recently.
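
The count of 27 comes from three characters that each have three possible forms (the usual one plus two variants), giving 3 × 3 × 3 combinations. A small sketch of that arithmetic follows; the letters are placeholders standing in for characters, not real variant data, and `spellings` is a hypothetical helper.

```python
from itertools import product

def spellings(word, alternatives):
    """Every spelling obtained by swapping characters for their listed alternatives."""
    choices = [[ch] + list(alternatives.get(ch, "")) for ch in word]
    return ["".join(combo) for combo in product(*choices)]

# Placeholder 5-character word: A, C and E each have two variants, B and D have none.
ALTERNATIVES = {"A": "aα", "C": "cγ", "E": "eε"}
all_forms = spellings("ABCDE", ALTERNATIVES)
count = len(all_forms)  # 3 * 1 * 3 * 1 * 3
```

Normalising each variant back to one form before de-duplicating is what collapses these combinations into a single rule.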

Also: usually when I put "variant" I meant "this word contains at least one variant character", but occasionally I just meant "this is a really weird way of writing the word but I've seen it done". At some point I should probably check through and clarify which entry is which.

In the case of 干淨/乾淨 the "variant" part is the second character, 淨. The "normal" version of this (at least in this word) would be 净 in Simplified, 凈 in Traditional: one less little stroke in the middle of the left-hand side, easy to miss. So the logic I applied was: looking at the rest of the word, i.e. the first character, 干 is simplified and 乾 is traditional, so I will put 干淨 into "simplified" and 乾淨 into "traditional", and "variant" into the definition. Actually the labels "traditional" and "simplified" make less sense in the presence of variants (we are very much backward-fitting data into a format that wasn't originally designed for it), but at least we can make them apply to the non-variant part of the word.

In the case of 怎麽, we're using 麽 U+9EBD instead of 麼 U+9EBC. Tiny difference in strokes and only one digit different in Unicode numbers; I even have a book whose paper form has 麼 but whose electronic form has 麽, so I'm suspecting the variant 麽 arose from a typo in some old-character-set-to-Unicode conversion table that the publisher's software was using (but if publishers are putting this out then we do need to recognise it when it happens, typo or not—I'm not saying we have to recognise every typo that ever happens, but really common ones, such as ones arising from little bugs in publishers' character-set conversion tables that then got used in many books, might be worth addressing, hence the 怎麽 and 多麽 entries). Now, Wenlin's zidian lists the "wrong" U+9EBD 麽 as being a simplified character, and says its traditional equivalent is, guess what, 麼 U+9EBC. So we could view this word as being "a weird way of writing the Simplified word 怎么" that's equivalent to normal Traditional 怎麼. So I decided to put the variant form into the Simplified field in this case (putting it into the Traditional field would probably make Wenlin 3 not accept the entry, although I haven't actually tried this), and the Traditional field is just a copy of the normal word's Traditional field.
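
Because the two characters are nearly indistinguishable on screen, it can help to compare code points rather than glyphs. A small sketch using escapes for the two code points named above (the helper names are hypothetical):

```python
ME_USUAL = "\u9EBC"    # the form in 怎麼
ME_VARIANT = "\u9EBD"  # the look-alike discussed above

def codepoint(ch):
    """Format a single character as U+XXXX."""
    return f"U+{ord(ch):04X}"

def differing_positions(a, b):
    """Indexes at which two equal-length strings differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

labels = [codepoint(ME_USUAL), codepoint(ME_VARIANT)]
where = differing_positions("怎" + ME_USUAL, "怎" + ME_VARIANT)
```

Printing `labels` next to a suspicious word shows at a glance which of the two forms a text actually contains.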

chinese-words-separator commented 1 year ago

I researched (scratch that, I asked ChatGPT), and it looks like, in an ideal simplification world, no new spelling variants of a word should arise or be created any more. So with 干淨, a simplified-using person either did not get the memo and still uses 淨 instead of 净, or has chosen 淨 for other reasons. It is also possible that this kind of spelling comes from a traditional-using person. Take the word Taiwan: even people in Taiwan themselves use a simpler version of 臺; they keep 灣 for wan1 but use 台, writing 台灣. The spelling 台灣 is considered a variant in Taiwan, and is not a variant used in the simplified writing of mainland China, so it is not slotted into the simplified column in CC-CEDICT. Since even people in Taiwan see the convenience of using simplified versions of some characters, e.g. 台, I guess 干淨 could also be written by someone from Taiwan, so the 干淨 variant may belong in the traditional column instead.

CC-CEDICT
臺灣 台湾 [Tai2 wan1] /Taiwan/
台灣 台湾 [Tai2 wan1] /variant of 臺灣|台湾[Tai2 wan1]/

CedPane:
台灣 台湾 [Tai2 wan1] /Taiwan/

[Images: first, second, third and fourth inquiries]

Maybe these are the reasons why, as far as I can tell, most variants in CC-CEDICT are in the traditional column. It can also be attributed to CC-CEDICT volunteers deleting typos and misspellings, no matter how common they are in a writing system. Misspellings and typos aside, it can't be discounted that variants still do arise in simplified writing; as detailed by ChatGPT, variants also come from regional differences, from differences in cultural practices and beliefs, and from government-led initiatives, like in 2013.

ssb22 commented 1 year ago

ChatGPT is copying a wrong but popular misconception about the origin of "Simplified" Chinese characters. Although some new Simplified characters were indeed invented in the 1950s, many had been around for centuries before that, just not favoured by the "educated". A lot of what happened in the 1950s was taking the characters that the "unschooled" were already using and standardising them. In fact it appears that some of these simplified characters might have come on the scene even before their "traditional" equivalents, which is why some sinologists prefer to write "full form" and "simple form" instead of "traditional" and "simplified", so as not to make any implication about which one came first in any particular case.

ChatGPT is also getting a little confused and inconsistent about the exact meaning of "variant". I am thinking of the Chinese word 异体字, which has a narrower meaning (commonly-used simplified and traditional characters are not considered to be 异体字), and I'm also thinking of the Unicode Project's Han Unification data and its "variant" fields.

I still think it's a bit artificial to try to shoehorn variant forms into "simplified" and "traditional" columns. If I were defining the format again from scratch, maybe I'd have a list of versions of the word, with 2 extra bits on each item to flag which item(s) are common in simplified and which item(s) are common in traditional, some both, some neither. What we're doing with the current format is a compromise because the format makes us label everything as "simplified" or "traditional" and there's no column for "we don't have enough data to meaningfully say whether this is simplified or traditional". If you're writing a converter between simplified and traditional then I'd suggest setting it to de-prioritise any entry that says "variant" in its definition.
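
As a rough illustration of both ideas in this comment, here is a sketch of a "list of versions with two flag bits" record, plus a converter lookup that de-prioritises any entry with "variant" in its definition. All class, field and function names are hypothetical, and the flag values on the Taiwan example are illustrative guesses based on this thread, not verified usage data.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Form:
    text: str
    common_in_simplified: bool = False   # flag bit 1
    common_in_traditional: bool = False  # flag bit 2

@dataclass
class Word:
    pinyin: str
    definition: str
    forms: List[Form] = field(default_factory=list)

# Illustrative only: a form could also set both flags, or neither.
taiwan = Word("Tai2 wan1", "Taiwan", [
    Form("臺灣", common_in_traditional=True),
    Form("台灣", common_in_traditional=True),
    Form("台湾", common_in_simplified=True),
])

# Current two-column format: (traditional, simplified, pinyin, definition)
Entry = Tuple[str, str, str, str]

def to_simplified(entries: List[Entry], trad: str) -> Optional[str]:
    """Pick the simplified field, preferring entries not marked "variant"."""
    candidates = [e for e in entries if e[0] == trad]
    # False sorts before True, so non-variant entries come first (stable sort).
    candidates.sort(key=lambda e: "variant" in e[3])
    return candidates[0][1] if candidates else None

ENTRIES = [
    ("怎麼", "怎麽", "zen3 me5", "how (variant)"),
    ("怎麼", "怎么", "zen3 me5", "how?/what?/why?"),
]
best = to_simplified(ENTRIES, "怎麼")
```

Even though the variant entry is listed first, `best` is the ordinary simplified spelling, because the sort pushes "variant" entries to the back instead of discarding them.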

chinese-words-separator commented 1 year ago

> Although some new Simplified characters were indeed invented in the 1950s, many had been around for centuries before that, just not favoured by the "educated"

True, this reminds me of the topic "The TRUE Origins of Simplified Chinese". Instead of thinking that simplified characters were simplified, it's more accurate to think that some traditional characters were complexified from their simple form: some simplified characters have been there since time immemorial, and some of them predate the forms now considered "traditional".

Cloud: "Simplified" 云 = 1,200 BC; "Traditional" 雲 = 200 BC

爱 has been around since the Jin dynasty.

> I still think it's a bit artificial to try to shoehorn variant forms into "simplified" and "traditional" columns. If I were defining the format again from scratch, maybe I'd have a list of versions of the word, with 2 extra bits on each item to flag which item(s) are common in simplified and which item(s) are common in traditional, some both, some neither. What we're doing with the current format is a compromise because the format makes us label everything as "simplified" or "traditional" and there's no column for "we don't have enough data to meaningfully say whether this is simplified or traditional".

Agreed: slotting a variant spelling into either column is prone to misinterpretation.