Open chinese-words-separator opened 1 year ago
OK, so the topic of variants is more complicated than it should be, to say the least 😊
The first thing to note is that “traditional” and “simplified” do not really apply to a word, but to a character. But we can't just have a character normalisation table that says, for example, “every time you see 里 in Simplified, it’s 裡 in Traditional”, because it isn’t: it’s 裡 in Traditional only in the specific instances where it means “inside”, which is not always. Similarly, there are Traditional characters that might or might not map to specific Simplified characters depending on their context. There are plenty of character mappings that really do work 100% of the time, which is why I have the --post-normalise option on my Annotator Generator. That reduces the download size of my Android apps, because I can eliminate alternate forms of most words just by feeding all input characters through a normalisation table before starting to split. But this does not work on all characters, so I still need to handle the others.
Thankfully, most Chinese text does not mix up Simplified and Traditional characters at random in the same word: it’ll be 为什么 (Simplified) or 為什麼 (Traditional), but not usually 为什麼 (Simplified first character, Traditional last character). Well, that specific example can be handled by my normalisation table, but if it couldn’t, then it's usually safe to pretend that there exist Simplified Chinese words and Traditional Chinese words, and hence have "Simplified" and "Traditional" fields at a word level in dictionaries. Which also means software doesn't have to have a normalisation table: if you don't, you'll just have to ship a bit more data, and you just might not cope so well with that one-in-a-million page where someone mixes Traditional and Simplified in the same word, usually due to some very old Traditional-to-Simplified conversion software having gone wrong.
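As a rough illustration of the normalise-before-splitting idea, here is a minimal Python sketch. It is not Annogen's actual code: the table contents and the `normalise` name are made up for the example, and the three mappings are simply assumed to be context-free. 里 is left out on purpose because, as noted above, it only maps to 裡 in some contexts.

```python
# Hypothetical, deliberately tiny normalisation table.
# ASSUMPTION: every mapping listed here is safe regardless of context;
# a real table would need each character checked individually.
NORMALISE = str.maketrans({
    "為": "为",
    "麼": "么",
    "麽": "么",
})


def normalise(text: str) -> str:
    """Fold each character through the table; unlisted characters pass through unchanged."""
    return text.translate(NORMALISE)


# Simplified, Traditional, and the mixed form all collapse to one key,
# so a splitter would only need rules for that one form.
forms = ["为什么", "為什麼", "为什麼"]
normalised = {normalise(f) for f in forms}
print(normalised)

# Context-dependent characters are not in the table and stay as they are.
print(normalise("里"))
```

Anything the table cannot safely handle still has to be covered by separate dictionary entries.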
Now, enter variant characters. Unlike traditional and simplified, the variant characters really do get mixed in with non-variant ones almost at random (but usually on Traditional pages though: it's rarer to get variants on Simplified pages). So if we have a 5-character entry and 2 variants exist for each of 3 of those characters, then in theory there are 27 possible variations of that word—it's that bad. Thankfully though, the normalisation table does a much better job of ironing out the variants—and I haven't really found I needed to add so many "variant" entries since I implemented the table. The variant entries I did add were based on words I found in real texts (rather than "theoretically this is possible")—the process goes "try to read a text, oops it's getting split wrongly, can I add an entry to fix that". It's possible that some of those entries can now be removed if they'll be handled properly by the table anyway, but I didn't go out on a "crusade" to remove them all: after all, not all software has a normalisation table, and those entries are versions that I've actually seen in real text so we might as well keep them. (If you give Annogen a table, it automatically applies it to all the input text and de-duplicates the resulting rules, so it's not bothered by too many variants.) But on the other hand I've not been collecting so many new variants recently.
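To make the arithmetic concrete: each of the 3 affected characters can appear in its usual form or in either of its 2 variants, so there are 3 × 3 × 3 = 27 spellings. A small Python sketch of the enumeration follows; the `spellings` helper is hypothetical, and placeholder letters stand in for real characters so that no particular variant set is implied.

```python
from itertools import product


def spellings(word, variants):
    """Yield every spelling of `word`, where each character may be
    replaced by any of the variants listed for it."""
    # For each position: the original character plus its variants (if any).
    options = [[ch] + variants.get(ch, []) for ch in word]
    for combo in product(*options):
        yield "".join(combo)


# Placeholder data: a 5-character "word" in which A, C and E
# each have 2 variant forms, while B and D have none.
variants = {"A": ["a", "á"], "C": ["c", "ç"], "E": ["e", "é"]}
all_forms = list(spellings("ABCDE", variants))
print(len(all_forms))  # 3 * 1 * 3 * 1 * 3 = 27
```

Normalising the input first avoids having to store all of these as separate entries.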
Also: usually when I put "variant" I meant "this word contains at least one variant character", but occasionally I just meant "this is a really weird way of writing the word but I've seen it done". At some point I should probably check through and clarify which entry is which.
In the case of 干淨/乾淨 the "variant" part is the second character, 淨. The "normal" version of this (at least in this word) would be 净 in Simplified, 凈 in Traditional: one less little stroke in the middle of the left-hand side, easy to miss. So the logic I applied was: looking at the rest of the word, i.e. the first character, 干 is simplified and 乾 is traditional, so I will put 干淨 into "simplified", 乾淨 into "traditional", and "variant" into the definition. Actually the labels "traditional" and "simplified" make less sense in the presence of variants (we are very much backward-fitting data into a format that wasn't originally designed for it), but at least we can make them apply to the non-variant part of the word.
In the case of 怎麽, we're using 麽 U+9EBD instead of 麼 U+9EBC. Tiny difference in strokes and only one digit different in Unicode numbers; I even have a book whose paper form has 麼 but whose electronic form has 麽, so I'm suspecting the variant 麽 arose from a typo in some old-character-set-to-Unicode conversion table that the publisher's software was using (but if publishers are putting this out then we do need to recognise it when it happens, typo or not—I'm not saying we have to recognise every typo that ever happens, but really common ones, such as ones arising from little bugs in publishers' character-set conversion tables that then got used in many books, might be worth addressing, hence the 怎麽 and 多麽 entries). Now, Wenlin's zidian lists the "wrong" U+9EBD 麽 as being a simplified character, and says its traditional equivalent is, guess what, 麼 U+9EBC. So we could view this word as being "a weird way of writing the Simplified word 怎么" that's equivalent to normal Traditional 怎麼. So I decided to put the variant form into the Simplified field in this case (putting it into the Traditional field would probably make Wenlin 3 not accept the entry, although I haven't actually tried this), and the Traditional field is just a copy of the normal word's Traditional field.
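For anyone who wants to check which of the two characters a given text contains, a short Python sketch like the following prints the code points. It also shows that standard Unicode normalisation (NFKC is used here as an example) does not fold one into the other, which is why a separate table or dictionary entry is needed.

```python
import unicodedata

standard = "\u9ebc"  # 麼, the usual Traditional character
variant = "\u9ebd"   # 麽, the variant discussed above

for ch in (standard, variant):
    print(ch, f"U+{ord(ch):04X}")

# The two code points are adjacent.
print(ord(variant) - ord(standard))  # 1

# Unicode normalisation leaves the variant untouched.
print(unicodedata.normalize("NFKC", variant) == standard)  # False
```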
I researched (scratch that, I asked ChatGPT), and it looks like in an ideal simplification world no new spelling variants of a word should arise or be created any more. So when a Simplified writer produces 干淨, either that person did not get the memo and still uses 淨 instead of 净, or they chose 淨 for reasons of their own. It is also possible that this kind of spelling comes from a Traditional writer. Take the word Taiwan: even people in Taiwan use a simpler form of 臺. They keep 灣 for wan1, but they write 台 for the first character, giving 台灣. The spelling 台灣 is considered a variant in Taiwan, and it is not a variant in the Simplified writing of mainland China, so CC-CEDICT does not slot it into the Simplified column. Since people in Taiwan evidently see the convenience of using simpler versions of some characters (e.g. 台), I guess 干淨 could also be written by someone from Taiwan, so the 干淨 variant may belong in the Traditional column instead.
CC-CEDICT
臺灣 台湾 [Tai2 wan1] /Taiwan/
台灣 台湾 [Tai2 wan1] /variant of 臺灣|台湾[Tai2 wan1]/
CedPane:
台灣 台湾 [Tai2 wan1] /Taiwan/
Maybe these are the reasons why, as far as I can tell, most variants in CC-CEDICT are in the Traditional column. It may also be because CC-CEDICT volunteers delete typos and misspellings, no matter how common they are in a writing system. Misspellings and typos aside, it can't be discounted that variants still arise in Simplified writing; as ChatGPT detailed, variants also come from regional differences, from differences in cultural practices and beliefs, and from government-led initiatives, like in 2013
ChatGPT is copying a wrong but popular misconception about the origin of "Simplified" Chinese characters. Although some new Simplified characters were indeed invented in the 1950s, many had been around for centuries before that, just not favoured by the "educated". A lot of what happened in the 1950s was taking the characters that the "unschooled" were already using and standardising them. In fact it appears that some of these simplified characters might have come on the scene even before their "traditional" equivalents, which is why some sinologists prefer to write "full form" and "simple form" instead of "traditional" and "simplified", so as not to make any implication about which one came first in any particular case.
ChatGPT is also getting a little confused and inconsistent about the exact meaning of "variant". I am thinking of the Chinese word 异体字, which has a narrower meaning (commonly-used simplified and traditional characters are not considered to be 异体字), and I'm also thinking of the Unicode Project's Han Unification data and its "variant" fields.
I still think it's a bit artificial to try to shoehorn variant forms into "simplified" and "traditional" columns. If I were defining the format again from scratch, maybe I'd have a list of versions of the word, with 2 extra bits on each item to flag which item(s) are common in simplified and which item(s) are common in traditional, some both, some neither. What we're doing with the current format is a compromise because the format makes us label everything as "simplified" or "traditional" and there's no column for "we don't have enough data to meaningfully say whether this is simplified or traditional". If you're writing a converter between simplified and traditional then I'd suggest setting it to de-prioritise any entry that says "variant" in its definition.
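Purely as a sketch of what such a from-scratch format might look like (this is not an existing CedPane or CC-CEDICT structure, and the `Usage` and `Entry` names are invented for illustration), each written form could carry two flag bits:

```python
from dataclasses import dataclass, field
from enum import Flag, auto


class Usage(Flag):
    """Two bits per written form; a form can have both set, or neither."""
    NONE = 0
    SIMPLIFIED = auto()
    TRADITIONAL = auto()


@dataclass
class Entry:
    pinyin: str
    definition: str
    # Maps each written form of the word to where it is common.
    forms: dict = field(default_factory=dict)

    def common_in(self, usage):
        """Return the forms flagged as common in the given script."""
        return [form for form, flags in self.forms.items() if usage in flags]


# Illustrative flags only: NONE stands for "not enough data to say".
entry = Entry(
    pinyin="zen3 me5",
    definition="how",
    forms={
        "怎么": Usage.SIMPLIFIED,
        "怎麼": Usage.TRADITIONAL,
        "怎麽": Usage.NONE,
    },
)
print(entry.common_in(Usage.SIMPLIFIED))   # ['怎么']
print(entry.common_in(Usage.TRADITIONAL))  # ['怎麼']
```

With this shape a variant never has to be forced into one of two columns; it is just another form whose flags may be empty.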
Although some new Simplified characters were indeed invented in the 1950s, many had been around for centuries before that, just not favoured by the "educated"
True, this reminds me of the topic "The TRUE Origins of Simplified Chinese". Instead of thinking that simplified characters were simplified, it's more accurate to think that some traditional characters were complexified from their simple form: some simplified characters have been there since time immemorial, and some of them predate the forms now considered "traditional".
Cloud: "Simplified" 云 = 1,200 BC; "Traditional" 雲 = 200 BC
爱 has been around since the Jin dynasty
I still think it's a bit artificial to try to shoehorn variant forms into "simplified" and "traditional" columns. If I were defining the format again from scratch, maybe I'd have a list of versions of the word, with 2 extra bits on each item to flag which item(s) are common in simplified and which item(s) are common in traditional, some both, some neither. What we're doing with the current format is a compromise because the format makes us label everything as "simplified" or "traditional" and there's no column for "we don't have enough data to meaningfully say whether this is simplified or traditional".
Agreed, slotting a variant form into either column is prone to misinterpretation.
CC-CEDICT:
乾淨 干净 [gan1 jing4] /clean/neat/
怎麼 怎么 [zen3 me5] /how?/what?/why?/
怎麽 怎么 [zen3 me5] /variant of 怎麼|怎么[zen3 me5]/
CedPane:
乾淨 干淨 [gan1 jing4] /clean/neat and tidy (variant)/
怎麼 怎麽 [zen3 me5] /how (variant)/
gānjìng is written as 乾淨, 干净, 干淨
Also, it looks like CC-CEDICT's convention is to put the variant (怎麽) to the left of the simplified form (怎么): 怎麽 怎么 [zen3 me5] /variant of 怎麼|怎么[zen3 me5]/
Whereas CedPane puts the variant (怎麽) to the right of the traditional form (怎麼): 怎麼 怎麽 [zen3 me5] /how (variant)/
I'm not sure which dictionary has both the traditional and the simplified fields right for words that are written in three different ways. I'm also not sure which convention for mapping variants is better or more correct; maybe there is no better or correct way when it comes to variants, and it just needs to be consistent. If I'm not mistaken, though, the variants in CC-CEDICT are all on the left side.
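Whichever column a dictionary uses, a consumer can sidestep the question by checking the definition for the word "variant" and ranking those entries last, as suggested earlier in the thread. Below is a minimal Python sketch, assuming the line layout shown in the entries above (two space-separated forms, pinyin in square brackets, definitions between slashes). The `parse` and `is_variant` helpers are hypothetical, and the sample list mixes CC-CEDICT and CedPane lines only for demonstration.

```python
import re

# First form, second form, [pinyin], then /definition/definition/.
LINE = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /(.*)/$")


def parse(line):
    """Split one dictionary line into its fields, or return None if it does not match."""
    m = LINE.match(line.strip())
    if m is None:
        return None
    trad, simp, pinyin, defs = m.groups()
    return {"trad": trad, "simp": simp, "pinyin": pinyin, "defs": defs.split("/")}


def is_variant(entry):
    """True if any definition mentions 'variant' (covers both conventions above)."""
    return any("variant" in d for d in entry["defs"])


lines = [
    "怎麽 怎么 [zen3 me5] /variant of 怎麼|怎么[zen3 me5]/",  # CC-CEDICT style
    "怎麼 怎么 [zen3 me5] /how?/what?/why?/",                # CC-CEDICT style
    "怎麼 怎麽 [zen3 me5] /how (variant)/",                  # CedPane style
]
entries = [parse(line) for line in lines]

# False sorts before True, so non-variant entries come first;
# the sort is stable, so variants keep their original relative order.
ranked = sorted(entries, key=is_variant)
for e in ranked:
    print(e["trad"], e["simp"], "variant" if is_variant(e) else "standard")
```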
Sharing these findings
Here are the others: