Closed chinese-words-separator closed 2 years ago
Having a mechanism to indicate that a given compound (e.g., 都会, metropolis) also has a non-combined reading would spare dictionary makers from having to supply an alternative non-compound reading for it. Writing alternative meanings for the non-compound reading of every such compound would be cumbersome or redundant.
For Chinese segmentation software, the mechanism can show the most common use of a given compound. Of course, segmentation software that doesn't have any AI should still list city/metropolis even when 都 and 会 are shown separately. That is easier to accomplish than giving the software AI that can figure out whether 都会 is really 都会 or 都 会:
我 都 会 喜欢 你
Good idea to do something about those “shouldn't often be combined” dictionary entries (my favourite annoyance is ABC's 人和=harmonious relations).
The main problem is supporting all the different formats, including Pleco and Wenlin. I guess our options are:
Right now I'm leaning toward option 2, but perhaps we should think about this more first.
RE:
Add some “fake” dictionary entries, like 都会=dōu huì=will all (to balance out the “dūhuì, metropolis” reading in ABC etc).
I think 米高's "metres tall" is another example of a fake dictionary entry. I asked my wife whether 米高's "metres tall" makes sense, and gave her this phrase:
两米高衣柜
She said the metre 米 is more attached to 两 than to 高. Thus, this is how it should be read:
两米 高 衣柜
or even this:
两 米 高 衣柜
Rather than
两 米高 衣柜
Of course we will not provide copious entries like 两米, 三米, 我都, and so on just to overcome words that should not often be combined, like 都会 and 米高. We would rather see segmentation done like this:
两 米 高 衣柜
RE option 2:
we can add a word to the new file even before we have a proper CedPane entry for it (good for handling awkward words from other dictionaries)
Yes, that's definitely a pro. I'm leaning toward option 2 now. Awkward words need to be de-emphasized so that they are not listed first in the dictionary software's definitions. And because it is a separate file, we can fight those awkward words while the jury is still out on whether something should even be defined, e.g., 家的 (old wife), 都会 (all will), 米高 (metres tall): we can de-emphasize those awkward definitions without needing to create equally awkward ones. We can just have a list of words that signals to segmentation software that combining those consecutive characters is optional, so that it lists the most usual use (separate 都 and 会) before the unusual one (都会, metropolis), e.g.,
都 = all
会 = will
都会 = metropolis
The segmentation software will often (always, if the software is not employing AI) display 都会 in its non-combined form:
我 都 会 喜欢 你
It's more embarrassing for segmentation software makers to display 我 都会 喜欢 你 (I metropolis like you) than to display 大 都 会 很 干净. Without the shouldn't-often-be-combined file, it is hard or impossible to signal to learners of Chinese that there is a potential garden path in the sentence they are reading if the segmentation software has already asserted that the 都会 in the sentence 我都会喜欢你 is 都会 and not 都 会.
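The effect of such a file on a segmenter without any AI can be sketched as below. This is only an illustration, assuming a greedy forward longest-match segmenter; the function name `segment`, the `split_list` parameter, and the tiny dictionary are hypothetical and not taken from any existing tool.

```python
def segment(text, dictionary, split_list=frozenset(), max_len=4):
    """Greedy forward longest-match segmentation.

    A candidate is joined only if it is in the dictionary and is NOT on the
    (hypothetical) shouldn't-often-be-combined list.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            joinable = candidate in dictionary and candidate not in split_list
            if length == 1 or joinable:
                words.append(candidate)
                i += length
                break
    return words


# Toy dictionary, for illustration only.
dictionary = {"我", "你", "都", "会", "都会", "喜欢"}

print(" ".join(segment("我都会喜欢你", dictionary)))
print(" ".join(segment("我都会喜欢你", dictionary, split_list={"都会"})))
```

Without the list the sketch joins 都会; with 都会 on the list it falls back to 都 and 会, leaving the dictionary side free to still show the metropolis reading.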
And if we have that separate shouldn't-often-be-combined file without definitions, we don't need to create a "metres tall" definition for 米高 to balance out its Michael definition. Segmentation software is then not forced to combine 米 and 高:
两 米 高 衣柜
If the shouldn't-often-be-combined database includes 米高, segmentation software makers can choose to display it separately as 米 高, and dictionary software makers can choose to de-emphasize 米高's Michael as well, so 米 高 will be displayed by dictionaries as:
米 = metre
高 = tall
米高 = Michael
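The display order described above could be produced along these lines. This is a minimal sketch, assuming the dictionary is a plain mapping from headword to gloss; `lookup`, `definitions`, and `split_list` are hypothetical names used only for illustration.

```python
def lookup(word, definitions, split_list):
    """Return (headword, gloss) pairs in display order.

    For words on the split list, the single-character readings come first
    and the compound reading is listed last.
    """
    entries = []
    if word in split_list:
        entries += [(ch, definitions[ch]) for ch in word if ch in definitions]
    if word in definitions:
        entries.append((word, definitions[word]))
    return entries


# Toy data, for illustration only.
definitions = {"米": "metre", "高": "tall", "米高": "Michael"}

for headword, gloss in lookup("米高", definitions, split_list={"米高"}):
    print(f"{headword} = {gloss}")
```

A word that is not on the list is returned with its compound entry only, so existing behaviour for ordinary words stays the same.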
Currently, the fake "metres tall" entry is just a workaround to keep segmentation software from displaying Michael only. "Metres tall" can be considered a fake entry: in the brains of native Chinese speakers there are the discrete words 两, 米, and 高, and the discrete word 米高 is Michael, not metres tall.
I'm looking forward to the optimal solution to this problem; I'm leaning toward option 2 as well.
Thanks!
Watching Netflix with Chinese words separator, it's embarrassing to see compound words that usually should be read as separate words :) Ironically, the software is called Chinese words separator, but it combines words more often than it separates them.
Here's the initial list of compound words that I compiled that usually should be read as separate words:
米高
到了
家的
都会
都會
中的
美的
得了
大树
大樹
面的
把門
把门
的话
的話
有了
那是
把头
把頭
才能
几号
幾號
我去
妈的
媽的
你等
一声
一聲
会要
道外
不想
你妈
你媽
So far I haven't encountered compound words of three or more characters that fall out of context in the sentence they are in. The entries in the list above are all two-character compounds.
After including 你妈 on the split list:
Thanks. Indeed "text segmentation" (or splitting or separation) does seem to be a misnomer, because all segmentation algorithms are really joiners, not splitters. If you give the code a completely empty dictionary, it would presumably default to calling every single character a separate word (unless for some reason you've programmed it to treat runs of unknown characters as new words, which is likely to be incorrect far more often than it's correct), so the real task is to figure out how and when words should be joined together. But few learners would know what you meant if you said you had a Chinese word joiner....
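The "joiner" view can be made concrete with a small sketch. It assumes the simplest possible joiner, one that only merges adjacent pairs of characters found in the dictionary; `join_words` is a hypothetical name and real segmenters are more elaborate.

```python
def join_words(text, dictionary):
    """Start from single characters and join an adjacent pair only when the
    pair is a dictionary entry (two-character words only, for brevity)."""
    words = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in dictionary:
            words.append(pair)
            i += 2
        else:
            words.append(text[i])
            i += 1
    return words


# An empty dictionary leaves every character as its own word.
print(" ".join(join_words("两米高衣柜", set())))
# Each entry added to the dictionary can only cause more joining.
print(" ".join(join_words("两米高衣柜", {"衣柜"})))
print(" ".join(join_words("两米高衣柜", {"衣柜", "米高"})))
```

The last call shows how a single awkward entry such as 米高 is enough to produce the unwanted 两 米高 衣柜 reading.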
Reopening as the auto commit script somehow put only 2 words in the word-overrides file; there should have been more
The syntax is just adding an additional / to the definition, e.g.,
Generally, a phrase such as 都会 occurs in a sentence as 'all will', not as metropolis (some dictionaries)
中的 mostly refers to China's
把头 mostly refers to someone's head, rather than 'labor contractor' or 'gangmaster' (some dictionaries)
https://context.reverso.net/翻译/中文-英语/把头
Most of the examples of 把头 on Reverso's list are about someone's head rather than 'labor contractor' or 'gangmaster'. Out of its hundred examples, which mostly use 把头 as someone's head, I found only two that refer to 把头 as gangmaster, e.g.,
Dictionary software should optimize for common use
With a mechanism in the CedPane database that signifies a word can be read in non-combined form, dictionary software can give the language learner the list of possible interpretations of the non-combined form, or simply include the meaning of each hanzi of the combined word.
I would rather see 都会 be shown as this list (listing metropolis/city last)..
..than this (most browser extensions, e.g., Zhongwen):
The advantage of just appending a slash is that it will not badly affect existing dictionary software (e.g., the Zhongwen extension). With no changes to Zhongwen's code, it will render the additional slash on 都会 as another semicolon:
For dictionary software that makes use of the additional information that a combined word can be read in its non-combined form, it can surely help learners of Chinese to optimize how they read and learn Chinese.
When the syntax is accepted, I'll help add that information to CedPane's vocabulary; it's just adding a slash anyway.
Here is another list I collected of words that can be interpreted in their non-combined form:
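How software might read the extra slash can be sketched as follows. This is an assumption-laden illustration: it assumes CC-CEDICT-style lines and assumes the marker is a doubled slash at the end of the definitions, which this thread does not spell out. `parse_entry` and the `splittable` field are hypothetical names.

```python
import re

# Assumed CC-CEDICT-style layout: TRAD SIMP [pinyin] /gloss 1/gloss 2/
ENTRY = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /(.*)/$")


def parse_entry(line):
    """Parse one entry; an empty gloss (from the extra slash) is treated as
    the 'can be read in non-combined form' marker."""
    match = ENTRY.match(line.strip())
    if match is None:
        raise ValueError(f"not a CEDICT-style entry: {line!r}")
    trad, simp, pinyin, body = match.groups()
    fields = body.split("/")
    return {
        "traditional": trad,
        "simplified": simp,
        "pinyin": pinyin,
        "glosses": [f for f in fields if f],
        "splittable": "" in fields,
    }


plain = parse_entry("都會 都会 [du1 hui4] /city/metropolis/")
marked = parse_entry("都會 都会 [du1 hui4] /city/metropolis//")
print(plain["splittable"], marked["splittable"])

# Software unaware of the marker that simply joins the fields with "; "
# ends up with one extra separator and otherwise behaves as before.
print("; ".join("city/metropolis/".split("/")))
```

The glosses are identical for both lines, so software that ignores the flag loses nothing, while software that reads it can reorder or de-emphasize the compound reading.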
As an aside, I asked my wife (who is Chinese) whether she knows the word 家的 (defined as 'old wife' in some dictionaries); she said she doesn't know 家的 as 'old wife'. Even my OS's pinyin input method does not list 家的 for jiade; it shows five other words that match jiade (e.g., 假的, 架得). If the syntax is accepted, I will surely add a slash to 家的, since it mostly means family's anyway. For dictionary software that recognizes the additional syntax, 家的 will be rendered as:
Listing (old wife) last