ssb22 / CedPane

Chinese-English Dictionary Public-domain Additions for Names Etc (CedPane)
http://ssb22.user.srcf.net/cedpane/
The Unlicense

Proposing a syntax that signifies a word can also be read in its non-combined form #11

Closed chinese-words-separator closed 2 years ago

chinese-words-separator commented 2 years ago

The syntax is just an additional / appended to the definition, e.g.,

都會 都会 [du1 hui4] /city/metropolis//
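
A minimal sketch of how dictionary software might detect the proposed marker, assuming the usual CEDICT-style line layout (traditional, simplified, bracketed pinyin, slash-separated definitions). The names `parse_entry` and `often_separate` are hypothetical and are not part of CedPane or any existing tool:

```python
import re

# CEDICT-style line: TRAD SIMP [pin1 yin1] /def1/def2/
CEDICT_LINE = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /(.*)/$")

def parse_entry(line):
    """Parse one line; a trailing '//' (the proposed marker) sets often_separate."""
    match = CEDICT_LINE.match(line.strip())
    if match is None:
        raise ValueError("not a CEDICT-style line: " + line)
    trad, simp, pinyin, body = match.groups()
    parts = body.split("/")
    # With the extra slash, splitting leaves one empty item at the end.
    often_separate = parts[-1] == ""
    definitions = [part for part in parts if part]
    return {
        "trad": trad,
        "simp": simp,
        "pinyin": pinyin,
        "definitions": definitions,
        "often_separate": often_separate,
    }

entry = parse_entry("都會 都会 [du1 hui4] /city/metropolis//")
print(entry["definitions"], entry["often_separate"])
```

Software that does not know about the marker would simply see an empty final definition, which is why the change is backward compatible.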

Generally, a phrase such as 都会 most often occurs in a sentence as 'all will', not as 'metropolis' (the reading given in some dictionaries)

中的 mostly refers to China's

把头 mostly refers to someone's head, rather than 'labor contractor' or 'gangmaster' (some dictionaries)

https://context.reverso.net/翻译/中文-英语/把头

Most of the examples of 把头 in Reverso's list are about someone's head rather than 'labor contractor' or 'gangmaster'. Out of its hundred examples that use 把头 to mean someone's head, I found only two examples on Reverso that refer to 把头 as gangmaster, e.g.,

他们不会把头和小兵都抓起来
'They don't arrest the kingpin and his underlings.'
我们都把头剃了 看着会像光头党的
'When you get out of here, we'll all shave our heads and look like a gang of skinheads.'

Dictionary software should optimize for common use

With a mechanism in the CedPane database that signifies a word can be read in its non-combined form, dictionary software can give the language learner the list of possible interpretations of the non-combined form, or simply include the meaning of each hanzi of the combined word

I would rather see 都会 be shown as this list (listing metropolis/city last)..

都 dōu
all; both; entirely; (used for emphasis) even; already; (not) at all
都 dū
capital city; metropolis
会 huì
can (i.e. have the skill, know how to); likely to; sure to; to meet; to get together; meeting; gathering; union;
group; association; a moment (Taiwan pr. for this sense is [hui3])
会 kuài
to balance an account; accountancy; accounting
都会都會 dū huì
city; metropolis

..than this (most browser extensions, e.g., Zhongwen):

都会都會 dū huì
city; metropolis
都 Dū
surname Du
都 dōu
all; both; entirely; (used for emphasis) even; already; (not) at all
都 dū
capital city; metropolis

The advantage of just appending a slash is that it will not badly affect existing dictionary software (e.g., the Zhongwen extension). With no changes to Zhongwen's code, it will render the additional slash on 都会 as another semicolon:

都会都會 dū huì
city; metropolis;

For dictionary software that makes use of the additional information that a combined word can be read in its non-combined form, it can surely help learners of Chinese optimize how they read and learn Chinese
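
As a rough illustration of the ordering described above, here is a sketch of how software aware of the marker might order a pop-up: single characters first and the compound last when the entry is flagged, and the compound first otherwise. The function `display_order` and the abbreviated `CHAR_DEFS` table are hypothetical and only serve the example:

```python
# Abbreviated single-character definitions, for illustration only.
CHAR_DEFS = {
    "都": "all; both; entirely",
    "会": "can; likely to; sure to",
}

def display_order(word, compound_defs, char_defs, often_separate):
    """Return (headword, definitions) pairs in the order they should be shown."""
    compound = [(word, compound_defs)]
    singles = [(ch, char_defs[ch]) for ch in word if ch in char_defs]
    # Flagged entries list the individual characters before the compound.
    return singles + compound if often_separate else compound + singles

for headword, definitions in display_order("都会", "city; metropolis", CHAR_DEFS, True):
    print(headword, "=", definitions)
```

Passing `often_separate=False` reproduces the compound-first order that most browser extensions use today.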

If the syntax is accepted, I'll help add that information to CedPane's vocabulary; it's just adding a slash anyway

Here is a list of other words I collected that can be interpreted in their non-combined form:

米高
到了
家的
都会
都會
中的
美的
得了
大树
大樹
面的
把門
把门
的话
的話
有了
那是
把头
把頭
才能

As an aside, I asked my wife (who is Chinese) if she knows the word 家的 (defined as 'old wife' in some dictionaries); she said she doesn't know 家的 as 'old wife'. Even my OS's pinyin input method does not list 家的 for jiade; it shows five other words that match jiade (e.g., 假的, 架得). If the syntax is accepted, I will surely add a slash to 家的, as it mostly means "family's" anyway. For dictionary software that recognizes the additional syntax, 家的 will be rendered as:

家 jiā
home; family; (polite) my (sister, uncle etc); classifier for families or businesses; refers to the philosophical
schools of pre-Han China; noun suffix for a specialist in some activity, such as a musician or revolutionary,
corresponding to English -ist, -er, -ary or -ian;
的 de
of; ~'s (possessive particle); (used after an attribute); (used to form a nominal expression); (used at the end of
a declarative sentence for emphasis); also pr. [di4] or [di5] in poetry and songs
的 dī
see 的士 [di1 shi4]
的 dí
really and truly
aim; clear
家的 jiā de
(old) wife

Listing (old wife) last

chinese-words-separator commented 2 years ago

Having a mechanism to indicate that a given compound (e.g., 都会, metropolis) also has a non-combined reading removes the need to force dictionary makers to provide an alternative definition for that non-combined reading, which would be cumbersome to write or simply redundant

And for Chinese segmentation software, it can show the most common use of a given compound. Of course, Chinese segmentation software that doesn't have any AI should still list city/metropolis even on the separated 都 and 会. That is easier to accomplish than giving the segmentation software an AI that can figure out whether 都会 is really 都会 or 都 会

我 都 会 喜欢 你

ssb22 commented 2 years ago

Good idea to do something about those “shouldn't often be combined” dictionary entries (my favourite annoyance is ABC's 人和=harmonious relations).

The main problem is supporting all the different formats, including Pleco and Wenlin. I guess our options are:

  1. Your suggestion of putting a hidden code into the definition, like ending it with an extra semicolon or slash. Pros: can be adapted to all the formats + won't break anything (at worst, existing software just shows an extra semicolon at the end). Cons: existing software doesn't know what it really means.
  2. Have an entirely separate file, just listing words we might want to ignore (characters only). Pros: absolutely no change needed to the existing CedPane formats; new file can be worked on separately; we can add a word to the new file even before we have a proper CedPane entry for it (good for handling awkward words from other dictionaries). Cons: existing software still can't use it (but I'm leaning toward thinking option 2 might be better than option 1, as the cons are the same and the pros are better)
  3. Add some “fake” dictionary entries, like 都会=dōu huì=will all (to balance out the “dūhuì, metropolis” reading in ABC etc). I have some entries like this in my private dictionary, which I generally haven't put into CedPane but perhaps I should. (Most of them have a comment like "this entry added to outweigh ABC's reading of..." which I can miss out, but it might still be a good idea to mark them in some way so that new software knows what's going on.) Pros: works with all existing software. Cons: results in some awkward word-groupings in existing software: OK I can tolerate 都会="will all" (at least it's better than dūhuì in most cases), but I'd rather have 都=all 会=can.
  4. Add some more "phrase" entries for reasonably-common phrases containing the affected words. Pros: probably the best solution for Wenlin 4, where long phrases are not shown but the individual words in them are still shown as written in the long-phrase entry. Cons: we will not be able to think of all the possible phrases in advance. (I have some long-phrase private entries, which I also added in to the training data of my Annogen-generated apps—Annogen was designed to figure out rules like “show it like X only if the context contains Y” and then I give it lots of example phrases and sentences and hope for the best—which turns out to be an OK approach although still not perfect—but I haven't been able to put all the long-phrase entries into CedPane because in many cases there haven't really been enough search results to count them as public domain, rather than just copying a particular phrase out of the particular article I was reading at the time.)

Right now I'm leaning toward option 2, but perhaps we should think about this more first.
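
To make option 2 concrete, here is a minimal sketch of reading such a separate file. The one-word-per-line layout, the `#` comment convention, and the name `load_split_list` are all assumptions made for illustration; the thread does not specify the file's format:

```python
import io

def load_split_list(lines):
    """Collect words (characters only, one per line) that should usually stay split."""
    words = set()
    for line in lines:
        word = line.strip()
        # Skip blank lines and (assumed) comment lines.
        if word and not word.startswith("#"):
            words.add(word)
    return words

# Stand-in for the separate file; a real file would be opened with open(..., encoding="utf-8").
sample_file = io.StringIO("# words to de-emphasize\n都会\n都會\n米高\n\n人和\n")
split_list = load_split_list(sample_file)
print(sorted(split_list))
```

Because the file holds characters only, a word such as 人和 can be listed even when the main dictionary has no entry for it.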

chinese-words-separator commented 2 years ago

RE:

Add some “fake” dictionary entries, like 都会=dōu huì=will all (to balance out the “dūhuì, metropolis” reading in ABC etc).

I think 米高 as "metres tall" is another example of a fake dictionary entry. I asked my wife's opinion on whether 米高 as "metres tall" makes any sense, and gave her this phrase:

两米高衣柜

She said the metre 米 is more attached to 两 than to 高. Thus, this is how it should be read:

两米 高 衣柜
or even this:
两 米 高 衣柜

Rather than

两 米高 衣柜

Of course, we will not provide copious entries like 两米, 三米, 我都, etc. just to overcome words that should not often be combined, like 都会 and 米高. We would rather see segmentation done like this:

两 米 高 衣柜

RE option 2:

we can add a word to the new file even before we have a proper CedPane entry for it (good for handling awkward words from other dictionaries)

Yes, that's definitely a pro. I'm leaning toward option 2 now. Awkward words need to be de-emphasized so that they are not prominently listed first in the dictionary software's definitions. And with it being a separate file, we can fight those awkward words while the jury is still out on whether something should even be defined, e.g., 家的 (old wife), 都会 (all will), 米高 (metres tall). We can de-emphasize those awkward definitions without needing to create equally awkward definitions. We can just have a list of words that signals to segmentation software that combining those consecutive characters is optional, so that it can list the definitions with the most usual use of the consecutive characters (separate 都 and 会) first, before the unusual one (都会, metropolis), e.g.,

都 = all
会 = will
都会 = metropolis

The segmentation software will often (always, if the software is not employing AI) display 都会 in its non-combined form:

我 都 会 喜欢 你
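
A minimal sketch of that behaviour, using plain forward longest-match segmentation (a common non-AI baseline, not necessarily what any particular tool does). The function `segment`, its parameters, and the tiny dictionary are hypothetical:

```python
def segment(text, dictionary, split_list, max_len=4):
    """Greedy longest-match segmentation that never joins words on the split list."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            joinable = candidate in dictionary and candidate not in split_list
            if length == 1 or joinable:
                words.append(candidate)
                i += length
                break
    return words

dictionary = {"都会", "喜欢"}
print(" ".join(segment("我都会喜欢你", dictionary, split_list={"都会"})))  # 我 都 会 喜欢 你
print(" ".join(segment("我都会喜欢你", dictionary, split_list=set())))     # 我 都会 喜欢 你
```

The split list only stops the join; the dictionary entry for 都会 is still available, so a pop-up can list it after the single characters.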

It's more embarrassing for segmentation software makers to display 我 都会 喜欢 你 (I metropolis like you) than to display 大 都 会 很 干净. Without the shouldn't-often-be-combined file, it's hard or impossible to signal to learners of Chinese that there is a potential garden path in the sentence they are reading, if the segmentation software has already asserted that the 都会 in the sentence 我都会喜欢你 is 都会 and not 都 会

And if we have that separate shouldn't-often-be-combined file without definitions, we don't need to create a "metres tall" definition for 米高 to balance out its "Michael" definition. Thus segmentation software is not forced to combine 米 and 高

两 米 高 衣柜

If the shouldn't-often-be-combined database includes 米高, segmentation software makers can choose to display it as 米 高 separately, and dictionary software makers can then choose to de-emphasize 米高 as Michael as well, so 米 高 will be displayed by dictionaries as:

米 = metre
高 = tall
米高 = Michael

Currently, the fake "metres tall" entry is just a workaround to prevent segmentation software from displaying only "Michael". "Metres tall" can be considered a fake entry: in the brains of native Chinese speakers, they have these discrete words: , , , the discrete word 米高 is Michael in their brains, not "metres tall"

I'm looking forward to the optimal solution to this problem, and I'm leaning toward option 2 as well

Thanks!

chinese-words-separator commented 2 years ago

Watching Netflix with Chinese words separator, it's embarrassing to see compound words that usually should be read as separate words :) Ironically, the software is called Chinese words separator, but it more often combines words than separates them

image

Here's the initial list of compound words that I compiled that usually should be read as separate words:

米高
到了
家的
都会
都會
中的
美的
得了
大树
大樹
面的
把門
把门
的话
的話
有了
那是
把头
把頭
才能
几号
幾號
我去
妈的
媽的
你等
一声
一聲
会要
道外
不想
你妈
你媽

So far I haven't encountered compound words of three or more characters that are out of context in the sentence they appear in. The words in the list above are all two-character compounds

After including 你妈 in the split list:

image

ssb22 commented 2 years ago

Thanks. Indeed "text segmentation" (or splitting or separation) does seem to be a misnomer, because all segmentation algorithms are really joiners, not splitters. If you give the code a completely empty dictionary, it would presumably default to calling every single character a separate word (unless for some reason you've programmed it to treat runs of unknown characters as new words, which is likely to be incorrect far more often than it's correct), so the real task is to figure out how and when words should be joined together. But few learners would know what you meant if you said you had a Chinese word joiner....
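
The "joiner" view can be sketched directly: start from single characters and merge neighbours only when the dictionary licenses the result. This is an illustrative toy with a hypothetical name (`join_words`), not the algorithm of any tool mentioned in this thread:

```python
def join_words(text, dictionary):
    """Start with one word per character and join adjacent words found in the dictionary."""
    words = list(text)
    i = 0
    while i < len(words) - 1:
        joined = words[i] + words[i + 1]
        if joined in dictionary:
            # Merge the pair and stay put, in case the result can grow further.
            words[i:i + 2] = [joined]
        else:
            i += 1
    return words

print(join_words("我喜欢你", set()))      # empty dictionary: every character stands alone
print(join_words("我喜欢你", {"喜欢"}))   # one entry: exactly one join happens
```

With an empty dictionary nothing is ever joined, so all the interesting decisions are about when to join.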

ssb22 commented 2 years ago

Reopening as the auto commit script somehow put only 2 words in the word-overrides file; there should have been more