ssb22 / CedPane

Chinese-English Dictionary Public-domain Additions for Names Etc (CedPane)
http://ssb22.user.srcf.net/cedpane/
The Unlicense

Split overrides #51

Closed chinese-words-separator closed 1 year ago

chinese-words-separator commented 1 year ago

菲多 吃的 一行 我人 书哈 書哈 大雪 林前

Forgot to include the sources. I think some of them are commonly interpreted as two words

林前 https://www.xigushi.com/zlgs/14148.html#:~:text=白桦林前还有一泓透明的湖泊

ssb22 commented 1 year ago

Thanks. It's a pity about losing the sources for some of these, as I do wonder if in some cases adding another word or phrase entry might be better than just adding an override:

chinese-words-separator commented 1 year ago

Thanks. It's a pity about losing the sources for some of these, as I do wonder if in some cases adding another word or phrase entry might be better than just adding an override:

  • 菲多 is very often a name—I'd like to see the sentence where it wasn't, to see if there's some other way of catching that exception

The last one I found was in mid-sentence, but I did not save the link. I found another one, though here the 菲 in 菲多 is just an abbreviation of the Philippines, which makes it easy for 菲 to be wrongly absorbed into a compound word:

https://www.flw.ph/thread-1106637-1-1.html#:~:text=菲多家银行网上转账免手续费,直到今年底

  • Yes to 吃的 it's obvious

  • 一行 not sure (if we have both the yīxíng and yīháng readings)

CC-CEDICT has 一行:

一行 一行 [yi1 xing2] /party/delegation/

Google has that as party, troupe

I've seen instances where 一行 is read as "one row", not as party/delegation/troupe. It's unlikely that CC-CEDICT will accept yi1 hang2, as contributions of compound words that are merely AB = A + B (一行 = one row; 一 = one, 行 = row) are not accepted: the meaning can be easily inferred when the components have only a few glosses each. Besides, to learners who are not aware of the nuances of word segmentation, a separate entry for 一行 = one row would look odd. It would give them the impression that the dictionary is bloated, and make them wonder whether it also contains 二行, 三行 and so on.

This is the same rationale CC-CEDICT applied when it declined my suggestion to add "meter high" as an additional definition for 米高 (Michael). I now concur with them on this principle. Besides, "meter high" is hardly a valid compound word.

I guess if I contributed the compound word tiger shark 虎鯊 to CC-CEDICT, it would not be accepted either, as there is exactly one gloss for each component of 虎鯊. It might be added to a dictionary for some other language X (a cc-cXdict counterpart) if the definition of 虎鯊 had no one-to-one mapping with "tiger" and "shark" in that language, say in a country where the tiger shark makes people think not of a tiger but of some other animal, such as a zebra. But I digress.

  • Yes to 我人 it's used in phrases like "my person" as well as "we"

Found this; in this instance it's not used as "we":

https://www.zhenhunxiaoshuo.com/nishifuworenshaqianduo/#:~:text=你师父我人傻钱多

  • Not sure about 书哈/書哈, I've only seen it as a name, except in phrases like 有声书哈利波特 which should be handled by the entries for 有声书 and 哈利

I remember finding it (I forgot to save the link) in someone's comment about a certain book, with 书哈 at the end of a sentence. Perhaps I can add a rule to CWS's parser: if a two-syllable compound word ends with 哈 and sits at the end of the sentence, just split it. It might not be a good rule, though: in cases where nobody is actually exclaiming 哈 after the first character, the valid compound word would no longer surface to the learner by default.

  • 大雪 in the ABC is "dàxuě ①heavy snow ②Great Snow (21st solar term)" which looks OK. If CEDICT has just Great Snow then that's incomplete (after all it has an entry for 大雨 heavy rain), but we could put in an override for now as a workaround.

  • 林前 I'm wondering if it would be better to add an entry for 白桦林 (it seems there are enough white birch forests out there to justify our treating it as a phrase, and it would fix that particular sentence at least—and the fact that it has 白桦林前 and not just 林前 might suggest the writer thought just 林前 would not be clear enough and therefore we'd be unlikely to see it)

I think 桦林 would be enough. Rationale:

https://www.youtube.com/watch?v=9v-eyk0BB5w https://www.youtube.com/watch?v=Iro19GB6fH8

Not sure why it's in quotes here:

http://www.tibet.cn/cn/ecology/202301/t20230130_7350910.html#:~:text=“桦林”之美不止于林

Creating an entry for 桦林 would also give sense to this: https://zh.wikipedia.org/zh-hans/樺林紅菇#:~:text=桦林红菇

ssb22 commented 1 year ago

Thanks, yes 菲多家 with 菲 being short for 菲律宾 is a good one, as 多家 seems to be quite common in phrases (another example: 菲多家网站被黑). I don't think there's any sensible way to handle this other than overriding 菲多 (it's not as if 多家 would make a good entry).
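To illustrate what a split override does in the 菲多家 case, here is a minimal sketch using greedy forward longest-match segmentation. This is an assumption for illustration only: neither CedPane nor CWS is claimed to segment this way, and `segment`, `lexicon` and `split_overrides` are hypothetical names with a toy word list.

```python
def segment(text, lexicon, split_overrides, max_len=4):
    """Greedy forward longest-match segmentation (illustrative only).

    Words listed in split_overrides are never emitted as a single token,
    even though they are present in the lexicon.
    """
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for n in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + n]
            # A single character is always accepted as the fallback.
            if n == 1 or (cand in lexicon and cand not in split_overrides):
                out.append(cand)
                i += n
                break
    return out


# Toy lexicon (hypothetical): 菲多 is present as a name.
lexicon = {"菲多", "银行"}

print(segment("菲多家银行", lexicon, set()))     # 菲多 kept whole
print(segment("菲多家银行", lexicon, {"菲多"}))  # 菲多 split by the override
```

With the override in place, 菲 and 多 come out as separate tokens, which is the behaviour wanted for 菲多家银行. The cost is that the name reading of 菲多 is no longer the default.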

Looks like 一行 as yīháng can mean "a line" as well as "profession", so yes we'll probably have to make that one an override too (no good way of working around this by adding other entries instead, which I'd always prefer to overrides if it can be done). I have in the past added entries like 下周三 and 十一点钟 to work around wrong splits, but that's generally best done when the number range is limited (you can add "all the days of the week" and "all the hours on the clock" quite easily, but 一千零一行 might be more of a problem).
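The "limited number range" point can be sketched as follows: when the range is small and closed, every headword can simply be enumerated. This is a hypothetical helper, not how the CedPane entries were actually produced, and the choice of 两 rather than 二 for two o'clock is my assumption.

```python
# Closed, small ranges: days of the week and hours on the clock.
WEEKDAYS = ["一", "二", "三", "四", "五", "六", "日"]
# Assumption: two o'clock is written with 两 rather than 二.
HOURS = ["一", "两", "三", "四", "五", "六",
         "七", "八", "九", "十", "十一", "十二"]


def limited_range_entries():
    """Enumerate headwords such as 下周三 and 十一点钟."""
    days = ["下周" + d for d in WEEKDAYS]
    hours = [h + "点钟" for h in HOURS]
    return days + hours


entries = limited_range_entries()
print(len(entries), entries[:3])
```

The same approach does not extend to 一千零一行, because the numeral in front of 行 is unbounded, which is why an override is the more practical fix there.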

虎鯊 is in the ABC as "bullhead shark"—this might be another “doves and pigeons” situation (鸽子 used for both). I'm inclined to think entries for things like well-known names of species ought to be OK even if they do break down straightforwardly. After all, the dictionary user might not know that "tiger shark" in English translates literally into the Chinese, so having the entry would at least tell them "yes it's OK to do this"... I've put in some entries after doing a bit of research into how something is said, finding that the correspondence is more literal than I expected, and thinking let's have an entry just to make that point. But it's up to them. I wouldn't have a CedPane entry for something that's in the ABC though (unless we're adding a very different meaning), because I'm trying to avoid stepping on the ABC's toes too much seeing as I keep the main copy in Wenlin (which has the ABC on mouseover).

人傻钱多 is netspeak for "rich fool"; I will add an entry, but 我人 still needs an override as there are other instances where it's not "we".

Yes I think splitting 哈 at end of sentence is probably a good idea, although that could go wrong if a sentence really does end with a name that ends in 哈 (likely quite rare)—or maybe treat like word-override, so the alternative reading is still there it's just not given first?
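A rough sketch of the sentence-final 哈 rule, combined with the "alternative reading is still there, just not given first" idea. The function name, the punctuation set and the return shape are all hypothetical; this is not taken from CWS's parser.

```python
# Assumed set of sentence-ending punctuation marks.
SENTENCE_END = set("。！？!?…")


def candidates(word, following):
    """Return candidate segmentations for `word`, preferred one first.

    `following` is the text that comes right after `word`.
    """
    whole = [word]
    at_end = following == "" or following[0] in SENTENCE_END
    if len(word) == 2 and word.endswith("哈") and at_end:
        split = [word[0], word[1]]
        # Prefer the split, but keep the whole word as an alternative.
        return [split, whole]
    return [whole]


print(candidates("书哈", "。"))      # split reading first, whole word second
print(candidates("书哈", "利波特"))  # not sentence-final: rule does not fire
```

Keeping the whole word as the second candidate means a sentence that really does end in a name ending with 哈 still has the name reading available.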

Yes 桦林's a good one.