ssb22 / CedPane

Chinese-English Dictionary Public-domain Additions for Names Etc (CedPane)
http://ssb22.user.srcf.net/cedpane/
The Unlicense

Split override: 会长 #59

Closed: chinese-words-separator closed this issue 1 year ago

chinese-words-separator commented 1 year ago

Reference: https://www.youtube.com/watch?v=orfhB33Mf3M&t=2460s#:~:text=这是经常拿枪的手才会长的老茧

會長 会长 [hui4 zhang3] /president of a club, committee etc/

But here it can also be read as "will grow/develop" (会 + 长).

ssb22 commented 1 year ago

This is a borderline case. Sampling a corpus of old magazine articles suggests 会长 is a real word about 90% of the time it occurs, and a "false positive" for the other 10%. A 10% "false positive" rate is high enough to be a concern, although not as high as some words we've seen. The tricky question is "would the harm of not recognising it when we should, be greater than the harm of recognising it when we shouldn't?" It will still be the second choice, but I wonder if the cháng reading of 长 would be an even worse first choice.

But we do have other options. We can add "phrase" entries for the exceptions. I wouldn't want to do that if it meant adding thousands of phrases, but in this case I think we can catch most of the exceptions I saw in the magazine sample using just 3 phrase entries: 才会长, 会长出 and 会长满. By the way, the next character after 才会长 is often 得 but I've never seen 的, so I'm suspecting that YouTube subtitle was automatically generated and the 的 should be a 得 (but this won't matter for 会长 if we make the "phrase" entry 才会长 not 才会长得).
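(To illustrate, not CedPane's or any particular segmenter's actual code: a phrase-exception check could look something like this, using the three phrase entries above.)

```ts
// Hypothetical sketch (not CedPane's actual mechanism): suppress the
// word 会长 when the surrounding text matches a known exception phrase.
const exceptionPhrases = ["才会长", "会长出", "会长满"];

function isException(text: string, start: number, word: string): boolean {
  // Does any exception phrase cover the candidate word at this position?
  return exceptionPhrases.some(phrase => {
    const offset = phrase.indexOf(word);
    if (offset < 0) return false;
    const from = start - offset;
    return from >= 0 && text.startsWith(phrase, from);
  });
}

const sentence = "这是经常拿枪的手才会长的老茧";
console.log(isException(sentence, sentence.indexOf("会长"), "会长")); // true, via 才会长
```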

Another option is to use a "weight" bias: if we can bias 才会 to "weigh" more than 会长, we're telling it to prefer 才会+长 to 才+会长. I keep wondering whether to publish some "weights" for some of these words, where "weight" means "something a bit like frequency, but not really frequency because we can override it to work around problems", but I'm not sure how that would be matched up with other data sets. So I've tended to prefer adding "phrase" entries to catch exceptions.
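(Purely as a sketch of what that bias would do, with weights invented for illustration rather than taken from any real frequency data: a dynamic-programming segmenter picks the split with the highest total weight, so giving 才会 a higher weight than 会长 flips 才+会长 into 才会+长.)

```ts
// Illustrative weighted segmenter (Viterbi-style): choose the split
// whose words have the highest total weight. Weights are invented.
const weights: Record<string, number> = {
  "才会": 5, "会长": 4, "才": 1, "会": 1, "长": 1, "手": 1, "老茧": 3,
};

function segment(text: string): string[] {
  const n = text.length;
  const best: number[] = Array(n + 1).fill(-Infinity);
  const prev: number[] = Array(n + 1).fill(-1);
  best[0] = 0;
  for (let i = 0; i < n; i++) {
    if (best[i] === -Infinity) continue;
    for (let j = i + 1; j <= Math.min(n, i + 4); j++) {
      // Unknown single characters get weight 0 as a fallback.
      const w = weights[text.slice(i, j)] ?? (j === i + 1 ? 0 : -Infinity);
      if (best[i] + w > best[j]) { best[j] = best[i] + w; prev[j] = i; }
    }
  }
  const out: string[] = [];
  for (let j = n; j > 0; j = prev[j]) out.unshift(text.slice(prev[j], j));
  return out;
}

console.log(segment("手才会长老茧")); // ["手","才会","长","老茧"]: 才会 outweighs 会长
```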

chinese-words-separator commented 1 year ago

By the way, the next character after 才会长 is often 得 but I've never seen 的, so I'm suspecting that YouTube subtitle was automatically generated and the 的 should be a 得 (but this won't matter for 会长 if we make the "phrase" entry 才会长 not 才会长得).

Likely. I just tested the browser's built-in segmenter: it segments correctly when 得 is used instead of 的. 长得 weighs more than 会长, whereas 会长 weighs more than 长的. So I think the split override is no longer needed for 会长:

["这","是","经常","拿","枪","的","手","才","会长","的","老茧"]

["这","是","经常","拿","枪","的","手","才","会","长得","老茧"]

Is there a sentence where 会_长的 is grammatically correct? (I'm still learning the language 😄, and even native speakers trip over the three de.) If there isn't, I won't include 会长 in the split overrides, since this case would just be a typo or flawed machine generation.

chinese-words-separator commented 1 year ago

I jumped the gun: it should be 长_得, not 长得. 长得 has a different sense too:

長得 长得 [zhang3 de5] /to look (pretty, the same etc)/

The browser's built-in segmenter's weighting mechanism can't produce the segmentation with the better sense.

ssb22 commented 1 year ago

Interesting: I hadn't realised that in late 2020 Safari 14 and Chrome 87 (but not Firefox) added Intl.Segmenter from the new ECMAScript Internationalization API specification. Looking at the source code for Chromium's implementation of this, it's a wrapper around Unicode's ICU BreakIterator library, which uses RBBIStateTable code contributed to Unicode by IBM in 2016, and it looks like the ICU state table for Chinese was derived from this cjdict.txt, which was worked on at Google and IBM by combining the old libtabe with frequency data from Google web crawls. They did have CC-CEDICT in there too, but then they realised:

"The original work contains words taken from CC-CEDICT distributed under CC-SA license. However, CC-SA license is not compatible with ICU's MIT/X style license, all of CC-CEDICT unique words were removed from the data."

(the commit for this was in November 2012). So the word-break algorithm in modern ICU and the browsers is basically looking at the word list from the 1999 TaBE project plus frequency data derived from Google web crawls (and possibly other tweaks).

The word list file in the libtabe download is in libtabe/tsi-src/tsi.src (it uses the Big5 character set so you'll likely need iconv to see it on a modern system). It contains 130,000+ entries and was apparently put together by Pai-Hsiang Hsiao when he was a research assistant at Academia Sinica for just 1 year in 1998-99 (it looks like he's now at TripAdvisor). Most of the TaBE words look good, although I did see a couple I could argue aren't really words.

My guess is he will have put that file together using an early version of the Academia Sinica Balanced Corpus of Modern Chinese, which was first published in 1995. I've not managed to find a downloadable copy of the Sinica corpus, but I have used another one of the period (the PH corpus) and I found PH's manual word-splitting does contain bugs (some of which I've corrected but haven't yet figured out whether I'm allowed to publish a corrected version). I'd imagine getting lots of people to manually word-split a corpus is going to result in some mistakes when people get tired of the job, not all of which will have been picked up by their proofreaders. So my guess is that the few "not really words" in the TaBE list were caused by bugs in the manual splitting of the Sinica Corpus back in the 1990s, but I've not checked how many of these are still around in ICU's version of the wordlist that's now used by browsers.
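(If iconv isn't to hand, a few lines of Node can do the same conversion, assuming Node's default full-ICU build, which bundles a Big5 decoder:)

```ts
// Decode libtabe's Big5-encoded word list in Node
// (assumes Node's default full-ICU build, which includes Big5).
import { readFileSync } from "node:fs";

const bytes = readFileSync("libtabe/tsi-src/tsi.src");
const text = new TextDecoder("big5").decode(bytes);
console.log(text.split("\n").slice(0, 10).join("\n")); // first few entries
```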

(None of that will affect these particular words though....)

chinese-words-separator commented 1 year ago

(None of that will affect these particular words though....)

Yes, the garden-path problem is hard (perhaps impossible) to tackle in a segmenter algorithm. So for the meantime, for borderline cases, splitting a word into individual characters and leaving the interpretation to the reader is a good recourse. One good example is 家的: although it is defined as "(old) wife", I've yet to see it used as such; most of the time it's used for family/home stuff:

[screenshot]

Despite splitting 家_的, CWS still needs to show 家的 "(old) wife" in the dictionary's list, so that learners will be aware of other readings of the 家 and 的 combination.
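(A sketch of that behaviour; these names are made up, not CWS's real API:)

```ts
// Hypothetical sketch: a split override separates the characters for
// display, but dictionary lookup still reports the combined entry so
// learners see the "(old) wife" reading of 家的.
const splitOverrides = new Set(["家的", "长得", "会长"]);
const dictionary: Record<string, string> = {
  "家": "family; home", "的": "(possessive particle)",
  "家的": "(old) wife",
};

function entriesAt(text: string, i: number): string[] {
  const results: string[] = [];
  for (let len = 1; len <= 2; len++) {
    const candidate = text.slice(i, i + len);
    if (dictionary[candidate]) {
      const display = splitOverrides.has(candidate)
        ? `${candidate} (shown split)` : candidate;
      results.push(`${display}: ${dictionary[candidate]}`);
    }
  }
  return results;
}

console.log(entriesAt("他家的猫", 1));
// ["家: family; home", "家的 (shown split): (old) wife"]
```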

Similarly for 长得/長得: since in some sentences it can be read as something other than "to look (pretty, the same etc)", e.g.,

[screenshot]

it's good to just include 长得/長得 in the split overrides:

[screenshot]

Aside from 长得/長得, 会长/會長 has to be included in the split overrides too. Otherwise, if only 长得/長得 is included and not 会长/會長, the 会 will attach to the 长, i.e.:

[screenshot]
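(A greedy longest-match toy shows the interaction; again just a sketch, not CWS's actual matcher:)

```ts
// Greedy longest-match sketch showing why both overrides are needed.
function greedy(text: string, words: Set<string>): string[] {
  const out: string[] = [];
  let i = 0;
  while (i < text.length) {
    let len = Math.min(4, text.length - i);
    while (len > 1 && !words.has(text.slice(i, i + len))) len--;
    out.push(text.slice(i, i + len));
    i += len;
  }
  return out;
}

const base = new Set(["会长", "长得", "老茧"]);
const onlyChangDe = new Set([...base].filter(w => w !== "长得"));
const both = new Set([...base].filter(w => w !== "长得" && w !== "会长"));

console.log(greedy("才会长得老茧", onlyChangDe)); // ["才","会长","得","老茧"]: 会 still attaches to 长
console.log(greedy("才会长得老茧", both));        // ["才","会","长","得","老茧"]
```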

chinese-words-separator commented 1 year ago

I think I need to make CWS's tokenizer more sophisticated, or give it some more rules. Otherwise, if CWS keeps relying on the word-split list, learners won't be able to mark split words as learned. 会长 is included in HSK6; learners should be able to mark it as learned.
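(One possible shape for this, purely hypothetical and not CWS's actual data model: track learned status on dictionary entries rather than on display tokens.)

```ts
// Hypothetical sketch: "learned" status lives on dictionary entries,
// so HSK words like 会长 stay markable even when a split override
// shows them as separate characters.
const learned = new Set<string>();

function markLearned(word: string) { learned.add(word); }

function isLearnedAt(text: string, i: number): boolean {
  // A split token still counts as part of a learned word if some
  // learned dictionary entry covers this position.
  for (const word of learned) {
    const from = Math.max(0, i - word.length + 1);
    for (let s = from; s <= i; s++) {
      if (text.startsWith(word, s)) return true;
    }
  }
  return false;
}

markLearned("会长"); // HSK6 word, marked despite the split override
console.log(isLearnedAt("本会会长致辞", 3)); // true: position 3 is inside 会长
```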

ssb22 commented 1 year ago

Yes, text segmentation is hard & researchers are still trying to come up with new ways to do it 😊

Try importing the latest CedPane with 才会长 added; that might help a bit, although it's not ideal, I know.

For "Pinyin Web" Chrome & Firefox extension and "Pinyin Web & EPUB" on Android I made Annotator Generator which tries to figure out "context" rules (inspired by Yarowsky's algorithm for word-sense disambiguation), so if you give it the right examples it could come up with a rule like "don't put 会长 if there's a 才 within 3 bytes of the start". In Annogen parlance, the 才 can be a "negative indicator" for the word 会长. Some words have "positive indicators" which means we recognize them only if one of the indicators is nearby; others have "negative indicators" meaning we recognize them by default unless one of those indicators is nearby. That can quite easily cope with multiple reading cases like 差 (it's probably chāi if it's near 他, etc) although I did have fun breaking that one in a talk recently. The main problems with my approach are (1) making sure you've got enough example sentences for it to do a decent job making the rules and (2) figuring out how to integrate it with a traditional weight-driven frequency approach, because right now I can't use weighting data if I'm using indicators😞 there's got to be some way to do both at the same time I just haven't come up with it yet. (Annogen does have some code that tries to translate its rules into weights for a more traditional segmenter to use, but it's only partial and it can't do the reverse translation.)