w3c / jlreq

Text Layout Requirements for Japanese
https://w3c.github.io/jlreq/

Should Japanese verbs be segmented by morpheme for selection? #212

Open r12a opened 4 years ago

r12a commented 4 years ago

I've been exploring what happens when you double-click on Japanese text, and have summarised the results of some exploratory tests. See https://github.com/w3c/character_phrase_tests/issues/30 for a discussion of what happens when clicking inside verbs.

When you double-click in the middle of a sentence, Firefox simply selects the adjacent run of characters that belong to the same Unicode block as the character you clicked.
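That rule can be sketched in a few lines of Python. This is an illustration only, not Firefox's actual code: the block classifier covers just the ranges relevant to this example, and `run_at` is a hypothetical helper name.

```python
def block_of(ch):
    """Very rough Unicode-block classifier for Japanese text.
    (A real implementation would consult the full Unicode block data.)"""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "Hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "Katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "CJK Unified Ideographs"
    return "Other"

def run_at(text, index):
    """Return the maximal run of characters sharing a Unicode block
    around text[index] -- approximating the Firefox behaviour above."""
    target = block_of(text[index])
    start = index
    while start > 0 and block_of(text[start - 1]) == target:
        start -= 1
    end = index + 1
    while end < len(text) and block_of(text[end]) == target:
        end += 1
    return text[start:end]
```

So in 私は歩きました, double-clicking on 歩 selects only 歩, while clicking on き selects きました: the kanji/hiragana block boundary, not the word boundary, decides the selection.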

Chrome and Safari, however, apply some logic to the resulting selection, so that if you click inside the word 歩きました the browser highlights one of the following morphological segments, depending on where you click: 歩き, ま, or した.

The latter is what ICU does. See the ICU segmentation demo page.
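For illustration, the observed split can be reproduced with a toy greedy longest-match segmenter over a hand-written unit table. Real engines such as ICU use dictionaries plus statistical models, not a table like this; `UNITS` and `segment` are hypothetical names, and the table is chosen specifically to mirror the 歩き / ま / した split described above.

```python
# Hypothetical unit table mirroring the observed Chrome/Safari split.
UNITS = ["歩き", "ま", "した"]

def segment(text):
    """Greedy longest-match segmentation over UNITS; characters with
    no matching unit become single-character segments."""
    out, i = [], 0
    while i < len(text):
        match = next(
            (u for u in sorted(UNITS, key=len, reverse=True)
             if text.startswith(u, i)),
            text[i],  # fall back to a single character
        )
        out.append(match)
        i += len(match)
    return out
```

Running `segment("歩きました")` yields the three segments 歩き, ま, and した, so a double-click anywhere in the verb would highlight one of those three pieces.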

My question is whether that's what the average Japanese user wants to happen. Are they happy to be able to select the word root (including incidental hiragana, as in 歩き), or are they frustrated at constantly having to extend the selection to get what they want, i.e. the whole word?

kidayasuo commented 4 years ago

A good question. I feel the current behaviour (of Safari and Chrome; Firefox's is ancient) generally works as I expect, especially for particles and words that have an inflection. Those are the units I work with when I edit text to improve it.

Once in a while I am frustrated by compound kanji words. Segmenters typically dissect them into the smallest chunks, but often the unit I want to edit or replace is the whole compound word. English has the same issue with compound words: the city name "Palo Alto" is two words, but they are inseparable; "Palo Alto town hall" is four words, and so on. The only difference is that word boundaries are not visible in Japanese text (plus the ambiguities that creates).

xfq commented 4 years ago

> I once in a while frustrated by compound Kanji words. They typically dissect them in smallest chunks but often the unit I want to edit / replace it the whole compound word. English has the same issue however with compound words. City of “Palo Alto” is two words however they are inseparable. “Palo Alto town hall” is four words, etc. Only the difference is that word boundaries are not visible (+ ambiguities) in Japanese text.

I agree that a greedy match is a good default behavior. However, there are also cases where I just want to match one stem of a compound (like 田舎 in 田舎育ち), and a lazy match would help, so I think it would be useful to make the rules customizable.

kidayasuo commented 4 years ago

I agree. Given that expanding the range is easier than reducing it, I think the current behaviour of matching a smaller semantic unit is a reasonable one.
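The expand-rather-than-reduce model can be sketched as selection over nested segment spans: a double-click picks the smallest span containing the click, and a further gesture expands to the smallest span strictly containing the current selection. The span list for 田舎育ち below is hand-written for illustration; a real engine would derive it from morphological analysis, and `select` is a hypothetical helper name.

```python
TEXT = "田舎育ち"
SPANS = [(0, 2), (2, 4), (0, 4)]  # 田舎 | 育ち | the whole compound

def select(click, previous=None):
    """Return the smallest span containing `click`; if `previous` is
    given, the smallest span that strictly contains that selection
    (one expansion step)."""
    def ok(s):
        if not (s[0] <= click < s[1]):
            return False
        if previous is None:
            return True
        return s != previous and s[0] <= previous[0] and previous[1] <= s[1]
    candidates = [s for s in SPANS if ok(s)]
    return min(candidates, key=lambda s: s[1] - s[0]) if candidates else None
```

A click on 育 first selects 育ち, and one expansion step reaches the whole compound 田舎育ち, which matches the "smaller unit first, expand on demand" behaviour discussed above.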

r12a commented 4 years ago

The thing I was particularly curious about is not so much the separation of 歩き from ました, which I can see some usefulness for, although it's not the sort of thing that's done in most languages (for example, it doesn't happen in Korean). I was more curious to know whether also separating ま from した is useful or irritating to the average Web user. I can see the logic, but I'd be quite interested to hear whether ordinary users appreciate the morphological segmentation that is applied to Japanese.

kidayasuo commented 4 years ago

I think you need to ask somewhere other than GitHub if you want to hear what "ordinary users" say ;)

A challenge of asking ordinary users is that they typically do not know or remember what they want or do, unless they have had an extremely pleasant experience (unlikely when selecting a range) or an unpleasant one. The best approach would be to observe what they do rather than just asking. The flip side is experts: they will explain from their knowledge, often ignoring what they might actually feel.

It is possible that the separation of ま/した is not optimal. However, allowing した to be selected seems reasonable if you want to change it to ません (though that may be because I know the grammar; I am certainly not an ordinary user), and it can still be non-intuitive to ordinary users.

Also, if you take the abilities of input methods into account, selecting a wider range, e.g. a "bunsetsu" segment, might make more sense at this point. Because current input methods are not good at converting such a short string, users would often need to type a complete bunsetsu even if they had initially selected a shorter range. Future input methods might solve this by looking at the surrounding text; some input methods already have this ability, but it is still limited.

Yes, this is an interesting area to explore.

murata2makoto commented 4 years ago

@r12a Wakati-gaki is relevant, since it inserts spaces between small units.

There are several sets of rules for wakati-gaki. I once studied them and found that there is no consensus; I even found that elementary school textbooks are not always consistent. Moreover, dictionaries for computers sometimes include some compound words but not others.