unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org

Hyphenation #164

Open sffc opened 4 years ago

sffc commented 4 years ago

From @zbraniecki in https://github.com/unicode-org/rust-discuss/issues/9:

@jfkthame wrote mapped_hyph which, as far as I know, is not overlapping with any other Rust Intl crate!

We should discuss if we want hyphenation in ICU4X, or if we want to leave it as a third-party crate that can build on top of Unicode properties data that we provide.

jfkthame commented 4 years ago

To my mind, hyphenation doesn't really fit very naturally into an ICU-like project. It's not got much to do with Unicode as such -- character properties, etc. -- except to the extent that the data involved will presumably be encoded using Unicode (but that would be equally true for any text-processing utility); it's more about language than script.

I suppose in theory hyphenation is related to locale, and the necessary data to drive it could be collected in CLDR, but AFAIK it's not currently there (is it?), so I see little reason to bring the hyphenation engine into the Unicode library.

sffc commented 4 years ago

I don't know all the details, but my assumption is that hyphenation is probably in the same functional unit as segmentation (word breaks, sentence breaks, line breaks). CLDR/Unicode defines rule-based segmentation, which works pretty well, but there are also ML-based segmentation engines that have better accuracy.

jfkthame commented 4 years ago

I don't think it is, really. For grapheme, word, and sentence segmentation it's possible to define a default behavior based on Unicode character properties that works pretty reasonably for most cases, but for hyphenation that's not the case; it is much more strongly language-dependent, and there is no "generic" fallback that is usable in the absence of language-specific rules.

alerque commented 4 years ago

I think it would make sense for development to keep hyphenation in a different crate. There are a number of ways hyphenation is dissimilar from other kinds of segmentation. As @jfkthame noted it's highly language dependent. This brings with it a raft of things like exception scenarios and even the need to override values. That's just not a feature of other Unicode character level properties, and I don't think the concerns will mix and match well.

sffc commented 4 years ago

I put the issue on the backlog to gather feedback. There is no plan to act on this ticket at this time.

Once we have a Unicode properties API ready in ICU4X, a downstream crate implementing hyphenation can depend on us.

tapeinosyne commented 3 years ago

(I maintain the hyphenation crate and would be delighted to ensure it is suitable for the needs of an organized ICU initiative, either by working within or alongside it. I agree, however, that the crate itself need not belong to the ICU4X component set.)

sffc commented 1 year ago

Discussion about hyphenation from Slack:

@Manishearth - hmm, Flutter has an interesting Hyphenation API that builds on top of ICU: https://cs.opensource.google/flutter/engine/+/master:third_party/txt/src/minikin/Hyphenator.cpp I wonder if this is something we should have

@mihnita - I doubt that icu and unicode have enough info to do hyphenation.

@Manishearth - seems fair. markus also just told me something similar

@zbraniecki - how does this API pull in layout information?

@Manishearth - it seems to mostly use script and joining data

@zbraniecki - that's not hyphenation. that's not even line segmentation.

@mihnita - Here are (for example) the hyphenation rule files for TeX: https://www.tug.org/tex-hyphen/ And even that is not enough: Hunspell, for example, also has lists of exceptions.
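
For readers unfamiliar with how TeX-style rule files are applied, here is a minimal sketch of the pattern scheme they use (Liang's algorithm) with an exception list consulted first. It is not the API of mapped_hyph, the hyphenation crate, or ICU4X. The names `Hyphenator`, `parse_pattern`, and `breaks` are hypothetical, and the four patterns in the usage note are a handful picked for illustration; real pattern sets contain thousands of entries per language.

```rust
use std::collections::HashMap;

/// Split a TeX-style pattern such as "hy3ph" into its letters ("hyph")
/// and one weight per gap around those letters ([0, 0, 3, 0, 0]).
fn parse_pattern(pattern: &str) -> (String, Vec<u8>) {
    let mut letters = String::new();
    let mut weights = vec![0u8];
    for ch in pattern.chars() {
        if let Some(d) = ch.to_digit(10) {
            *weights.last_mut().unwrap() = d as u8;
        } else {
            letters.push(ch);
            weights.push(0);
        }
    }
    (letters, weights)
}

/// Hypothetical container for one language's patterns and exceptions.
struct Hyphenator {
    patterns: HashMap<String, Vec<u8>>,
    exceptions: HashMap<String, Vec<usize>>,
}

impl Hyphenator {
    /// `exceptions` are whole words with explicit breaks, e.g. "ta-ble".
    fn new(patterns: &[&str], exceptions: &[&str]) -> Self {
        let patterns: HashMap<String, Vec<u8>> =
            patterns.iter().map(|p| parse_pattern(p)).collect();
        let mut exc: HashMap<String, Vec<usize>> = HashMap::new();
        for e in exceptions {
            let mut word = String::new();
            let mut breaks = Vec::new();
            for ch in e.chars() {
                if ch == '-' {
                    breaks.push(word.chars().count());
                } else {
                    word.push(ch);
                }
            }
            exc.insert(word, breaks);
        }
        Hyphenator { patterns, exceptions: exc }
    }

    /// Character indices in `word` before which a break is permitted.
    fn breaks(&self, word: &str) -> Vec<usize> {
        let lower = word.to_lowercase();
        // Exceptions override the patterns entirely.
        if let Some(b) = self.exceptions.get(&lower) {
            return b.clone();
        }
        // Dots mark the word boundaries so patterns can anchor to them.
        let chars: Vec<char> = format!(".{}.", lower).chars().collect();
        // scores[k] is the weight of the gap just before chars[k].
        let mut scores = vec![0u8; chars.len() + 1];
        for start in 0..chars.len() {
            for end in start + 1..=chars.len() {
                let key: String = chars[start..end].iter().collect();
                if let Some(weights) = self.patterns.get(&key) {
                    for (i, &w) in weights.iter().enumerate() {
                        scores[start + i] = scores[start + i].max(w);
                    }
                }
            }
        }
        // Odd weight permits a break; even weight forbids it.
        // Word index i corresponds to chars[i + 1] because of the leading dot.
        (1..lower.chars().count())
            .filter(|&i| scores[i + 1] % 2 == 1)
            .collect()
    }
}
```

With the illustrative patterns `hy3ph`, `he2n`, `hen5at`, and `1na`, `breaks("hyphenation")` returns `[2, 6]`, i.e. hy-phen-ation: `1na` alone would allow a break before the first "n", but the even weight from `he2n` outranks it. Adding `ta-ble` as an exception makes `breaks("Table")` return `[2]` without consulting the patterns at all, which is the role the exception lists play.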

@Manishearth - yeah i'm still trying to find out what they're doing here, it seems like best-(low)-effort. was surprised to see a Hyphenation.cpp because i always assumed it was complex

@mihnita - What they might do there is figure out how to "render" the hyphenation if you have the info about where it should happen: what kind of hyphen character to use. And there are languages where you add a hyphen both at the end of the line and at the beginning of the next line:

Encyclo- -pedia

I think Dutch does that. And to break lines properly with hyphenation you have extra rules. For example you don't break the lines after one single syllable. And you don't break if it results in a "bad word". Android has 4 different options controlling the hyphenation algorithm: https://developer.android.com/reference/android/widget/TextView#attr_android:hyphenationFrequency
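
The two points above (filtering out breaks that leave too little of the word on either side, and choosing how the break is rendered) can be sketched separately from the step that finds break candidates. This is a rough illustration under stated assumptions, not what Flutter, Android, or any ICU API does: `HyphenStyle`, `usable_breaks`, and `split_at_break` are hypothetical names, and the minimums count characters as a simple stand-in for the "not after one single syllable" rule.

```rust
/// Illustrative rendering and filtering options; all fields are assumptions.
struct HyphenStyle {
    /// Minimum number of characters that must stay on the first line.
    min_prefix: usize,
    /// Minimum number of characters that must move to the next line.
    min_suffix: usize,
    /// Whether the hyphen is repeated at the start of the next line.
    repeat_on_next_line: bool,
}

/// Keep only the candidate breaks (character indices) that satisfy the minimums.
fn usable_breaks(word: &str, candidates: &[usize], style: &HyphenStyle) -> Vec<usize> {
    let len = word.chars().count();
    candidates
        .iter()
        .copied()
        .filter(|&i| i >= style.min_prefix && len.saturating_sub(i) >= style.min_suffix)
        .collect()
}

/// Render the two line fragments for a break before character index `at`.
fn split_at_break(word: &str, at: usize, style: &HyphenStyle) -> (String, String) {
    let head: String = word.chars().take(at).collect();
    let tail: String = word.chars().skip(at).collect();
    let next = if style.repeat_on_next_line {
        format!("-{}", tail)
    } else {
        tail
    };
    (format!("{}-", head), next)
}
```

For example, with `min_prefix: 2`, `min_suffix: 3`, and candidates `[1, 7, 10]` for "Encyclopedia", only `7` survives, and `split_at_break("Encyclopedia", 7, &style)` with `repeat_on_next_line: true` gives `("Encyclo-", "-pedia")`. A "bad word" check would need a language-specific word list and is left out here.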

@aethanyc - Firefox uses https://github.com/jfkthame/mapped_hyph (written in Rust)