tc39 / ecma402

Status, process, and documents for ECMA 402
https://tc39.es/ecma402/

Hyphenation API #93

Open sebmarkbage opened 8 years ago

sebmarkbage commented 8 years ago

Related to line breaks in https://github.com/tc39/ecma402/issues/60

JS prior-art: https://github.com/mnater/hyphenator

Chromium issue implementing CSS hyphens: auto: https://bugs.chromium.org/p/chromium/issues/detail?id=605840

Chrome is currently the only browser not supporting automatic hyphenation based on dictionaries but is now finally getting it.

It would be good to expose this hyphenation dictionary to JS as well for the same reasons as the BreakIterator API.

However, you typically want to use this a lot less aggressively than the BreakIterator. For the BreakIterator it makes sense to iterate through every possible break point since you have to do that anyway. However, for hyphenation, you typically only want to test a word.

I'd suggest modeling this after Blink's API which uses an instance based on locale:

let hyphenator = new Intl.Hyphenation(locale);
let lastIndex = hyphenator.lastHyphenLocation(str, beforeIndex);

WebKit's API uses a static method:

let available = Intl.Hyphenation.canHyphenate(locale);
let lastIndex = Intl.Hyphenation.lastHyphenLocation(str, beforeIndex, locale);

It seems more in line with the rest of the Intl APIs to use an instance per locale here. It also potentially gives engines some cache lifetime hints.
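To make the instance-per-locale shape concrete, here is a minimal sketch. `ToyHyphenation` is a hypothetical stand-in for the proposed `Intl.Hyphenation`, which does not exist; the break table is hand-written toy data, and treating `beforeIndex` as a strict upper bound with `-1` for "no break" is an assumption, not something either engine's API is known to specify.

```javascript
// Hypothetical stand-in for the proposed Intl.Hyphenation; not a real API.
// The break data is a hand-written toy table, not a real dictionary.
const TOY_BREAKS = {
  en: new Map([
    ["hyphenation", [2, 6, 7]], // hy-phen-a-tion
    ["dictionary", [3, 7]],     // dic-tion-ary
  ]),
};

class ToyHyphenation {
  constructor(locale) {
    // Resolve the locale once so per-locale data can be cached on the instance.
    const language = new Intl.Locale(locale).language;
    this.breaks = TOY_BREAKS[language] ?? new Map();
  }

  // Returns the last break offset strictly before `beforeIndex` (assumed
  // semantics), or -1 if the word has no such break.
  lastHyphenLocation(word, beforeIndex) {
    const offsets = this.breaks.get(word.toLowerCase()) ?? [];
    let last = -1;
    for (const offset of offsets) {
      if (offset < beforeIndex) last = offset;
    }
    return last;
  }
}

const hyphenator = new ToyHyphenation("en-US");
console.log(hyphenator.lastHyphenLocation("hyphenation", 7)); // 6
```

Because the table is looked up in the constructor, repeated calls on the same instance reuse it, which is the kind of cache lifetime hint an instance-based design could give an engine.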

cc @littledan

caridy commented 8 years ago

@sebmarkbage this sounds good! who can champion this?

sebmarkbage commented 8 years ago

Blink's API is more like Intl.Hyphenation.get(locale) which returns null if there is no dictionary available or it doesn't make sense to hyphenate that language.

I suppose this could be solved with Intl.Hyphenation.supportedLocalesOf(locales) like the other APIs.
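For comparison, this is how the existing `supportedLocalesOf` pattern reads today. `Intl.Collator` is used only because it already exists; `Intl.Hyphenation.supportedLocalesOf` is hypothetical, and `pickSupportedLocale` is an illustrative helper name. The result depends on the locale data the runtime ships with.

```javascript
// Intl.Collator stands in for the hypothetical Intl.Hyphenation here.
function pickSupportedLocale(requestedLocales) {
  const supported = Intl.Collator.supportedLocalesOf(requestedLocales);
  // An empty array plays the role of Blink's `null`: nothing usable was found.
  return supported.length > 0 ? supported[0] : null;
}

console.log(pickSupportedLocale(["en-US", "de"]));
```

A caller would check the result before constructing an instance, instead of handling a `null` return from a `get(locale)` style factory.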

sebmarkbage commented 8 years ago

I'd be willing to co-champion this but I'd like to get help from someone working closer with actual implementors to get feedback. Ideally from Microsoft since I have no idea what their constraints are.

littledan commented 8 years ago

@sebmarkbage I'm all for this, if we have a reasonable API that meets the concerns on all sides. Could you give a straw-man API of what you'd like to use, and I can run it by hyphenation implementers in Chrome? Although I'm not an expert here, I can be a go-between on the Chrome side. Would you want the API to look just like that linked hyphenator library, or do you think it's possible to have something more minimal?

jungshik commented 8 years ago

@littledan This is where we have to think about "web api vs ecma 402 api", IMHO.

sebmarkbage commented 8 years ago

I'd like to make the case for this being in 402. Current implementations layer the visual offset and the logical offset.

WebKit: https://github.com/WebKit/webkit/blob/c39dc07ade51fade8a07c82e70bae823a04dc360/Source/WebCore/rendering/line/BreakingContext.h#L703-L707 (Both WebKit and Blink have the same path.)

Firefox simply extracts all logical breaks: https://dxr.mozilla.org/mozilla-central/source/dom/base/nsLineBreaker.cpp#95-98

You first have to get the character offset where the visual break happens and then find an appropriate logical break earlier in the word - without regard to visual representation. In theory, maybe you would want to break the word differently depending on whether there's a ligature glyph in place, but I doubt you would handle both concerns in one integrated step. You'd still layer that logic on top of the logical breaks.
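The two-layer approach described above can be sketched roughly as follows. Every name is illustrative: `measureText` stands in for real font metrics, `lastHyphenLocation` for a dictionary-backed lookup, and moving the whole word to the next line when no break fits is an assumed fallback.

```javascript
function fitWithHyphen(word, maxWidth, measureText, lastHyphenLocation) {
  if (measureText(word) <= maxWidth) return [word, ""];

  // Step 1 (visual): find the longest prefix that fits, leaving room for "-".
  let visualOffset = 0;
  while (
    visualOffset < word.length &&
    measureText(word.slice(0, visualOffset + 1) + "-") <= maxWidth
  ) {
    visualOffset++;
  }

  // Step 2 (logical): last dictionary break at or before the visual offset.
  const breakAt = lastHyphenLocation(word, visualOffset + 1);
  if (breakAt <= 0) return ["", word]; // no usable break: move the whole word

  return [word.slice(0, breakAt) + "-", word.slice(breakAt)];
}

// Toy inputs: monospace metrics and hand-written breaks for a single word.
const measureText = (text) => text.length * 10;
const toyBreaks = [2, 6, 7]; // hy-phen-a-tion
const lastHyphenLocation = (word, beforeIndex) =>
  word === "hyphenation"
    ? toyBreaks.filter((offset) => offset < beforeIndex).pop() ?? -1
    : -1;

console.log(fitWithHyphen("hyphenation", 80, measureText, lastHyphenLocation));
// [ 'hyphena-', 'tion' ]
```

Only step 1 touches font metrics, so a custom renderer could swap in its own `measureText` without changing the logical lookup.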

If you use custom font layout and rendering, such as http://opentype.js.org, then you'd want to use its font metrics information and different logic for the visual concerns - which is neatly decoupled from the web API.

Therefore I think that the Hyphenation API, since it purely uses the logical representation (like character sets etc.), does indeed belong in ECMA 402.

sebmarkbage commented 8 years ago

@littledan This could plausibly be implemented on top of the Segmenter API if the segmenter API got a way to skip forward and search backwards. Something like word.advanceTo(offset).nextInReverse().
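As a rough illustration of the "skip forward" half, the `Intl.Segmenter` API that is available in current engines exposes `containing(index)` on the object returned by `segment()`. The sketch below uses that method rather than the `advanceTo`/`nextInReverse` names suggested above, which are only a proposal in this comment; `wordAt` is an illustrative helper, and the hyphenation lookup itself would still have to come from elsewhere.

```javascript
function wordAt(text, offset, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  // `containing` returns the segment covering `offset`, so the caller
  // does not have to walk every segment before it.
  const segment = segmenter.segment(text).containing(offset);
  if (!segment || !segment.isWordLike) return null;
  return { word: segment.segment, start: segment.index };
}

console.log(wordAt("automatic hyphenation support", 14));
// { word: 'hyphenation', start: 10 }
```

With the word and its start index in hand, a per-word hyphenation lookup could be applied to just that word, which matches the "only test a word" usage described at the top of the thread.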

sebmarkbage commented 8 years ago

For some use cases outside the web:

jungshik commented 7 years ago

Just a couple of random thoughts.

I believe that non-web use cases alone wouldn't make this suitable for Ecma 402.

Note that hyphenation requires a rather large amount of language-dependent data. One may point out that the break iterator does as well, but there are only a few locale-dependent tailorings of break iterators. OTOH, every language requiring hyphenation needs its own hyphenation rule data/dictionary. I'm not saying that this would exclude a hyphenation API from Ecma 402.

We can think of having an API that takes externally provided rule data. In that case, we'd have to agree upon the data format, which could turn out to be hard.
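To give a feel for what "externally provided rule data" might involve, here is a sketch using a deliberately simple, hypothetical format: a list of words with "-" marking each permitted break. Nothing in this thread agrees on such a format, and real hyphenation data is usually pattern-based rather than a plain word list; `parseHyphenationData` is an illustrative name.

```javascript
// Hypothetical data format: one word per entry, "-" marks a permitted break.
function parseHyphenationData(entries) {
  const table = new Map();
  for (const entry of entries) {
    const parts = entry.split("-");
    const offsets = [];
    let position = 0;
    // Every part except the last ends at a permitted break.
    for (const part of parts.slice(0, -1)) {
      position += part.length;
      offsets.push(position);
    }
    table.set(parts.join(""), offsets);
  }
  return table;
}

const externalTable = parseHyphenationData(["hy-phen-a-tion", "dic-tion-ary"]);
console.log(externalTable.get("hyphenation")); // [ 2, 6, 7 ]
```

Even this toy format raises the questions a shared format would have to settle, such as case handling, normalization, and how to express rules instead of whole words.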

zbraniecki commented 7 years ago

@jfkthame do you think we'd be interested in switching to that internally?

jfkthame commented 7 years ago

I suspect that switching to a JS hyphenation library for the internal implementation of CSS hyphens:auto would have an unwelcome performance cost (though I'd be happy to be proved wrong!)

sebmarkbage commented 7 years ago

Like other data in ECMA 402, a host implementation could provide other flags, APIs and data formats to define where this data lives if it is not embedded in the implementation itself. ECMA 402 wouldn't define the exact rules themselves.

littledan commented 7 years ago

@sebmarkbage Some complexity involved in a hyphenation API:

About a host environment providing other flags: I think we should actually get rid of that from 402 (discussion in #113). Regardless, those two items I listed above are more advanced outputs, not input flags. I think we would have to think this through upfront.

@jfkthame Agreed. A JS hyphenation API would be for users, and browsers could plug in at a lower level.