tc39 / ecma402

Status, process, and documents for ECMA 402
https://tc39.es/ecma402/

Intl.breakIterator #60

Closed caridy closed 2 years ago

caridy commented 8 years ago

Standardize Intl.v8BreakIterator.

Backpointers:

Update 1 (Sept 26th, 2016):

jungshik commented 8 years ago

/cc @jungshik, @littledan

srl295 commented 8 years ago

Use cases: rendering (Canvas etc… Thai…); console rendering (word wrap!); counting words/lines/sentences; translation tooling…

caridy commented 8 years ago

@littledan will champion this one.

littledan commented 8 years ago

I think we'd probably want a somewhat different API for this compared to what V8 currently ships, if it's not too late backwards compatibility-wise. The current API is documented here: https://code.google.com/p/v8-i18n/wiki/BreakIterator

I think a more ES2015-y way to do it would be to have a method instance.breakText("my string") on the instance which returns an iterator over the breaks in the string. Each item would be an object like {index: 1, breakType: "letter"}. To put the cherry on top, we should probably make the sole method breakText not be a bound function, unlike the current five-method API of bound functions, if this is the general strategy for new APIs.
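For illustration only, here is a minimal sketch of that shape. The class name `SketchBreakIterator` and the two-value `breakType` are assumptions made for the example, not part of any proposal text; the segmentation itself is delegated to `Intl.Segmenter`, the API this discussion later moved to.

```javascript
// Hypothetical sketch: `SketchBreakIterator` and `breakText` are illustrative names,
// not a shipped API. Intl.Segmenter does the actual segmentation.
class SketchBreakIterator {
  constructor(locale, options = {}) {
    // Assumption: "type" maps onto Intl.Segmenter's "granularity" option.
    this.segmenter = new Intl.Segmenter(locale, {
      granularity: options.type ?? "word",
    });
  }

  // A plain prototype method (not a bound function): instance.breakText(text).
  *breakText(text) {
    for (const { segment, index, isWordLike } of this.segmenter.segment(text)) {
      yield {
        // Offset of the break that ends this segment.
        index: index + segment.length,
        // Assumption: a simplified two-value breakType for the sketch.
        breakType: isWordLike ? "word" : "none",
      };
    }
  }
}

// Collect every break in a short string.
const breaks = [...new SketchBreakIterator("en").breakText("Hello world")];
```

Each call to `breakText` creates a fresh iterator, so one instance can walk several strings at the same time, at the cost of one small object per break.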

A possible downside is that this could have worse performance (for the object allocation, and also for accounting for the case where multiple strings are being iterated over by the same instance at the same time), but I don't think this proposal would introduce further implications for a high-performance implementation compared to lots of other ES2015 features. It would also mean making a brand new iterator in place of first--would this be very bad performance-wise?

What do you all think of this general API shape?

The first step towards this will be unshipping Intl.v8BreakIterator in V8, as the standardized version will likely be incompatible. Current usage is low, but nonzero, so we'll see how this goes. If there are a lot of complaints, then maybe I'll want to argue for sticking to the current API; or maybe the complaining users would be happy to hear that if they are OK with the new API, then they'll get the support in more browsers.

I don't think I'll be able to write up a proposal for the March TC39 meeting unfortunately.

littledan commented 8 years ago

I ended up deciding against unshipping v8BreakIterator in V8 when I unshipped several other nonstandard features (which all had much lower usage counts).

littledan commented 8 years ago

I wrote up a quick explainer doc explaining the motivation and a strawman API shape. It seems reasonable to me for this to include both line breaking and grapheme/word/sentence segmentation. Maybe hyphenation could go into the same API, just with a different type "hyphen" rather than an entirely different class (as I imagine the API would be similar).

Does anyone have any thoughts? I'm interested in both web developers and implementers.

mathiasbynens commented 8 years ago

The proposed API in https://github.com/littledan/BreakIterator#example looks great! I’m in favor of overloading the type to include 'hyphen' provided the API can remain similar.

sebmarkbage commented 8 years ago

I'm very worried about the performance of this API because the use of this API over native methods is going to be performance critical enough anyway. Additionally, anyone compiling native layout code to asm.js or wasm is going to want the lowest level possible access to that. I've seen nothing to indicate that iterables and the allocations they require can be optimized away in existing engines. Can you even iterate over a significant document without causing multiple young generation GCs? I'd like to see something to suggest that perf concerns are unfounded before moving forward with the alternative design. Otherwise I fear we'll have to use a polyfill anyway.

EDIT: I suppose supporting both would be an OK tradeoff if iterables aren't fast enough yet. Similar to how other iterable APIs have alternative iteration APIs.

The hyphenation API should be different. Unlike line breaks it is often possible to find a hyphenation point in the middle of a string without iterating through all of the possible ones. Using the iterator API would be very inefficient.

The way you do text-layout hyphenation is by first measuring the unhyphenated word, and only then finding the closest point to hyphenate if it is too long, which will give you a single direct value.
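A rough sketch of that lookup, under stated assumptions: `closestHyphenationPoint` is a hypothetical helper, the candidate points come from some hyphenation dictionary that is not shown, and `measure` is assumed to grow with prefix length so that a binary search is valid.

```javascript
// Hypothetical helper: `points` are candidate hyphenation offsets for one word,
// sorted ascending, supplied by a hyphenation dictionary (not shown here).
// `measure(text)` returns a rendered width; `available` is the remaining line width.
function closestHyphenationPoint(word, points, measure, available) {
  // Measure the unhyphenated word first; do no hyphenation work if it fits.
  if (measure(word) <= available) return null;

  // Binary search for the last point whose prefix (plus hyphen) still fits.
  let lo = 0;
  let hi = points.length - 1;
  let best = null;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (measure(word.slice(0, points[mid]) + "-") <= available) {
      best = points[mid];
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return best; // a single offset, or null if no point fits
}

// Assumption for the demo: a monospace font where every character is 1 unit wide.
const monospace = (s) => s.length;
const point = closestHyphenationPoint("hyphenation", [2, 6, 7], monospace, 8);
```

The result is one offset rather than a stream of breaks, which is why an iterator-shaped API fits this case poorly.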

IMO we can just look at what browsers already do rather than trying to be clever. They're designed that way for a reason.

jungshik commented 8 years ago

I'd rather not include 'hyphen' in the proposed API.

In addition to what @sebmarkbage wrote, hyphenation can change the input in some languages (e.g. German).

littledan commented 8 years ago

@sebmarkbage To the performance concern: What if %SegmentIterator% had an additional "low-level API" with three methods, advance (to imperatively move to the next match, returning undefined; the user could tell if they are at the end by observing the index getting too high, or maybe this could return true at the end), index and breakType (to get properties of the current breakpoint)?
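To make the idea concrete, a rough sketch of such a cursor follows. `SketchSegmentCursor` is a hypothetical name, `advance` uses the "return true at the end" variant, and a naive whitespace rule stands in for real Unicode segmentation; none of this is taken from the explainer.

```javascript
// Hypothetical sketch of the "low-level" shape: one mutable cursor, no per-break objects.
// Assumption: a naive whitespace rule stands in for real Unicode segmentation.
class SketchSegmentCursor {
  constructor(text) {
    this.text = text;
    this._index = 0;
    this._breakType = "none";
  }

  static isSpace(ch) {
    return ch === " " || ch === "\n" || ch === "\t";
  }

  // Moves to the next break; returns true once the end of the text is reached.
  advance() {
    const text = this.text;
    if (this._index >= text.length) return true;
    const startedInSpace = SketchSegmentCursor.isSpace(text[this._index]);
    let i = this._index + 1;
    // Extend the run while characters stay in the same class (space or non-space).
    while (i < text.length && SketchSegmentCursor.isSpace(text[i]) === startedInSpace) {
      i++;
    }
    this._index = i;
    this._breakType = startedInSpace ? "none" : "word";
    return i >= text.length;
  }

  // Properties of the current break position.
  get index() {
    return this._index;
  }

  get breakType() {
    return this._breakType;
  }
}

// Walk all breaks; the array here is only for demonstration.
const cursor = new SketchSegmentCursor("a lot of text");
const seen = [];
let done = false;
while (!done) {
  done = cursor.advance();
  seen.push(cursor.index);
}
```

The only allocation is the cursor itself, made once per piece of text; stepping through breaks mutates it in place.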

littledan commented 8 years ago

I updated the explainer with the low-level segmentation interface, though I won't be surprised if we get pushback for this. I assume it's OK to do an allocation when adopting a different piece of text to perform segmentation over, right?

sebmarkbage commented 8 years ago

Short pieces of text are likely to be combined into a single string rope often.

I'm not as concerned about those allocations since new pieces of text are often associated with allocations anyway. The allocations are proportional. E.g. you might have <span>a</span> <em>lot</em> of small <strong>segments</strong> and iterate through them independently but the number of allocations is proportional to the allocations you do for the data structures holding them anyway.

SebastianZ commented 8 years ago

> In addition to what @sebmarkbage wrote, hyphenation can change the input in some languages (e.g. German).

As far as I know, the only case in German to which this applied was hyphenation between c and k, which turned the c into another k. E.g. 'Zucker' became 'Zuk-ker'. This rule no longer applies since the orthography reform of 1996.

So, at least in German there is no such issue anymore, though I have no idea if other languages still have similar rules.

Sebastian

sebmarkbage commented 8 years ago

@SebastianZ there are a few other cases mentioned here http://www.unicode.org/L2/L2002/02279-muller.htm#4 for example in Swedish "tuggummi" becomes "tugg-gummi".

I think it is fairly rare to handle these special cases correctly but it'd be good for the API to handle it.
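One way an API could represent such cases is to let each hyphenation point carry replacement text. The sketch below is only an illustration: the `applyHyphenation` function and the `{start, end, pre, post}` record are hypothetical, not something proposed in this thread.

```javascript
// Hypothetical record shape: replace word.slice(start, end) with `pre` before the
// hyphen and `post` after it. Plain hyphenation is start === end with empty strings.
function applyHyphenation(word, { start, end, pre = "", post = "" }) {
  const head = word.slice(0, start) + pre + "-"; // text that stays on the first line
  const tail = post + word.slice(end); // text that moves to the next line
  return [head, tail];
}

// Swedish example from this thread: "tuggummi" becomes "tugg-" / "gummi".
const swedish = applyHyphenation("tuggummi", { start: 4, end: 4, post: "g" });

// Pre-1996 German example from this thread: "Zucker" becomes "Zuk-" / "ker".
const german = applyHyphenation("Zucker", { start: 2, end: 3, pre: "k" });
```

A plain offset cannot express either case, since the characters around the break differ from the original word.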

jungshik commented 8 years ago

We also need to support 'strictness' (for lack of a better term) either as a separate option or as values of 'type'.

CSS3 has 'strict', 'normal', 'loose' (and 'auto') for line-break, and ICU/CLDR support them. (When v8BreakIterator was written, there was no such distinction.)

littledan commented 8 years ago

I filed a new issue for strictness at https://github.com/littledan/Segmenter/issues/5 . Let's migrate all additional discussion of feature requests related to segmentation to that repository.

ryzokuken commented 3 years ago

Since Intl.Segmenter is almost done, can we close this?

sffc commented 3 years ago

> Since Intl.Segmenter is almost done, can we close this?

I think it should be closed when #553 is merged. I'll add it as a linked issue.