unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org

LineSegmenter should return a breakpoint at 0 #3283

Closed sffc closed 1 year ago

sffc commented 1 year ago

As part of the API review with @markusicu, he pointed out something I had noticed before and we discussed on Slack, which is that LineSegmenter does not return a breakpoint at index 0.

Here is the Slack conversation:

@sffc: Is it expected that WordSegmenter finds a breakpoint at index 0 but LineSegmenter does not?

@aethanyc: Yes, it's expected per spec. LB2: Never break at the start of text. https://www.unicode.org/reports/tr14/#LB2 WB1 & WB2: Break at the start and end of text, unless the text is empty. https://www.unicode.org/reports/tr29/#WB1

@eggrobin: Besides « the spec says so » (which may be good enough an explanation for ICU4X, but not so much for the Unicode Standard), there is a distinction in purpose between the line breaking algorithm, which returns line break opportunities, and the segmentation algorithms, which return the boundaries of some kind of segment. The initial segment always has a boundary at 0, but the line break opportunities do not delimit anything meaningful, and breaking the line before the text isn’t going to help you make it fit.

@aethanyc: Thanks for the insight! Yeah, for an actual line-breaking layout engine, a line break opportunity at 0 is not meaningful. I found that the spec has another explanation: "These two rules are designed to deal with degenerate cases, so that there is at least one character on each line, and at least one line break for the whole text." https://www.unicode.org/reports/tr14/#LB2

CC @aheninger @macchiati @frankyftang @makotokato @Manishearth to weigh in.

Manishearth commented 1 year ago

I don't know if there's much more I can say beyond what @eggrobin has said: not linebreaking at 0 makes sense here. We should probably document the concept of linebreak opportunities on the relevant APIs, if not done already.

markusicu commented 1 year ago

I don't know why UAX14 says not to emit a line break at the start of text (LB2), maybe @macchiati can say. However, the segmenter APIs I have seen (ICU and Google-internal and I think BasisTech) always segment text, that is, when you do something like Rust Iterator.tuple_windows() you partition the whole input text into segments.

This is just how such an API should work. And yes, the start and end of the text are not line break opportunities in some sense, but it's trivial to always ignore the first boundary if you don't want segments.

And LB3 says "Always break at the end of text." It would be weird to break at one but not the other, and hostile to have different segmenters behave differently at the edges of text.
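A minimal sketch of the partitioning described in this comment, assuming the segmenter's output has already been collected into a slice of byte offsets. The names `segments` and `break_opportunities` and the hard-coded boundary list are hypothetical and are not part of ICU4X. Note that `tuple_windows` comes from the itertools crate; the sketch uses the standard library's `slice::windows` instead.

```rust
/// Partition `text` into segments, given boundaries that include both
/// 0 and `text.len()`. Each adjacent pair of boundaries delimits one segment.
fn segments<'a>(text: &'a str, boundaries: &[usize]) -> Vec<&'a str> {
    boundaries
        .windows(2)
        .map(|pair| &text[pair[0]..pair[1]])
        .collect()
}

/// Drop a leading 0, if present, to keep only the positions that UAX #14
/// would call line breaks.
fn break_opportunities(boundaries: &[usize]) -> &[usize] {
    if boundaries.first() == Some(&0) {
        &boundaries[1..]
    } else {
        boundaries
    }
}

fn main() {
    let text = "Hello World";
    // Assumed output of a line segmenter that reports the start of text.
    let boundaries = [0, 6, 11];
    println!("{:?}", segments(text, &boundaries)); // ["Hello ", "World"]
    println!("{:?}", break_opportunities(&boundaries)); // [6, 11]
}
```

If the boundary list did not start at 0, `segments` would silently lose the first segment, which is the inconvenience this comment is pointing at.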

eggrobin commented 1 year ago

UAX14 is the way it is because it is not specified as a segmentation algorithm (contrary to sentences, words, or graphemes, the segments are not the significant thing, the break opportunities are). See under https://www.unicode.org/reports/tr14/#LB3 for an explanation of the invariants it tries to maintain instead.

If it is convenient for this API to behave like a segmentation algorithm, it could do that, but the documentation needs some wording to caution that the first thing produced by the iterator is not actually a break. It would be nice to have some usage examples that depend on that behaviour.

Note that when it comes to actually laying out the text, to treat the start of text as a break opportunity would be nonconformant, as LB2 and LB3 are in the non-tailorable part (and as has been mentioned, it would be silly anyway: if your unbreakable line is too long, leaving a blank line above won’t do much good).

sffc commented 1 year ago

Should it be named LineBreaker instead of LineSegmenter? 🤔

Manishearth commented 1 year ago

Yes, I think that makes some sense.

Manishearth commented 1 year ago

Though the problem is that a true line breaker needs further information to actually break lines. What we really have is LineBreakOpportunityFinder but that's a terrible name.

zbraniecki commented 1 year ago

1) If 0 is never meaningful, then keeping an API that always returns it to be always ignored feels like a bad design.
1.1) I agree that not returning 0 but returning end of text feels awkward. I'm wondering what's the goal of LB3. My mental model that breaking is useful between two text fragments. What's the goal of flagging end of text as a break opportunity? Especially if it happens always? When would the implementation not ignore it? What's the use case?
2) Reading the conversation makes me understand that Segmenter should return ranges (of segments) rather than single indexes (of break opportunities).
2.1) Is there any known use case of line breaking that is not looking for segments? My understanding is that there is no meaningful difference between returning segments and line break opportunities here - no additional information, or control flow, comes from taking one over the other.
2.2) If that's the case, then the distinction is purely aesthetic - "Line Segmenter segments text by break opportunities" and "Line Breaker breaks lines into segments" seem interchangeable.

eggrobin commented 1 year ago

Is there any known use case of line breaking that is not looking for segments?

I suppose my question is whether there is a use case of line breaking that is looking for segments; while any sequence of inter-codepoint positions can be used to define segments, the resulting segments would be very weird things here.

One thing to note is that, while the API does not expose that, the breaks defined by UAX14 come in two flavours (mandatory breaks and break opportunities), which are clearly a property of the breaks, not of the segments.

In contrast all breaks are the same as defined in UAX29; ICU additionally classifies the word boundaries by status (Letter, Number, None), but that really is a property of the segments (numbers are segments whose final boundary has status number, words are segments whose final boundary has status letter).
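To illustrate the point that the mandatory/opportunity distinction belongs to the break rather than to the segment, here is a hedged sketch. The types `BreakKind` and `LineBreak` and the function `hard_lines` are hypothetical; as noted above, the API under discussion does not expose this information, and the break list is written by hand.

```rust
/// The two flavours of break described by UAX #14.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BreakKind {
    Mandatory,
    Opportunity,
}

/// A break position tagged with its flavour (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct LineBreak {
    offset: usize,
    kind: BreakKind,
}

/// Split `text` at mandatory breaks only, ignoring mere opportunities.
fn hard_lines<'a>(text: &'a str, breaks: &[LineBreak]) -> Vec<&'a str> {
    let mut lines = Vec::new();
    let mut start = 0;
    for b in breaks.iter().filter(|b| b.kind == BreakKind::Mandatory) {
        lines.push(&text[start..b.offset]);
        start = b.offset;
    }
    lines
}

fn main() {
    let text = "ab cd\nef";
    // Hand-written breaks: an opportunity after the space, a mandatory
    // break after the newline, and the break at the end of text.
    let breaks = [
        LineBreak { offset: 3, kind: BreakKind::Opportunity },
        LineBreak { offset: 6, kind: BreakKind::Mandatory },
        LineBreak { offset: 8, kind: BreakKind::Mandatory },
    ];
    println!("{:?}", hard_lines(text, &breaks)); // ["ab cd\n", "ef"]
}
```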

zbraniecki commented 1 year ago

the breaks defined by UAX14 come in two flavours (mandatory breaks and break opportunities), which are clearly a property of the breaks, not of the segments.

Agree.

My position is that we don't need to name this LineBreaker to talk about Break Opportunities. LineSegmenter is a good name for an API which allows for segmentation, by iterating over break opportunities. This is the model I suggest here.

makotokato commented 1 year ago

LB2 is "sot ×", so we don't return 0. But when I test ICU4C (ubrk), it returns 0 for ubrk_first... If it is convenient for the first offset to be 0, I can change it.

Manishearth commented 1 year ago

1. My mental model that breaking is useful between two text fragments. What's the goal of flagging end of text as a break opportunity? Especially if it happens always? When would the implementation not ignore it? What's the use case?

In part, "end of text" is also often "between two text fragments", so users can rely on this when concatenating bits of text. The pair of properties "no empty lines" and "text contains at least one break" is useful: this lets you model lines as a list of characters where there is logically a break after the list and you don't have to deal with a hypothetical "there are characters but no breaks" case. In that sense LB2 and LB3 are interop smoothers: they ensure implementations can be certain in using a general useful class of implementation details for linebreaking without having to trawl through UAX 14 implementation code.
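The pair of properties mentioned here can be written down as a small check. This is only a sketch for non-empty text, with breaks given as strictly UAX #14-style byte offsets (no leading 0); the function name `upholds_edge_invariants` is hypothetical.

```rust
/// For non-empty `text`, check the edge behaviour implied by LB2 and LB3:
/// no break at offset 0, a final break at the end of text, and strictly
/// increasing offsets in between. Together these imply "no empty lines"
/// and "at least one break".
fn upholds_edge_invariants(text: &str, breaks: &[usize]) -> bool {
    let starts_after_zero = breaks.first().map_or(false, |&b| b > 0);
    let ends_at_eot = breaks.last() == Some(&text.len());
    let strictly_increasing = breaks.windows(2).all(|w| w[0] < w[1]);
    starts_after_zero && ends_at_eot && strictly_increasing
}

fn main() {
    let text = "Hello World";
    println!("{}", upholds_edge_invariants(text, &[6, 11])); // true
    println!("{}", upholds_edge_invariants(text, &[0, 6, 11])); // false: break at 0
    println!("{}", upholds_edge_invariants(text, &[])); // false: characters but no breaks
}
```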

Is there any known use case of line breaking that is not looking for segments?

Actually I kinda feel like the main use case of line breaking is not looking for segments: you can model it as looking for segments that it then "merges" into lines, but that's just creating break opportunities with more steps. I agree that they are under the hood logically equivalent.

If that's the case, then the distinction is purely aesthetic - "Line Segmenter segments text by break opportunities" and "Line Breaker breaks lines into segments" seem interchangeable.

I think part of the problem here is that the latter implies that the segments are linebreaks, which they're not (not yet).

markusicu commented 1 year ago

Please think of your users and make the API behave consistently.

macchiati commented 1 year ago

I don't know why UAX14 says not to emit a line break at the start of text (LB2), maybe @macchiati can say.

The reason is as already described in the thread; implementers don't want to ever line-break at the start of the line. But for all other segmentation work you want to return the start and end of the text as boundaries; otherwise it complicates the caller code unnecessarily. And it isn't crazy to want to get the segments between line-break boundaries; a common technique is to get the segment containing the width boundary, and then apply hyphenation. Or to use dynamic programming to get the optimal line-breaks within a whole paragraph, etc. (eg Knuth's algorithm, maybe also updated to allow for kerning between segments).

I agree that always returning a boundary at the start of the text makes for a more consistent API — you'd just want to carefully document that the first position was not a line break position according to UAX#14.
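A sketch of the first technique mentioned in this comment, finding the segment that contains the position where the available width runs out. It assumes that position is already known as a byte offset and that the boundaries are sorted; `segment_containing` is a hypothetical helper, not an ICU4X function.

```rust
/// Return the `(start, end)` boundaries of the segment containing `pos`,
/// or `None` if `pos` is not covered by the boundary list.
fn segment_containing(boundaries: &[usize], pos: usize) -> Option<(usize, usize)> {
    // Index of the first boundary strictly greater than `pos`.
    let hi = boundaries.partition_point(|&b| b <= pos);
    if hi == 0 || hi == boundaries.len() {
        return None;
    }
    Some((boundaries[hi - 1], boundaries[hi]))
}

fn main() {
    // With a boundary at 0, every position inside the text is covered.
    println!("{:?}", segment_containing(&[0, 6, 11], 8)); // Some((6, 11))
    println!("{:?}", segment_containing(&[0, 6, 11], 3)); // Some((0, 6))
    // Without it, the first segment has no start boundary to report.
    println!("{:?}", segment_containing(&[6, 11], 3)); // None
}
```

The last call shows the extra case a caller has to handle when the start of text is not reported as a boundary.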

aethanyc commented 1 year ago

Is there any known use case of line breaking that is not looking for segments?

None of the line break iterator [^1] usages in Firefox operate on segments. They all iterate over line breakpoints to find the longest substring whose length does not exceed a wrap length. For example, see the usages in PlainTextSerializer or XMLContentSerializer.

That being said, it doesn't matter whether 0 is a line breakpoint or not in Firefox's use cases. So I'm OK with changing the ICU4X line segmenter to return 0 so that the API behavior is consistent with the UAX29 segmenters and with ICU4C.

[^1]: Firefox's current line break iterator is a legacy one that does not conform to the UAX 14 standard. We want to replace it with the ICU4X line segmenter, of course.
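The kind of usage described in this comment can be sketched as a greedy wrap over breakpoints. This is not Firefox's code: the function `wrap` is hypothetical, the breakpoints are hand-written, and length is measured in bytes as a stand-in for whatever measure a serializer or layout engine would really use.

```rust
/// Greedily wrap `text` so that each line is the longest run ending at a
/// breakpoint that fits in `max_len` bytes. A segment longer than
/// `max_len` is emitted on its own line, unbroken.
fn wrap<'a>(text: &'a str, breaks: &[usize], max_len: usize) -> Vec<&'a str> {
    let mut lines = Vec::new();
    let mut start = 0;
    // The most recent breakpoint seen after `start`.
    let mut last_fit: Option<usize> = None;
    // A leading 0, if the segmenter reports one, is simply skipped.
    for &b in breaks.iter().filter(|&&b| b > 0) {
        if b - start > max_len {
            // Extending to `b` would overflow: end the line at the
            // previous breakpoint, if there is one.
            if let Some(fit) = last_fit {
                lines.push(&text[start..fit]);
                start = fit;
            }
        }
        last_fit = Some(b);
    }
    if start < text.len() {
        lines.push(&text[start..]);
    }
    lines
}

fn main() {
    let text = "The quick brown fox";
    println!("{:?}", wrap(text, &[4, 10, 16, 19], 10)); // ["The quick ", "brown fox"]
    // Same result when the list starts with 0.
    println!("{:?}", wrap(text, &[0, 4, 10, 16, 19], 10));
}
```

Because the loop only ever looks at breakpoints after the current line start, the result is the same whether or not the list begins with 0, which matches the observation above.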

eggrobin commented 1 year ago

I wrote:

If it is convenient for this API to behave like a segmentation algorithm, it could do that […] It would be nice to have some usage examples that depend on that behaviour.

@macchiati responded:

it isn't crazy to want to get the segments between line-break boundaries; a common technique is to get the segment containing the width boundary, and then apply hyphenation. Or to use dynamic programming to get the optimal line-breaks within a whole paragraph, etc. (eg Knuth's algorithm, maybe also updated to allow for kerning between segments).

Aha, that is the kind of example I was looking for. I wonder whether I could squish something like that into an actual example in the documentation…