tc39 / proposal-intl-segmenter

Unicode text segmentation for ECMAScript
https://tc39.github.io/proposal-intl-segmenter/

Word segmenter with generic locale #136

Open my2iu opened 3 years ago

my2iu commented 3 years ago

I have a bunch of strings of text of unknown language/locale, and I want to find the word breaks so that I can lay them out in paragraphs in svg. Is there some way to create a word segmenter using some sort of generic or default locale? Would, say, the English segmenter still properly handle Chinese text? Or do I need some sort of language detection to figure out the locale of the segmenter I need to instantiate?

FrankYFTang commented 3 years ago

What you need is line breaking. Some languages use word boundaries to determine line breaks, but some do not. For example, Chinese and Japanese do not need to break lines at word boundaries, but English, Thai, Arabic and Khmer do.

Note that JavaScript's Intl.Segmenter has no support for line break, only word break. The early draft included a line break mode, but the Apple Safari team opposed it, claiming it would encourage incorrect use of the facility. I tried to convince them of the need for JavaScript line breaking for SVG but didn't get enough support from others. If you feel strongly that we need to support a line break mode for that, please speak up.
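
For readers who want to check this in their own runtime, here is a minimal sketch. It assumes an engine that implements Intl.Segmenter; the helper name `tryGranularity` is made up for illustration. It asks for each granularity in turn, including a hypothetical "line" value, and records whether construction succeeds.

```javascript
// Granularities defined for Intl.Segmenter; "line" is not among them.
const supported = ["grapheme", "word", "sentence"];

// Hypothetical helper: returns "ok" or the name of the error thrown.
function tryGranularity(granularity) {
  try {
    new Intl.Segmenter("en", { granularity });
    return "ok";
  } catch (err) {
    // An unrecognized granularity value is rejected with a RangeError.
    return err.name;
  }
}

const results = Object.fromEntries(
  [...supported, "line"].map((g) => [g, tryGranularity(g)])
);
console.log(results);
```

The three defined granularities should report "ok", while "line" is rejected.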

my2iu commented 3 years ago

Yes, but my understanding was that it would take years for anyone to agree on an API for line breaking. I assumed the compromise was that the word segmenting API was designed to be rich enough to allow programmers to construct a rudimentary line segmenter themselves.

The no/generic locale thing is important to multilingual users who are forced to type Chinese/Arabic/etc. into English web apps, or English expats typing English into local Japanese websites, etc. Not everything can be localized properly for all languages.
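
As a rough way to explore the generic-locale question, the sketch below segments one mixed-language string twice: once with the runtime's default locale (by passing `undefined`) and once with "en". The helper `wordSegments` and the sample string are assumptions made for illustration, and how well any one locale handles another language's text depends on the data the implementation ships, so this only shows how to inspect the result.

```javascript
// Hypothetical helper: word-segment `text` using `locale`.
// Passing undefined as the locale asks the runtime for its default locale.
function wordSegments(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  return Array.from(segmenter.segment(text), ({ segment, index, isWordLike }) => ({
    segment,
    index,
    isWordLike,
  }));
}

// Sample text mixing Latin, Chinese and Arabic script (illustrative only).
const sample = "Hello 世界, مرحبا";
const withDefault = wordSegments(sample, undefined);
const withEnglish = wordSegments(sample, "en");

console.log(withDefault.map((s) => s.segment));
console.log(withEnglish.map((s) => s.segment));
```

Comparing the two printed arrays for your own strings shows whether the locale choice changes the boundaries in the engine you are targeting.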

FrankYFTang commented 3 years ago

> Yes, but my understanding was that it would take years for anyone to agree on an API for line breaking. I assumed the compromise was that the word segmenting API was designed to be rich enough to allow programmers to construct a rudimentary line segmenter themselves.

No. That is exactly why I think it is a bad idea not to support line break. The compromise came from the Apple engineers' belief that if we do not support line break, nobody will attempt to use ECMAScript to build a line breaker, and people will use HTML and CSS to break lines instead. They claim that if we support line break, people will NOT use CSS + HTML to do line breaking, and that if we do not support it, no one will use word break to do that. I am afraid that if we do not support line break, people will use word break incorrectly to implement line break. And your reasoning totally fulfills my prediction. Word break is NOT a subset of line break, nor is line break a subset of word break. They follow two different systems. For SOME languages, line breaks may depend on word breaks, but only for those languages.

It would be nice if you could argue your use cases for WHY you need to break lines yourself rather than depending on CSS to break lines for you. The Apple engineers believe that everyone who needs line breaking COULD use the CSS line break facility and should not use JavaScript to do it themselves, in particular because line breaking also needs glyph boundary information, which is not accessible from JavaScript.
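
A small illustration of why word boundaries are not line break opportunities, assuming an engine with Intl.Segmenter (the variable names are made up for this sketch). Word segmentation puts a boundary between a word and the punctuation that follows it, which is a place where Unicode line breaking rules generally do not allow a break.

```javascript
const wordSegmenter = new Intl.Segmenter("en", { granularity: "word" });
const input = "Hello, world!";

// Collect each segment's text; punctuation and spaces are segments too.
const pieces = Array.from(wordSegmenter.segment(input), (s) => s.segment);
console.log(pieces);

// Position of the comma relative to the word before it.
const commaFollowsHello = pieces.indexOf(",") === pieces.indexOf("Hello") + 1;
console.log(commaFollowsHello);
```

Treating every word boundary as a place to wrap could therefore start a line with the comma, which is the kind of misuse the comment above warns about.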

FrankYFTang commented 3 years ago

> would take years for anyone to agree on an API for line breaking

That is not the reason. The reason is that deciding line breaks requires two things together:

  1. the logical line break points - where the text COULD break
  2. the font metrics

The layout NEEDS BOTH to perform line breaking, but JavaScript has no support for (2) now.
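
The two ingredients above can be sketched as a greedy wrapper. Everything here is an assumption for illustration: `wrapLines` is a hypothetical helper, the chunk boundaries are written by hand (Intl.Segmenter does not provide line break opportunities), and `monospaceMeasure` stands in for real font metrics by counting one unit per character.

```javascript
// Stand-in for font metrics (ingredient 2): one unit per character.
const monospaceMeasure = (text) => text.length;

// Greedy wrap. `chunks` are the pieces of text between candidate break
// points (ingredient 1); `measure` returns the width of a string.
function wrapLines(chunks, maxWidth, measure) {
  const lines = [];
  let current = "";
  for (const chunk of chunks) {
    if (current !== "" && measure(current + chunk) > maxWidth) {
      // The chunk does not fit: close the current line, start a new one.
      lines.push(current);
      current = chunk;
    } else {
      current += chunk;
    }
  }
  if (current !== "") lines.push(current);
  return lines;
}

// Hand-written break opportunities for an English sentence.
const chunks = ["The ", "quick ", "brown ", "fox"];
const wrapped = wrapLines(chunks, 10, monospaceMeasure);
console.log(wrapped); // [ 'The quick ', 'brown fox' ]
```

A real layout would replace `monospaceMeasure` with shaped glyph widths and would get its break opportunities from a line breaking algorithm rather than from a hand-written list.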

my2iu commented 3 years ago

Well, that’s a little silly then. There’s a lot of people trying to standardize custom font rendering that need this. HTML/CSS are too high level for modern HTML5 web apps. Anyone doing HTML5 canvas games, WebGL, WebVR, WebXR, svg charts and graphs, visualization, or anything graphical will have their own low-level text rendering. I, myself, have written my own vector graphics tool in HTML5, and I had to use low-level libraries Typr.js and Harfbuzz to do my text rendering. And I need a low-level internationalization library to do my word breaking and line breaking, but the full ICU with all its dictionaries and stuff is too big to include on a web page. Since all web browsers already include the ICU in some way, I always thought the whole purpose of this standardization effort was to provide an API to let programmers access it instead of forcing them to download it or to use some server solution.

> That is not the reason. The reason is that deciding line breaks requires two things together:
>
>   1. the logical line break points - where the text COULD break
>   2. the font metrics
>
> The layout NEEDS BOTH to perform line breaking, but JavaScript has no support for (2) now.

Yeah, I assumed the hold-up was that you couldn’t agree on an API for this, not that you couldn’t agree on whether line breaking was actually needed.

FrankYFTang commented 3 years ago

> not that you couldn’t agree on whether line breaking was actually needed.

Actually, THAT IS the disagreement. I believe it is needed, but OTHERS believe line break is not needed as a JavaScript API. It would be very easy for me, as a V8 implementer, to add line break support, but we have had a hard time convincing Apple to agree with us. They believe the problem should be solved only by HTML and CSS, not at the level of JavaScript, and that if JavaScript provides such support it will be misused and damage the web. I believe that if we do not support line break, people who need it will misuse word break to implement it anyway, and the damage will be even worse.

If you strongly believe adding line break support is essential, please file another bug here requesting a line break granularity, and lay out the use case and motivation clearly. I will then try to reopen the issue in the ECMA-402 committee and TC39 and ask for reconsideration.

FrankYFTang commented 3 years ago

Also, I would suggest filing bugs asking for line break support with V8 (https://bugs.chromium.org/p/v8/issues/, assign to me (ftang@chromium.org), Components: Internationalization), Mozilla, Microsoft Edge, and JSC. If all browser vendors receive more feature requests and agree with you, it may pressure them to accept the feature in ECMA-402.

my2iu commented 3 years ago

I don't think I have the permissions to assign bugs to users or components in v8, so if I were to create a v8 bug, it would probably get lost in triage. It might be better if you were to create it then.

FrankYFTang commented 3 years ago

You can create one and send me the link. I will assign it to myself then.

-- Frank Yung-Fong Tang 譚永鋒 / 🌭🍊 Sr. Software Engineer

my2iu commented 3 years ago

I think I already ended up filing it incorrectly, but I’ll let you fix it up:

https://bugs.chromium.org/p/v8/issues/detail?id=11744