tc39 / proposal-intl-segmenter-v2

Version 2 of Intl Segmenter. Adding line break support.
https://tc39.github.io/proposal-intl-segmenter-v2/
MIT License

`granularity: "syllable"` #12

Open LeaVerou opened 6 months ago

LeaVerou commented 6 months ago

It appears that the concept of a syllable is pretty universal across languages, yet in Intl.Segmenter, granularity goes from grapheme (basically letters) straight to word, with nothing in between. Syllabification is very non-trivial to do manually, especially in a language-agnostic manner, and is needed in a number of use cases.
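For context, here is a minimal sketch of the gap described above, using only the granularities that `Intl.Segmenter` accepts today. The sample string and variable names are just illustrative choices.

```javascript
// Segment one string at the two granularities mentioned above.
const text = "Reading is fun.";

// Grapheme granularity: roughly one segment per user-perceived character.
const graphemeSegmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const graphemes = Array.from(graphemeSegmenter.segment(text), (s) => s.segment);

// Word granularity: segments also cover spaces and punctuation, so keep
// only the ones flagged as word-like.
const wordSegmenter = new Intl.Segmenter("en", { granularity: "word" });
const words = Array.from(wordSegmenter.segment(text))
  .filter((s) => s.isWordLike)
  .map((s) => s.segment);

console.log(graphemes.length); // 15
console.log(words); // [ 'Reading', 'is', 'fun' ]
```

There is no option that would yield something like "Rea", "ding" for the first word; the next step up from individual characters is the whole word.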

Furthermore, UAs already know how to syllabify, since they had to implement it for CSS hyphenation, so exposing the algorithm via Intl.Segmenter should be comparatively low effort to implement.

I’m aware this has been requested before with line breaking as the main use case, which is better served by other technologies (e.g. canvas formatted text). However, a syllable is a low-level enough concept that it applies to many use cases beyond that. For example, my use case was about making an app for teaching small children to read. In many languages (not English), it can be simpler to read syllables first, then combine them into the actual word. I'd wager there are many other linguistic applications as well.

gibson042 commented 6 months ago

Thanks for reporting this; I don't think we've seen such a use case before! But it's worth noting that Intl.Segmenter leans pretty heavily on Unicode Standard Annex #29, which defines grapheme, word, and sentence segmentation (helpfully including default rules for each) but does not define syllable segmentation. And the CSS hyphens spec is intentionally opaque on the topic ("CSS Text Level 3 does not define the exact rules for hyphenation; however UAs are strongly encouraged to optimize their choice of break points and to chose language-appropriate hyphenation points")... do you know how browsers determine the boundaries in practice? I'm assuming it's related to Unicode Standard Annex #14, but note that there is a difference between a syllable boundary and a hyphenation opportunity (especially when text includes non-word characters). Ideally, this would first get added to Unicode, then represented in ICU and adopted in ECMA-402.
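To make the current state concrete, here is a small probe of which `granularity` values an engine accepts. It relies on the constructor throwing a `RangeError` for an unrecognized option value; the list of candidates is just the three existing values plus the one requested in this issue.

```javascript
// Try each candidate granularity and record the ones the engine accepts.
const candidates = ["grapheme", "word", "sentence", "syllable"];
const supported = [];

for (const granularity of candidates) {
  try {
    new Intl.Segmenter("en", { granularity });
    supported.push(granularity);
  } catch (error) {
    // An unsupported option value is reported as a RangeError;
    // anything else is unexpected, so rethrow it.
    if (!(error instanceof RangeError)) throw error;
  }
}

console.log(supported);
```

In engines that implement the current spec, only the three granularities backed by UAX #29 should be listed.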

See also https://github.com/unicode-org/icu4x/issues/164 .

LeaVerou commented 6 months ago

Thanks for explaining @gibson042! That makes total sense. It's also a bit of a bummer, because my whole proposal was predicated on the idea that it would be low effort. Defining syllable breaking in Unicode would require much stronger justification in terms of use case prevalence. Nevertheless, I'll do some digging on how browsers determine hyphenation boundaries and report back. It may be a good proxy, if it could be exposed.

fantasai commented 6 months ago

Hyphenation is typically done via language-specific dictionaries. If you want to create an API for it, you need to pass in the language, and expect it to fail for languages which don’t hyphenate or languages for which the dictionary is not installed.

It’s probably also worth reading https://www.w3.org/International/articles/typography/linebreak#sec_hyphenation . Hyphenation doesn’t only break up the word; it can also alter the spelling, so it’s a little unclear how that would interface with a segmenter.
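A rough sketch of both points, purely for illustration: `hyphenate` and `toyDictionaries` are hypothetical names with hand-written data, not an existing or proposed API. The lookup returns `null` when no dictionary exists for the language, and the German entry uses the traditional (pre-1996) orthography, where "Zucker" breaks as "Zuk-ker".

```javascript
// Hypothetical toy data, written by hand for this example. Real engines
// use per-language hyphenation dictionaries or patterns, not a lookup table.
const toyDictionaries = new Map([
  ["en", new Map([["hyphenation", ["hy", "phen", "ation"]]])],
  // Traditional German orthography: the "ck" becomes "k-k" at the break.
  ["de-1901", new Map([["Zucker", ["Zuk", "ker"]]])],
]);

// Hypothetical helper, not an existing API.
function hyphenate(word, locale) {
  const dictionary = toyDictionaries.get(locale);
  if (dictionary === undefined) return null; // no dictionary for this language
  return dictionary.get(word) ?? [word]; // unknown word: no break points
}

const english = hyphenate("hyphenation", "en");
const german = hyphenate("Zucker", "de-1901");
const missing = hyphenate("Zucker", "ja");

console.log(english.join("") === "hyphenation"); // true
console.log(german.join("") === "Zucker"); // false: the parts spell "Zukker"
console.log(missing); // null

// By contrast, Intl.Segmenter segments cover the input exactly,
// so joining them reproduces the original string.
const wordSegmenter = new Intl.Segmenter("de", { granularity: "word" });
const rejoined = Array.from(wordSegmenter.segment("Zucker"), (s) => s.segment).join("");
console.log(rejoined === "Zucker"); // true
```

The mismatch in the German case is the awkward part: a segmenter reports slices of the original string, while a hyphenator may need to return text that does not appear in the input.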

thibaudcolas commented 2 weeks ago

For what it’s worth – another common use case for this is readability scores that rely on syllable counts, such as Flesch-Kincaid. Those are frequently used to improve accessibility and to assess conformance to WCAG SC 3.1.5 Reading Level.
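As a sketch of what that looks like today: words and sentences can come from `Intl.Segmenter`, but the syllable count has to be approximated. The `estimateSyllables` function below is a naive, English-only assumption (it counts runs of vowel letters) and miscounts many words; the grade formula is the standard Flesch-Kincaid grade level, 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59.

```javascript
// Naive English-only heuristic (an assumption for illustration): count runs
// of vowel letters. It miscounts many words, e.g. those with a silent "e".
function estimateSyllables(word) {
  const groups = word.toLowerCase().match(/[aeiouy]+/g);
  return Math.max(1, groups ? groups.length : 0);
}

function segmentsOf(text, granularity) {
  const segmenter = new Intl.Segmenter("en", { granularity });
  return Array.from(segmenter.segment(text));
}

function fleschKincaidGrade(text) {
  // Ignore whitespace-only sentence segments, if any.
  const sentenceCount = segmentsOf(text, "sentence")
    .filter((s) => s.segment.trim() !== "").length;
  const words = segmentsOf(text, "word")
    .filter((s) => s.isWordLike)
    .map((s) => s.segment);
  if (sentenceCount === 0 || words.length === 0) return NaN;

  const syllableCount = words.reduce((sum, w) => sum + estimateSyllables(w), 0);
  return (
    0.39 * (words.length / sentenceCount) +
    11.8 * (syllableCount / words.length) -
    15.59
  );
}

const grade = fleschKincaidGrade("The cat sat. The dog ran.");
console.log(grade.toFixed(2)); // "-2.62"
```

A syllable granularity would replace the heuristic with something locale-aware, which is where the regex approach falls apart outside English.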