tc39 / proposal-intl-segmenter

Unicode text segmentation for ECMAScript
https://tc39.github.io/proposal-intl-segmenter/

Extensibility for non-ICU approaches? #134

Open nathanhammond opened 3 years ago

nathanhammond commented 3 years ago

Segmentation of character-based languages (those without a clear textual segmentation indicator) is a research problem in natural language processing/computational linguistics. Given that, a user will always achieve better segmentation quality (though not necessarily speed) in these languages with a custom-written segmenter than by delegating to ICU's BreakIterator.

Should we consider extensibility to address this limitation as a top-level concern of this API?

For context, I have begun implementing an NLP approach for Cantonese segmentation (https://github.com/cantonese/segmenter), but it reimplements the entire proposed API of Intl.Segmenter.
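
To make "reimplements the entire proposed API" concrete, here is a minimal sketch of what a user-land segmenter has to provide today. It assumes the API shape that `Intl.Segmenter` was eventually standardized with (`segment()` returning an iterable of `{ segment, index, input, isWordLike }`). The class name `DictionarySegmenter`, its `dictionary` option, and the sample word list are hypothetical, and the forward maximum-matching lookup is a toy stand-in. It is not the algorithm used by ICU or by the cantonese/segmenter project, and it omits parts of the real surface such as `containing()`.

```javascript
// Hypothetical user-land segmenter that mirrors the shape of Intl.Segmenter.
// The algorithm is a toy forward maximum-matching dictionary lookup, used
// only to illustrate how much API surface must be duplicated.
class DictionarySegmenter {
  constructor(locale, { granularity = "word", dictionary = [] } = {}) {
    this.locale = locale;
    this.granularity = granularity;
    this.words = new Set(dictionary);
    // The longest dictionary entry (in code points) bounds the lookahead window.
    this.maxLen = Math.max(1, ...dictionary.map((w) => Array.from(w).length));
  }

  resolvedOptions() {
    return { locale: this.locale, granularity: this.granularity };
  }

  segment(input) {
    const { words, maxLen } = this;
    const chars = Array.from(input); // split by code point, not UTF-16 unit
    return {
      *[Symbol.iterator]() {
        let pos = 0; // position in code points
        let index = 0; // position in UTF-16 code units, as Intl.Segmenter reports
        while (pos < chars.length) {
          let len = Math.min(maxLen, chars.length - pos);
          // Shrink the window until it matches a dictionary word or is one character.
          while (len > 1 && !words.has(chars.slice(pos, pos + len).join(""))) {
            len--;
          }
          const segment = chars.slice(pos, pos + len).join("");
          // Simplification: isWordLike here only means "found in the dictionary".
          yield { segment, index, input, isWordLike: words.has(segment) };
          pos += len;
          index += segment.length;
        }
      },
    };
  }
}

// Usage mirrors Intl.Segmenter: construct with a locale, then iterate segments.
const segmenter = new DictionarySegmenter("yue", {
  dictionary: ["香港", "香港人", "鍾意", "飲茶"], // illustrative entries only
});
const segments = [...segmenter.segment("香港人鍾意飲茶")];
console.log(segments.map((s) => s.segment).join(" | "));
// prints "香港人 | 鍾意 | 飲茶"
```

Everything except the inner matching loop is boilerplate that duplicates the built-in, which is the cost an extension point would remove.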

sffc commented 3 years ago

I understand that V8 and Gecko are considering switching to ML-based engines for some languages; @FrankYFTang and @zbraniecki can talk more about that.

FrankYFTang commented 3 years ago

> Should we consider extensibility to address this limitation as a top-level concern of this API?

Could you list the specific issues that such "extensibility" would need to address? In other words, in what respects would your NLP project be harder or easier to implement with a change to the current proposal?