Is text segmentation within the scope of this project?

jessegrosjean commented 7 years ago

I'm relatively new to Unicode, CLDR, etc.

My goal is to get text segmentation behavior as described in http://unicode.org/reports/tr29/ into my JavaScript app. If I'm reading things correctly it seems like text segmentation is appropriate, but not yet implemented in this project?

Assuming it is within scope, but isn't yet implemented, can someone provide a high level overview of how it could be implemented and an estimate of how hard it would be to do? Or alternatively, do you have suggestions on other JavaScript libraries that support text segmentation?

rxaviers commented 7 years ago

Hi @jessegrosjean

it seems like text segmentation is appropriate, but not yet implemented in this project?

Correct. It's not part of globalize, but it's a welcome feature.

My goal is to get text segmentation behavior as described in http://unicode.org/reports/tr29/

Excellent, that's the exact reference we should be looking at. 😄 Note that grapheme, word, and sentence segmentation are defined in UAX 29. Line breaking is defined in UAX 14.

can someone provide a high level overview of how it could be implemented and an estimate of how hard it would be to do?

With respect to globalize, the ideal sequence is: (a) API design (could be informal, e.g. an example), (b) TDD implementation, making the segmenter its own module (info about development here) and updating the documentation in the doc/ markdown files, (c) implementing the runtime module and globalize-compiler support accordingly. I can help with each of these steps...

This could be helpful:

https://github.com/tc39/proposal-intl-segmenter
The above ecma-402 proposal brings this prototype implementation https://gist.github.com/inexorabletash/8c4d869a584bcaa18514729332300356
https://github.com/unicode-cldr/cldr-segments-modern

Having said that, I don't know how hard it would be to implement it...
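For reference, the proposal linked above has since been standardized as Intl.Segmenter in ECMA-402. A minimal sketch of word segmentation with it, assuming a runtime that ships Intl.Segmenter; wordsOf is a hypothetical helper name, not part of globalize or of the proposal:

```javascript
// Hypothetical helper: returns the word-like segments of `text`,
// or null when the runtime has no Intl.Segmenter.
function wordsOf(text, locale = 'en') {
  if (typeof Intl === 'undefined' || typeof Intl.Segmenter !== 'function') {
    return null;
  }
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  const words = [];
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    // isWordLike is false for whitespace and punctuation segments.
    if (isWordLike) words.push(segment);
  }
  return words;
}

console.log(wordsOf('Hello, world!')); // [ 'Hello', 'world' ] where supported
```

The granularity option also accepts 'grapheme' and 'sentence', matching the three segmentation units of UAX 29.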

Or alternatively, do you have suggestions on other JavaScript libraries that support text segmentation?

This is where I keep a list of the i18n JS libraries. It certainly needs updates and so far it doesn't include segmentation: https://github.com/rxaviers/javascript-globalization

PS: cc @camertron who is presumably working on it given his questions on the CLDR mailing list and could potentially provide input.

jessegrosjean commented 7 years ago

@rxaviers Thanks for all the information.

The only extra bit that I've found is:

https://github.com/twitter/twitter-cldr-js

I haven't tested it, but it says it implements sentence segmentation according to http://www.unicode.org/reports/tr29/, though no other segmentation units yet.

I was hoping that I could magically just build up some quick regexes from the JSON data, but it looks a bit more involved than that. I'm building a hybrid JavaScript/native app, and expect that I'll just bridge back to native for now for text segmentation, but I'll keep watch for a pure JavaScript solution. It seems like the proposed Intl.Segmenter would be ideal for my needs, so I'll probably wait and see what happens there.

Thanks for answering my questions and sharing those links.

camertron commented 7 years ago

Thanks @rxaviers! @jessegrosjean I'm one of the maintainers of twitter-cldr-js, although I'm sad to say it hasn't been updated in quite some time. In fact, the segmentation implementation present in twitter-cldr-js was ported from the Ruby version (twitter-cldr-rb), which unfortunately wasn't 100% correct at the time. Since then, the Ruby version has been updated significantly and passes all but two of Unicode's segmentation tests. The JavaScript version really needs to be brought up to parity, but I just haven't had the time. Still, twitter-cldr-js and twitter-cldr-rb should be pretty good reference implementations should you decide to implement segmentation yourself.

Some things to keep in mind:

Unicode Regular Expressions

You mentioned your hope that segmentation would be as simple as building up some quick regexes, and you're not wrong about that. The challenge lies in compiling those regexes. Take for example rule 5 from the word break segmentation rules, CLDR v31:

$AHLetter × $AHLetter

Looks simple enough. The $AHLetter symbols are variables that you get from other parts of the specification, and the × symbol indicates that a break cannot occur between the two matching sides. Each $AHLetter expands into the following regex (newlines and indentation added for clarity):

[(
  \p{Word_Break=ALetter} 
  [
    \p{Word_Break=Format} \p{Word_Break=Extend} \p{Word_Break=ZWJ}
  ]*
) (
  \p{Word_Break=Hebrew_Letter}
  [
    \p{Word_Break=Format} \p{Word_Break=Extend} \p{Word_Break=ZWJ}
  ]*
)]

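As you can see, your regular expression engine needs to support Unicode properties - that's what the \p{Word_Break=xyz} bits mean. In JavaScript, that means bringing in a library or precomputing the character classes yourself: the ES6 u flag alone does not provide \p{...}, and the property escapes added in ES2018 cover General_Category, Script, Script_Extensions and binary properties, not Word_Break.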

In the Ruby implementation, compiling the various regexes is done via a series of character ranges. The ranges are compressed by the RangeSet class and then turned into a regex. Naturally some of them can be very large, but overall I've seen decent performance from this approach.
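The range-compression idea can be sketched in JavaScript roughly like this. It is not the RangeSet class from twitter-cldr-rb: compressRanges and toCharClass are hypothetical names, the input code points are an arbitrary sample rather than real Word_Break data, and the output assumes an engine that supports the ES6 u flag.

```javascript
// Collapse a list of code points into sorted, inclusive [start, end] ranges.
function compressRanges(codePoints) {
  const sorted = [...new Set(codePoints)].sort((a, b) => a - b);
  const ranges = [];
  for (const cp of sorted) {
    const last = ranges[ranges.length - 1];
    if (last && cp === last[1] + 1) {
      last[1] = cp; // extend the current range
    } else {
      ranges.push([cp, cp]); // start a new range
    }
  }
  return ranges;
}

// Render ranges as a character class using \u{...} escapes (needs the `u` flag).
function toCharClass(ranges) {
  const esc = (cp) => `\\u{${cp.toString(16).toUpperCase()}}`;
  const body = ranges
    .map(([start, end]) => (start === end ? esc(start) : `${esc(start)}-${esc(end)}`))
    .join('');
  return `[${body}]`;
}

// Arbitrary sample: a-c, z, and one astral code point (U+1D400).
const ranges = compressRanges([0x63, 0x61, 0x62, 0x7a, 0x1d400]);
const sampleClass = new RegExp(toCharClass(ranges), 'u');
console.log(toCharClass(ranges)); // [\u{61}-\u{63}\u{7A}\u{1D400}]
console.log(sampleClass.test('b'), sampleClass.test('d')); // true false
```

Generating the classes ahead of time like this trades a larger regex source for not needing \p{...} support at runtime.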

JavaScript Unicode Support

Aside from potential issues with Unicode properties, JavaScript also has issues representing the entire Unicode character set since it uses UTF-16 encoding everywhere (see this README for a full explanation). To produce a fully conformant segmentation implementation in JavaScript, you'll have to take UTF-16 surrogate pairs into consideration in your regexes as well.
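A short illustration of the surrogate pair issue. Without the u flag, . matches a single UTF-16 code unit, so it splits characters above U+FFFF in two; for engines without the u flag, such a code point has to be written into the regex as its two surrogates. toSurrogatePair is a hypothetical helper name.

```javascript
// Convert a code point above U+FFFF into its UTF-16 high/low surrogates.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xd800 + (offset >> 10);
  const low = 0xdc00 + (offset & 0x3ff);
  return [high, low];
}

const emoji = '\u{1F600}'; // one code point, two UTF-16 code units

console.log(emoji.length);              // 2
console.log(emoji.match(/./g).length);  // 2: each surrogate is matched separately
console.log(emoji.match(/./gu).length); // 1: the `u` flag matches whole code points

const [high, low] = toSurrogatePair(0x1f600);
console.log(high.toString(16), low.toString(16)); // d83d de00
```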

Implicit Rules

UAX 29 specifies two implicit segmentation rules that turn out to be pretty important:

The implicit "final" rule is Any ÷ Any, implemented in the twitter-cldr-* libraries as /. ÷ ./
The implicit end-of-text rule is ÷ <eos> implemented as /.\z ÷/
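To make the rule semantics concrete, here is a toy sketch of how ordered rules of the form left × right and left ÷ right might be evaluated at each candidate boundary, falling through to the implicit Any ÷ Any. It is not the twitter-cldr implementation: the two no-break rules use ASCII letters and digits as stand-ins for the real Word_Break classes, and it walks UTF-16 code units, ignoring surrogate pairs for brevity.

```javascript
// Each rule: `left` must match the text ending at the boundary and `right`
// must match the text starting at it. The first matching rule decides.
const rules = [
  { left: /[A-Za-z]$/, right: /^[A-Za-z]/, isBreak: false }, // letter × letter
  { left: /[0-9]$/, right: /^[0-9]/, isBreak: false },       // digit × digit
  { left: /[\s\S]$/, right: /^[\s\S]/, isBreak: true },      // implicit: Any ÷ Any
];

function segment(text) {
  const pieces = [];
  let start = 0;
  for (let i = 1; i < text.length; i++) {
    const before = text.slice(0, i);
    const after = text.slice(i);
    // The final Any ÷ Any rule guarantees that some rule always matches.
    const rule = rules.find((r) => r.left.test(before) && r.right.test(after));
    if (rule.isBreak) {
      pieces.push(text.slice(start, i));
      start = i;
    }
  }
  // Implicit end-of-text rule: always break at the end of the input.
  if (start < text.length) pieces.push(text.slice(start));
  return pieces;
}

console.log(segment('ab 12c')); // [ 'ab', ' ', '12', 'c' ]
```

Rule order matters here: moving the Any ÷ Any rule to the top would break between every pair of code units.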

ULI Exceptions

The CLDR contains a collection of ULI (Unicode Localization Interoperability) segmentation exceptions that avoid breaking after periods that come after abbreviations, e.g. "Mr.", "Mrs.", "Dr.", etc. They are specified for a number of languages. While not required, your implementation will probably want to take these rules into account.
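A rough sketch of how such exceptions might suppress a sentence break. Everything here is an assumption for illustration: the exception list is just the three abbreviations mentioned above rather than actual CLDR data, and the "break after . ! ? followed by whitespace" test is a naive stand-in for the real UAX 29 sentence rules.

```javascript
// Hypothetical exception list; real data would come from the CLDR segments files.
const exceptions = ['Mr.', 'Mrs.', 'Dr.'];

function sentences(text) {
  const result = [];
  let start = 0;
  // Naive stand-in for the sentence rules: a terminator followed by whitespace.
  const candidate = /[.!?]\s+/g;
  let match;
  while ((match = candidate.exec(text)) !== null) {
    const soFar = text.slice(start, match.index + 1); // up to and including the terminator
    // Suppress the break when the sentence so far ends with a listed abbreviation
    // that is preceded by whitespace or by the start of the sentence.
    const suppressed = exceptions.some((abbr) => {
      if (!soFar.endsWith(abbr)) return false;
      const prev = soFar[soFar.length - abbr.length - 1];
      return prev === undefined || /\s/.test(prev);
    });
    if (suppressed) continue;
    const breakAt = match.index + match[0].length;
    result.push(text.slice(start, breakAt));
    start = breakAt;
  }
  if (start < text.length) result.push(text.slice(start));
  return result;
}

console.log(sentences('Dr. Smith met Mrs. Jones. They talked.'));
// [ 'Dr. Smith met Mrs. Jones. ', 'They talked.' ]
```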

My Recommendation

OK, so I wrote all of this to give you some idea of how you might implement segmentation on your own. In my opinion, however, it's much easier to take the compiled regexes from twitter-cldr-rb and use them in your JavaScript library (although you might have to do some surrogate pair munging like I described above). That's probably what I should do for twitter-cldr-js going forward too 😉

Here's how you might generate the regexes from twitter-cldr-rb (you'd need to have Ruby installed):

require 'twitter_cldr'

# Note that 'en' (English locale) isn't strictly necessary here, since the rule sets don't depend
# on locale to function. The locale is passed here for use by the part of the code that handles
# the aforementioned ULI exceptions.
rule_set = TwitterCldr::Segmentation::RuleSet.load('en', 'word')

# Print each rule so the regexes can be copied into another project.
rule_set.rules.each do |rule|
  puts "break: #{rule.break?}"                   # true/false, i.e. whether or not this rule indicates a break
  puts "left:  #{rule.left.to_regexp.inspect}"   # regexp for the left-hand side of the rule
  puts "right: #{rule.right.to_regexp.inspect}"  # regexp for the right-hand side of the rule
end

I know this is a lot! Feel free to ping me with any questions. For some reason I love this stuff.

globalizejs / globalize

Is text segmentation within the scope of this project? #728