twitter / twitter-cldr-rb

Ruby implementation of the ICU (International Components for Unicode) that uses the Common Locale Data Repository to format dates, plurals, and more.
Apache License 2.0

Segmentation Refactor #179

Closed camertron closed 8 years ago

camertron commented 8 years ago

I needed word segmentation for another feature I'm working on and decided to add support for it to our segmentation implementation. Doing so ended up being a real challenge because our current implementation doesn't correctly identify all the boundaries it should. It also isn't tested against the set of test cases published by the Unicode Consortium. I ended up shaving the whole yak:

  1. Moved everything into the TwitterCldr::Segmentation namespace, including the rule parser.
  2. Introduced the concept of a "cursor" which holds the intermediate state of a boundary identification operation.
  3. Added several implicit rules as described in UAX 29.
  4. Added conformance tests that utilize Unicode's published test cases.
  5. Updated properties and blocks to Unicode v6.3.0.
  6. Refactored a bunch of the rule matching logic.
  7. Removed our own custom tailorings. Turns out when the implementation is correct, tailorings aren't necessary :wink:
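
To give a rough idea of what a "cursor" (item 2) means here, below is a minimal sketch. The `Cursor` struct, the `toy_boundaries` method, and the single toy rule are all hypothetical names made up for illustration; they are not necessarily what the code in this PR looks like, and the rule is far simpler than anything in UAX 29.

```ruby
# Hypothetical sketch: a cursor that carries the intermediate state
# (the text and the current position) of one boundary scan.
Cursor = Struct.new(:text, :position) do
  # True once the scan has consumed the whole string.
  def eos?
    position >= text.length
  end

  def advance(amount = 1)
    self.position += amount
  end
end

# Toy rule (an assumption, not a UAX 29 rule): a boundary falls wherever
# the text switches between alphanumeric and non-alphanumeric characters.
def toy_boundaries(text)
  cursor = Cursor.new(text, 0)
  boundaries = [0] # start of text is always a boundary

  until cursor.eos?
    cursor.advance
    break if cursor.eos?

    left  = text[cursor.position - 1].match?(/[[:alnum:]]/)
    right = text[cursor.position].match?(/[[:alnum:]]/)
    boundaries << cursor.position if left != right
  end

  boundaries << text.length unless text.empty? # end of text
  boundaries
end
```

Calling `toy_boundaries("hi there")` yields the offsets at which the string would be split. Keeping the position in a separate object means the rule-matching code doesn't have to thread offsets through every call.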

Note: Two special Unicode test cases do not currently pass for word segmentation. They fail because conformant implementations are apparently expected to be able to match a partial regular expression, which Ruby's regex engine can't do. In the future, it might be worth adopting ICU's rule matching approach, which is state machine-based. Only two of the 1372 cases fail, which I thought was good enough.
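
For context on the conformance tests (item 4), Unicode's break test files such as `WordBreakTest.txt` list one case per line: hex code points separated by `÷` (a break is allowed here) or `×` (no break here), followed by a `#` comment. Here's a rough sketch of how one line could be turned into a string plus its expected boundary offsets. `parse_break_test_line` is a hypothetical helper written for illustration, not necessarily how the specs in this PR read the files.

```ruby
# Hypothetical helper: parses one line of a Unicode break-test file.
BREAK    = "\u00F7" # the division sign marks an allowed break
NO_BREAK = "\u00D7" # the multiplication sign marks "no break here"

# Returns [text, boundary_offsets], or nil for blank/comment-only lines.
def parse_break_test_line(line)
  data = line.split("#", 2).first.to_s.strip # drop the trailing comment
  return nil if data.empty?

  text = +""
  boundaries = []

  data.split(/\s+/).each do |token|
    case token
    when BREAK
      boundaries << text.length # offset counted in code points
    when NO_BREAK
      # nothing to record: no boundary at this position
    else
      text << [token.to_i(16)].pack("U") # hex code point to UTF-8
    end
  end

  [text, boundaries]
end
```

A conformance spec can then compare the offsets an implementation reports for `text` against `boundaries` for every line in the file.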

camertron commented 8 years ago

Any idea when you'll be able to take a look @KL-7?

KL-7 commented 8 years ago

Hey, sorry about the delay. I looked through the diff and it all looks good to me overall. I'm not that familiar with the segmentation algorithm, so I'll trust you with the implementation details.

camertron commented 8 years ago

Awesome, thank you!