Segmentation Refactor - Githubissues

camertron commented 8 years ago

I needed word segmentation for another feature I'm working on and decided to add support for it to our segmentation implementation. That turned out to be a real challenge because our current implementation doesn't correctly identify all the boundaries it should, and it isn't run against the set of test cases published by the Unicode Consortium. I ended up shaving the whole yak:

Moved everything into the TwitterCldr::Segmentation namespace, including the rule parser.
Introduced the concept of a "cursor" which holds the intermediate state of a boundary identification operation.
Added several implicit rules as described in UAX #29.
Added conformance tests that utilize Unicode's published test cases.
Updated properties and blocks to Unicode v6.3.0.
Refactored a bunch of the rule matching logic.
Removed our own custom tailorings. Turns out when the implementation is correct, tailorings aren't necessary :wink:
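
To illustrate the "cursor" idea from the list above, here is a minimal sketch, not the code from this PR. The `Cursor` class, its methods, and `toy_word_boundaries` are hypothetical names, and the break rule is a deliberately simplified stand-in for the UAX #29 rules (it only breaks where an alphanumeric run meets anything else).

```ruby
# Hypothetical cursor: holds the intermediate state of one boundary scan,
# i.e. the text, the current position, and the boundaries found so far.
class Cursor
  attr_reader :text, :position, :boundaries

  def initialize(text)
    @text = text
    @position = 0
    @boundaries = [0] # a break always occurs at the start of the text
  end

  # True once the scan has consumed the whole text.
  def eos?
    position >= text.length
  end

  def advance(count = 1)
    @position += count
  end

  # Record a boundary at the current position, skipping duplicates.
  def mark_boundary
    @boundaries << position unless @boundaries.last == position
  end
end

# Toy rule (an assumption for illustration, NOT a UAX #29 rule):
# break wherever an alphanumeric character meets a non-alphanumeric one.
def toy_word_boundaries(text)
  cursor = Cursor.new(text)

  until cursor.eos?
    current_is_word = text[cursor.position].match?(/[[:alnum:]]/)
    cursor.advance
    break if cursor.eos?

    next_is_word = text[cursor.position].match?(/[[:alnum:]]/)
    cursor.mark_boundary if current_is_word != next_is_word
  end

  cursor.mark_boundary # a break always occurs at the end of the text
  cursor.boundaries
end

p toy_word_boundaries("hi there")
```

Keeping the scan state in one object means the rule-matching code only has to decide "break or no break" at the current position; it does not need to track offsets itself.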

Note: Two special Unicode test cases do not currently pass for word segmentation. They fail because conformant implementations are apparently expected to be able to match a partial regular expression, which Ruby can't do. In the future, it might be worth adopting ICU's rule matching approach, which is based on a state machine. Only two cases out of 1372 fail, which I thought was good enough.
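
For readers unfamiliar with the published test cases counted here: the Unicode break test files (for example `WordBreakTest.txt`) list one case per line as hex code points separated by `÷` (break expected) and `×` (no break), followed by a `#` comment. The sketch below shows one way such a line could be parsed. `parse_test_line` and its return shape are hypothetical and are not the helpers used in this PR.

```ruby
BREAK    = "\u00F7" # ÷ : a boundary is expected at this position
NO_BREAK = "\u00D7" # × : no boundary is allowed at this position

# Parses one line of a Unicode break test file.
# Returns nil for blank or comment-only lines, otherwise a hash with the
# test string and the expected boundary offsets (counted in code points).
def parse_test_line(line)
  data = line.split("#", 2).first.to_s.strip
  return nil if data.empty?

  codepoints = []
  boundaries = []

  data.split(/\s+/).each do |token|
    case token
    when BREAK    then boundaries << codepoints.length
    when NO_BREAK then next
    else codepoints << token.hex # anything else is a hex code point
    end
  end

  { text: codepoints.pack("U*"), boundaries: boundaries }
end

sample = "\u00F7 0061 \u00D7 0308 \u00F7 0020 \u00F7\t#  sample comment"
p parse_test_line(sample)
```

A conformance spec could then compare the `boundaries` array against whatever the segmentation code reports for `text` and tally the failures.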

camertron commented 8 years ago

Any idea when you'll be able to take a look @KL-7?

KL-7 commented 8 years ago

Hey, sorry about the delay. I looked through the diff and it all looks good to me overall. I'm not that familiar with the segmentation algorithm, so I'll trust you with the implementation details.

camertron commented 8 years ago

Awesome, thank you!

twitter / twitter-cldr-rb

Segmentation Refactor #179