Closed camertron closed 8 years ago
Any idea when you'll be able to take a look @KL-7?
Hey, sorry about the delay. I looked through the diff and it all looks good to me overall. I'm not that familiar with the segmentation algorithm, so I'll trust you with the implementation details.
Awesome, thank you!
I needed word segmentation for another feature I'm working on and decided to add support for it in our segmentation implementation. Doing so ended up being a real challenge because our current implementation doesn't correctly identify all the boundaries it should. It also isn't run against the set of test cases published by the Unicode Consortium. I ended up shaving the whole yak:
TwitterCldr::Segmentation namespace, including the rule parser.

Note: Two special Unicode test cases do not currently pass for word segmentation. They fail because apparently conformant implementations are expected to be able to match a partial regular expression, which Ruby can't do. In the future, it might be worth adopting ICU's rule matching approach, which is state machine-based. Only two cases fail out of 1372 total, which I thought was good enough.
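
For anyone unfamiliar with the Unicode Consortium's test cases mentioned above, here is a rough sketch of how one line of a break test file (such as WordBreakTest.txt) can be turned into an input string and its expected segments. In those files, ÷ marks a position where a break is expected, × marks a position where no break is allowed, and everything after # is a comment. The method name `parse_break_test_line` is hypothetical and is not taken from this PR or from TwitterCldr; this only illustrates the file format, not how the specs in this PR are actually wired up.

```ruby
# Hypothetical helper, not part of TwitterCldr: parses one line of a Unicode
# break test file into the input text and the segments a conformant
# implementation is expected to produce.
BOUNDARY    = "\u00F7" # "÷": a break is expected at this position
NO_BOUNDARY = "\u00D7" # "×": no break is allowed at this position

def parse_break_test_line(line)
  # Drop the trailing comment, then surrounding whitespace.
  data = line.split("#", 2).first.to_s.strip
  return nil if data.empty? # blank or comment-only line

  segments = []
  current = +""

  data.split(/\s+/).each do |token|
    case token
    when BOUNDARY
      # Close the current segment, if any.
      segments << current unless current.empty?
      current = +""
    when NO_BOUNDARY
      # Stay inside the current segment.
    else
      # Any other token is a hexadecimal code point.
      current << [token.hex].pack("U")
    end
  end
  segments << current unless current.empty?

  { text: segments.join, segments: segments }
end

# Usage: "a" followed by a combining diaeresis stays together; the space
# forms its own segment.
example = parse_break_test_line("\u00F7 0061 \u00D7 0308 \u00F7 0020 \u00F7\t# sample comment")
p example[:segments]
```

A spec could then feed `text` to the segmenter under test and compare its output to `segments`, which is roughly what running against the published test cases amounts to.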