This PR adds tests for tokenizers using https://github.com/nltk/nltk/blob/develop/nltk/test/tokenize.doctest as a reference.

The test details are:
1. 10 word tokenizer tests (8 pass, 2 fail). The incorrect tokenizations are:
   a. "Dr." ==> "Dr", ".". Expected ==> "Dr."
   b. "3:00" ==> "3", ":", "00". Expected ==> "3:00"
2. 3 custom regular expression tokenizer tests. Compared to the NLTK tests, the tests for regular expressions with named groups and back references are skipped.
3. Simple sentence splitter test.

Open Questions
1. Do we have implementations for tokenizers whose regexes contain named groups or back references? If not, are there any plans to implement them?
2. Also, NLTK itself does not support back references. So if we add this, should we actually support back references, or just report that they are unsupported, as NLTK does? :-(

Related to this, cl-ppcre supports named groups and back references; see http://weitz.de/cl-ppcre/#*allow-named-registers*. (After all, it's an Edi Weitz library!)
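
For anyone unfamiliar with the two regex features in question, here is a minimal sketch using Python's standard `re` module. It is not this project's tokenizer, and `paired_tags`, `findall_tokens`, and the sample string are hypothetical names made up for illustration. It shows a named group, a back reference to it, and how a capturing group can change what a findall-style tokenizer returns, which is roughly why such patterns need special handling.

```python
import re

# Named group: (?P<tag>...) captures the tag letter under the name "tag".
# Back reference: (?P=tag) must match the same text the named group matched.
PAIRED_TAG = re.compile(r"<(?P<tag>[bp])>(?P<body>[^<]*)</(?P=tag)>")


def paired_tags(text):
    """Return (tag, body) for each element whose closing tag matches its opening tag."""
    return [(m.group("tag"), m.group("body")) for m in PAIRED_TAG.finditer(text)]


def findall_tokens(pattern, text):
    """Hypothetical findall-based tokenizer, for illustration only."""
    return re.findall(pattern, text)


sample = "<b>bold</b> <p>para</b>"

# Only the first element is matched: the second opens with <p> but closes with </b>.
print(paired_tags(sample))

# Without a capturing group, findall returns the whole matches.
print(findall_tokens(r"</?[bp]>", sample))

# With a (named) capturing group, findall returns only the group contents.
print(findall_tokens(r"</?(?P<tag>[bp])>", sample))
```

If we decide to support named groups, the third call is the kind of case worth covering with a test, since the tokens silently become the group contents rather than the full matches.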