vseloved / cl-nlp

Common Lisp NLP toolset

Add tests for tokenizers #6

Closed dmsurti closed 9 years ago

dmsurti commented 9 years ago

This PR adds tests for tokenizers using https://github.com/nltk/nltk/blob/develop/nltk/test/tokenize.doctest as a reference.

The test details are:

  1. 10 word tokenizer tests (8 pass, 2 fail). The incorrect tokenizations are:
     a. "Dr." ==> "Dr", ".". Expected ==> "Dr."
     b. "3:00" ==> "3", ":", "00". Expected ==> "3:00"
  2. 3 custom regular expression tokenizer tests. Unlike the NLTK tests, the tests for regular expressions with named groups and back references are skipped.
  3. A simple sentence splitter test.
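
To make the two expected tokenizations concrete, here is a minimal sketch of a regex-based word tokenizer that keeps "Dr." and "3:00" as single tokens. It is not cl-nlp's tokenizer: the names `*sketch-token-regex*` and `sketch-tokenize`, the hard-coded list of titles, and the use of Quicklisp to load cl-ppcre are all assumptions made for illustration.

```lisp
;; Assumes Quicklisp is installed; evaluate form by form (REPL or LOAD),
;; since the PPCRE package only exists after the first form has run.
(ql:quickload :cl-ppcre)

(defparameter *sketch-token-regex*
  ;; Alternatives are tried left to right, so the special cases
  ;; (titles ending in a period, colon-separated numbers) come first.
  (ppcre:create-scanner
   "(?:Dr|Mr|Mrs|Ms)\\.|\\d+(?::\\d+)+|\\w+|[^\\w\\s]")
  "Hypothetical pattern: titles, times like 3:00, words, single punctuation marks.")

(defun sketch-tokenize (text)
  "Return the list of tokens in TEXT matched by *SKETCH-TOKEN-REGEX*."
  (ppcre:all-matches-as-strings *sketch-token-regex* text))
```

With these definitions, `(sketch-tokenize "Dr. Smith left at 3:00.")` returns `("Dr." "Smith" "left" "at" "3:00" ".")`. A real fix would probably need a proper abbreviation list rather than four hard-coded titles.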

Open Questions

  1. Do we have implementations for tokenizers whose regex contains named groups or back references? If not, are there any plans to implement them?
  2. Also, NLTK does not actually support back references. So should we actually support them, or just signal the lack of support like NLTK does :-( ?

Related to this: cl-ppcre supports named groups and back references, see http://weitz.de/cl-ppcre/#*allow-named-registers*. (After all, it's an Edi Weitz library!)
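
As a rough illustration of that cl-ppcre feature, the sketch below uses a named register and a back reference to it to find an immediately repeated word. The function name `find-doubled-word` is hypothetical, and loading cl-ppcre through Quicklisp is an assumption. The regex is passed through a variable and the scanner is created inside the binding, because `*allow-named-registers*` has to be true at the moment the regex string is parsed.

```lisp
;; Assumes Quicklisp is installed; evaluate form by form (REPL or LOAD).
(ql:quickload :cl-ppcre)

(defun find-doubled-word (text)
  "Return the first immediately repeated word in TEXT, or NIL if there is none."
  (let* ((regex "\\b(?<word>\\w+)\\s+\\k<word>\\b")
         ;; Named registers are only recognized while this variable is true,
         ;; so bind it around CREATE-SCANNER, where the string is parsed.
         (scanner (let ((ppcre:*allow-named-registers* t))
                    (ppcre:create-scanner regex))))
    (multiple-value-bind (match registers)
        (ppcre:scan-to-strings scanner text)
      ;; Named registers are still numbered, so "word" is register 0.
      (when match
        (aref registers 0)))))
```

For example, `(find-doubled-word "this is is a test")` returns `"is"`, and `(find-doubled-word "no repeats here")` returns `NIL`.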