rth / vtext

Simple NLP in Rust with Python bindings

Fine-tune tokenizers #80

Open rth opened 4 years ago

rth commented 4 years ago

It can happen that the tokenization results are unsatisfactory in some way, and the question is what the mechanism to customize/improve them should be. It could be either:

  a) adding options that make these improvements optional in the tokenizer. The issue is that some of these improvements might be relevant to multiple tokenizers (so the same option would need to be duplicated across them).
  b) adding a new step later in the pipeline. That's probably the best way to allow arbitrary customization. The issue is that some steps might be specific to the previous step, and adding them in the library might be confusing.

There is probably a balance that needs to be found between the two.
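
For illustration, option b) could look like a thin wrapper that chains arbitrary post-processing functions after any tokenizer. This is only a sketch; TokenProcessor and its steps argument are hypothetical names, not part of the current vtext API.

    from typing import Callable, List

    # Sketch of option b): a generic post-processing step that can be chained
    # after any tokenizer. TokenProcessor and `steps` are hypothetical names,
    # not existing vtext classes/parameters.
    class TokenProcessor:
        def __init__(self, tokenizer, steps: List[Callable[[List[str]], List[str]]]):
            self.tokenizer = tokenizer
            self.steps = steps

        def tokenize(self, text: str) -> List[str]:
            tokens = self.tokenizer.tokenize(text)
            for step in self.steps:
                tokens = step(tokens)
            return tokens

    # Option a) would instead push the same logic into constructor options,
    # e.g. something like PunctuationTokenizer(min_sentence_length=4)
    # (a hypothetical parameter, shown only for contrast).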

For instance,

  1. PunctuationTokenizer,
    • currently doesn't take into account repeated punctuation
      >>> PunctuationTokenizer().tokenize("test!!!")                                                                                                     
      ['test!', '!', '!']
    • will tokenize abbreviations separated by . as separate sentences
      >>> PunctuationTokenizer().tokenize("W.T.O.")
      ['W.', 'T.', 'O.']

      Both could probably be addressed by adding an option to force sentences to be longer than some minimal length (and otherwise append them to the previous token); see the sketch after this list.

  2. UnicodeSentenceTokenizer will not split sentences separated by punctuation without a following space, e.g.,
    >>> UnicodeSentenceTokenizer().tokenize('One sentence.Another sentence.')
    ['One sentence.Another sentence.']

    That's a very common occurrence in actual text, and I think a workaround should be found (e.g. using an additional tokenization pass with a regex/punctuation tokenizer, as sketched below).
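
A minimal sketch of both workarounds, written as plain post-processing functions over the token lists shown above (the function names, length threshold and regex are only illustrative, not existing vtext API):

    import re
    from typing import List

    def merge_short_sentences(sentences: List[str], min_len: int = 4) -> List[str]:
        """Append sentences shorter than `min_len` characters to the previous one,
        so trailing '!' fragments or 'W.'/'T.'/'O.' pieces are merged back."""
        merged: List[str] = []
        for sent in sentences:
            if merged and len(sent) < min_len:
                merged[-1] += sent
            else:
                merged.append(sent)
        return merged

    def split_missing_space(sentences: List[str]) -> List[str]:
        """Second pass: split where sentence-ending punctuation is directly
        followed by an uppercase letter with no space in between."""
        out: List[str] = []
        for sent in sentences:
            out.extend(re.split(r"(?<=[.!?])(?=[A-Z])", sent))
        return out

    print(merge_short_sentences(["test!", "!", "!"]))    # ['test!!!']
    print(merge_short_sentences(["W.", "T.", "O."]))     # ['W.T.O.']
    print(split_missing_space(["One sentence.Another sentence."]))
    # ['One sentence.', 'Another sentence.']

Both functions would fit naturally as steps in a post-processing wrapper like the one sketched earlier.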

Generally it would be good to add some benchmarks for sentence tokenization to the evaluation/ folder.

  1. UnicodeTokenizer is currently extended in VTextTokenizer (for lack of a better name), with a few additional rules. Maybe this could have been a separate token-processing step, particularly if one imagines that more rules could be added (or potentially even an ML model used).
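
For illustration, such a rule could be written as a plain function over the token list and chained after UnicodeTokenizer through a post-processing step like the one sketched earlier. The rule below (re-joining hyphenated compounds) is a made-up example of the shape such rules could take, not one of VTextTokenizer's actual rules:

    from typing import List

    # Hypothetical rule expressed as a standalone token-processing function;
    # only an illustration, not one of VTextTokenizer's real rules.
    def merge_hyphenated(tokens: List[str]) -> List[str]:
        """Join sequences like ['state', '-', 'of', '-', 'the', '-', 'art']
        back into a single 'state-of-the-art' token."""
        out: List[str] = []
        for tok in tokens:
            if out and (tok == "-" or out[-1].endswith("-")):
                out[-1] += tok
            else:
                out.append(tok)
        return out

    print(merge_hyphenated(["state", "-", "of", "-", "the", "-", "art"]))
    # ['state-of-the-art']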