rth / vtext

Simple NLP in Rust with Python bindings
Apache License 2.0

Add CharacterTokenizer #45

Closed by rth 5 years ago

rth commented 5 years ago

Adds a character tokenizer, which splits text into overlapping windows of window_size characters:

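    from vtext.tokenize import CharacterTokenizer

    # Each token is a window of 4 characters, shifted by one character
    tokenizer = CharacterTokenizer(window_size=4)
    assert tokenizer.tokenize("fox can't") == [
        "fox ", "ox c", "x ca", " can", "can'", "an't"]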
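
For readers who want to see the windowing logic spelled out, here is a minimal pure-Python sketch that reproduces the output above. It is not the Rust implementation from this PR. The function name `character_windows` is hypothetical, and returning an empty list for inputs shorter than the window is an assumption that may differ from what the library does.

```python
def character_windows(text: str, window_size: int = 4) -> list[str]:
    """Return every run of `window_size` consecutive characters in `text`.

    Hypothetical helper for illustration only; not part of vtext.
    """
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    # Python strings index by Unicode code point, so each slice holds
    # `window_size` characters rather than bytes.
    # Assumption: inputs shorter than the window produce no tokens
    # (the range below is empty in that case).
    return [
        text[start:start + window_size]
        for start in range(len(text) - window_size + 1)
    ]


tokens = character_windows("fox can't", window_size=4)
print(tokens)
```

A string of n characters yields n - window_size + 1 tokens, so the 9-character input above produces 6 windows.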