rth / vtext

Simple NLP in Rust with Python bindings
Apache License 2.0

Character n-grams #40

Open rth opened 5 years ago

rth commented 5 years ago

Allowing documents to be tokenized into character n-grams would be useful.

rth commented 5 years ago

Partially addressed in #45

joshlk commented 4 years ago

I could look into implementing an n-gram and skip-gram iterator? Something similar to the util functions in NLTK, http://www.nltk.org/_modules/nltk/util.html#ngrams, for both characters and words (#2).
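To make the proposal concrete, here is a minimal pure-Python sketch of what such iterators could look like. It is not vtext's API and not NLTK's code: the names `ngrams` and `skipgrams` and their signatures are assumptions chosen for illustration, and padding is deliberately left out. Both work on any sequence, so a string gives character n-grams and a list of tokens gives word n-grams.

```python
from itertools import combinations


def ngrams(sequence, n):
    """Yield contiguous n-grams as tuples, without any padding (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be >= 1")
    items = list(sequence)
    # A sequence shorter than n yields nothing.
    for i in range(len(items) - n + 1):
        yield tuple(items[i:i + n])


def skipgrams(sequence, n, k):
    """Yield n-grams whose items span at most n + k consecutive positions.

    In other words, up to k items in total may be skipped inside each
    n-gram (hypothetical helper, no padding).
    """
    if n < 1 or k < 0:
        raise ValueError("n must be >= 1 and k must be >= 0")
    items = list(sequence)
    for i, head in enumerate(items):
        # Candidates for the remaining n - 1 positions, in original order.
        window = items[i + 1:i + n + k]
        for rest in combinations(window, n - 1):
            yield (head, *rest)


# Characters: pass a string. Words: pass a list of tokens.
char_bigrams = list(ngrams("abcd", 2))
word_trigrams = list(ngrams("the quick brown fox".split(), 3))
char_skip_bigrams = list(skipgrams("abcd", 2, 1))
```

With `k = 0` the skip-gram iterator reduces to plain contiguous n-grams, which is one way the two could share an implementation.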

rth commented 4 years ago

Thanks @joshlk, that would be very useful! Maybe without the rightpad/leftpad options for a start? It would also be interesting to have something that works with an ngram_range parameter, as in scikit-learn's CountVectorizer:

The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.

Though it is not clear how this parameter would extend to skip-grams.
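For reference, the ngram_range semantics quoted above can be sketched for character n-grams in a few lines of Python. This is only an illustration under stated assumptions: `char_ngrams` is a hypothetical name, not a vtext or scikit-learn function, and it skips the preprocessing and padding that a real vectorizer may apply.

```python
def char_ngrams(text, ngram_range=(1, 1)):
    """Yield character n-grams for every n with min_n <= n <= max_n.

    Hypothetical helper: mirrors only the ngram_range semantics quoted
    above, with no padding and no text normalization.
    """
    min_n, max_n = ngram_range
    if min_n < 1 or max_n < min_n:
        raise ValueError("ngram_range must satisfy 1 <= min_n <= max_n")
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the text.
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


unigrams_and_bigrams = list(char_ngrams("abc", ngram_range=(1, 2)))
only_bigrams = list(char_ngrams("abc", ngram_range=(2, 2)))
```

A skip-gram variant would need at least one more parameter (the number of allowed skips), which is why a single (min_n, max_n) pair does not carry over directly.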

There is also the question of how to chain tokenization and n-gram iterators: https://github.com/rth/vtext/issues/21

joshlk commented 4 years ago

PR: #82

Please take a look when you get a chance.