Open rth opened 5 years ago
Partially addressed in #45
I could look into implementing an ngram and skipgram iterator, similar to the util functions in NLTK (http://www.nltk.org/_modules/nltk/util.html#ngrams), for both characters and words (#2).
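To make the proposal concrete, here is a minimal Python sketch of what such iterators could look like. It is not vtext's API: the function names `ngrams` and `skipgrams` and their parameters are assumptions for illustration, the behaviour is loosely modelled on the NLTK helpers linked above, and there is no padding.

```python
from itertools import combinations
from typing import Iterator, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> Iterator[Tuple[str, ...]]:
    """Yield contiguous n-grams from `tokens`, without padding."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def skipgrams(tokens: Sequence[str], n: int, k: int) -> Iterator[Tuple[str, ...]]:
    """Yield n-grams whose tokens span a window of at most n + k tokens.

    Each result starts at a head token and picks the remaining n - 1
    tokens, in order, from the next n + k - 1 tokens.
    """
    for i in range(len(tokens) - n + 1):
        head = tokens[i]
        # The slice is simply shorter near the end of the sequence.
        tail = tokens[i + 1:i + n + k]
        for rest in combinations(tail, n - 1):
            yield (head,) + tuple(rest)
```

With `tokens = ["a", "b", "c", "d"]`, `ngrams(tokens, 2)` yields the three adjacent pairs, while `skipgrams(tokens, 2, 1)` additionally yields `("a", "c")` and `("b", "d")`. Since a string is a sequence of characters, the same functions also work at the character level.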
Thanks @joshlk, that would be very useful! Maybe without the rightpad/leftpad options for a start? It would also be interesting to have something that works with an ngram_range parameter, as in scikit-learn's CountVectorizer:
The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.
Though the extension of this parameter to skip grams is not clear.
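For plain n-grams, the ngram_range semantics quoted above could be sketched as follows. This is only an illustration in Python: `ngrams_in_range` is a hypothetical name, not an existing vtext or scikit-learn function, and emitting all n-grams of one size before moving to the next size is an arbitrary choice. Skip-grams are deliberately left out, since their interaction with the range is the open question.

```python
from typing import Iterator, Sequence, Tuple


def ngrams_in_range(
    tokens: Sequence[str], ngram_range: Tuple[int, int]
) -> Iterator[Tuple[str, ...]]:
    """Yield all n-grams with min_n <= n <= max_n, without padding."""
    min_n, max_n = ngram_range
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])
```

For example, `ngrams_in_range(["a", "b", "c"], (1, 2))` yields the three unigrams followed by the two bigrams, and a range of `(2, 2)` yields bigrams only.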
There is also the question of how to chain tokenization and n-gram iterators: https://github.com/rth/vtext/issues/21
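One possible shape for such chaining, sketched in Python with lazy generators: the n-gram step consumes any token iterator through a sliding window, so it never needs the full token list in memory. The regex tokenizer and the names `tokenize` and `ngrams_from_iter` are stand-ins assumed for this example; they do not reflect how vtext's tokenizers are implemented.

```python
import re
from collections import deque
from typing import Iterable, Iterator, Tuple


def tokenize(text: str) -> Iterator[str]:
    """Stand-in tokenizer: lazily yield runs of word characters."""
    for match in re.finditer(r"\w+", text):
        yield match.group()


def ngrams_from_iter(tokens: Iterable[str], n: int) -> Iterator[Tuple[str, ...]]:
    """Yield n-grams from any token iterator using a sliding window."""
    window = deque(maxlen=n)  # the oldest token drops out automatically
    for token in tokens:
        window.append(token)
        if len(window) == n:
            yield tuple(window)
```

The two steps then compose directly, for example `ngrams_from_iter(tokenize("the quick brown fox"), 2)`.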
PR: #82
Please take a look when you get a chance
Allowing documents to be tokenized into character n-grams would be useful.
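As a rough illustration of the requested behaviour (a Python sketch, not an existing vtext function; the name `char_ngrams` is made up for this example), character n-grams are just the overlapping substrings of length n:

```python
from typing import Iterator


def char_ngrams(text: str, n: int) -> Iterator[str]:
    """Yield every overlapping substring of length n, without padding."""
    for i in range(len(text) - n + 1):
        yield text[i:i + n]
```

For instance, `char_ngrams("vtext", 3)` yields `"vte"`, `"tex"` and `"ext"`. Whitespace and punctuation are treated like any other character here; whether n-grams should cross word boundaries is a separate design decision.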