joshlk opened 4 years ago
Some more information to provide some background...
Features:
An overview of how `KSkipNGramsIter` works:

`build_window` populates the first few items into a window of items.

`PadLeft` generates the left padding items from the pre-populated window. It essentially iterates through a number of parameters which generate different samples of items from the window:

- `n` - the number of items, i.e. 1, 2, 3, e.g. `[["One"], ["One", "Two"], ["One", "Two", "Three"] ...]`. The range is defined by `min_n` and `max_n`.
- `p` - the amount of padding. For example the gram `["<s>", "One", "Two"]` has `p=1`, `n=3`. The range is between 0 and `max_n - 1`.
- `sample_iter` - a separate iterator which is initialised for each `(n, p)` combination. When there is no skipping (`max_k=0`) it just produces the indexes of the first elements, i.e. `["One", "Two"]`. But when skipping is enabled (`k>0`) it iterates through all possible combinations. For example, if `k=2` (`n=2`, `p=0`) it would produce the indexes of `[["One", "Two"], ["One", "Three"], ["One", "Four"], ["Two", "Three"], ["Three", "Four"]]`.

`Main` - for each step of the window it iterates through all possible `n`s and gram skippings (`sample_iter`, i.e. `k>0`). Then the window is stepped one forward.

`MainEnd` - samples from the last window (once the input iterator has been exhausted). This is only necessary when `n` has a range (i.e. `min_n != max_n`). Here an offset parameter is incremented which simulates a smaller window.

`PadRight` - similar to `PadLeft`, as the name suggests.

Let me know if you require further clarification or information. Happy to help and take feedback 😃
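To make the `sample_iter` stage more concrete, here is an illustrative sketch (not the vtext code; `sample_indices` is a hypothetical helper): for one window position it enumerates index tuples of length `n` whose total number of skipped positions is at most `k`.

```rust
// Hypothetical sketch of per-window skip sampling, not the vtext implementation.
// Enumerates index tuples of length `n`, anchored at index 0, with at most `k`
// skipped positions in total.
fn sample_indices(n: usize, k: usize) -> Vec<Vec<usize>> {
    fn extend(n: usize, budget: usize, current: &mut Vec<usize>, out: &mut Vec<Vec<usize>>) {
        if current.len() == n {
            out.push(current.clone());
            return;
        }
        let prev = *current.last().unwrap();
        // Spend 0..=budget skips before choosing the next index.
        for skip in 0..=budget {
            current.push(prev + 1 + skip);
            extend(n, budget - skip, current, out);
            current.pop();
        }
    }
    let mut out = Vec::new();
    let mut current = vec![0]; // the first element of the gram is fixed
    extend(n, k, &mut current, &mut out);
    out
}

fn main() {
    let window = ["One", "Two", "Three", "Four"];
    let grams: Vec<Vec<&str>> = sample_indices(2, 2)
        .iter()
        .map(|idxs| idxs.iter().map(|&i| window[i]).collect())
        .collect();
    // The grams anchored at "One"; later window steps would contribute the rest.
    assert_eq!(grams, vec![vec!["One", "Two"], vec!["One", "Three"], vec!["One", "Four"]]);
}
```

This only covers the grams anchored at the first window position; stepping the window forward yields the grams starting at later tokens.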
It's very nice work, I'm just a bit concerned about the complexity of the implementation for future maintenance.
Fair point. I will implement a simpler version and then we can compare performance. You would also lose the ordering of tokens, but this isn't such a big problem.
I have changed the approach and it now chains ngram or skipgram iterators together. Overall it's much simpler to understand. The only issue is that you lose the ordering of the tokens, as said before, but that's not such a big issue.
I did some simple benchmarking, which isn't included in this pull request, and the performance is actually 2x faster, which was surprising as the input iterator is split using `tee` and multiple caches are used. So overall much better. Let me know what you think.
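The chained approach can be sketched as follows (illustrative only, not the vtext code; this version assumes the tokens are already collected into a slice, so no `tee`-style splitting is needed, and that `min_n >= 1`). It also shows the ordering trade-off: grams come out grouped by `n` rather than interleaved by position.

```rust
// Sketch of chaining per-n ngram iterators together (hypothetical, not vtext).
// For each n in min_n..=max_n, slide a window of size n over the tokens and
// chain the results. Grams are grouped by n, so positional ordering across
// gram sizes is lost. Assumes min_n >= 1 (windows(0) would panic).
fn everygrams<'a>(tokens: &[&'a str], min_n: usize, max_n: usize) -> Vec<Vec<&'a str>> {
    (min_n..=max_n)
        .flat_map(|n| tokens.windows(n).map(|w| w.to_vec()))
        .collect()
}

fn main() {
    let grams = everygrams(&["One", "Two", "Three"], 1, 2);
    assert_eq!(
        grams,
        vec![
            vec!["One"], vec!["Two"], vec!["Three"],
            vec!["One", "Two"], vec!["Two", "Three"],
        ]
    );
}
```

Note that `windows(n)` yields nothing when `n` exceeds the token count, so short inputs naturally produce fewer (or no) grams for the larger sizes.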
I have added the Python API. Currently it takes a list and outputs a list. Do we want the Python interface to take an iterator as input and produce an iterator?
Thanks a lot @joshlk ! I'll review in more detail tomorrow, but after a superficial glance this version looks significantly less complex.
Do we want the Python interface to take an iterator as input and produce an iterator?
I think a list is fine. That's what we are doing for tokenizers. If we change that later we should do it later everywhere, but I suspect that lists won't have much performance impact and are easier to manipulate.
There is just a linting issue in the CI.
I merged one minor suggestion in the test and then had to fix issues it introduced, sorry )
There are indeed a few clippy errors that it would be good to fix.
Otherwise we can merge, unless you wanted to add other things?
Though in terms of API it's maybe not ideal that if there are not enough tokens (here 2), e.g. for `KSkipNGrams(min_n=1, max_n=1, max_k=1)`, it would error. It's a very typical situation to have few or zero tokens for some documents and it might be better to return an empty list.
Also I'm actually somewhat confused as to why `KSkipNGrams(min_n=1, max_n=1, max_k=k)` returns the input as is (with an extra list container) for any `k`. I would have expected skip grams to only make sense for `max_n >= 2` and raise an error otherwise, or am I missing something?
Though in terms of API it's maybe not ideal that if there are not enough tokens (here 2), e.g. for `KSkipNGrams(min_n=1, max_n=1, max_k=1)`, it would error. It's a very typical situation to have few or zero tokens for some documents and it might be better to return an empty list.
Agreed. I have changed it so that empty input, or input that is smaller than `n`, gives an empty output. I have added tests for these cases. On a side note, strangely `"".split(" ")` does not produce an empty list in Rust but gives an output of `[""]`. This could be one to watch out for.
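The `split` quirk can be demonstrated directly with a small standalone snippet:

```rust
fn main() {
    // In Rust, splitting the empty string yields one empty item, not nothing:
    let parts: Vec<&str> = "".split(' ').collect();
    assert_eq!(parts, vec![""]);

    // split_whitespace (or filtering out empty items) gives the empty output
    // one might expect, which avoids feeding [""] into the n-gram iterator:
    let tokens: Vec<&str> = "".split_whitespace().collect();
    assert!(tokens.is_empty());
}
```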
Also I'm actually somewhat confused as to why `KSkipNGrams(min_n=1, max_n=1, max_k=k)` returns the input as is (with an extra list container) for any `k`. I would have expected skip grams to only make sense for `max_n >= 2` and raise an error otherwise, or am I missing something?
Because `k` can equal 0, which does not require `max_n >= 2`.
I went through the `cargo clippy` suggestions, which were very useful. Thanks for bringing my attention to that tool. It might be worth adding it to the CI toolchain. There were a few suggestions for other files in the package but I left those alone.
Otherwise we can merge, unless you wanted to add other things?
Do we want to change `CharacterTokenizer` so it uses this n-gram interface as the backend? I also noticed it uses `char_indices` to split the string into characters; would it be better to use `grapheme_indices` from the unicode-segmentation package?
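The difference matters for combining characters. A minimal std-only sketch of why `char_indices` can split inside what a user perceives as one character (the unicode-segmentation behaviour is described only in a comment, since that crate is not std):

```rust
fn main() {
    // "é" written as 'e' + U+0301 (combining acute accent): visually one
    // character, but two Rust chars, so char_indices-style splitting cuts
    // inside the grapheme cluster.
    let s = "e\u{301}";
    assert_eq!(s.chars().count(), 2);

    // grapheme_indices from the unicode-segmentation crate would instead
    // yield a single grapheme cluster for this string.
}
```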
Also, do we want to include the convenience methods that are included in NLTK in Python and Rust? (e.g. bigram, everygram ...)
benchmark
I have created some benchmarks in Python, and the vtext functions are currently considerably slower than the nltk equivalents. I'm not sure why this is the case at this point. Maybe it's to do with the conversion of strings from Rust to Python. This was the output:
# Testing 19924 documents
vtext: everygram: 133.95s [0.7 MB/s, 54 kWPS]
nltk: everygram: 5.06s [18.0 MB/s, 1419 kWPS]
vtext: skipgram: 498.58s [0.2 MB/s, 14 kWPS]
nltk: skipgram: 18.01s [5.1 MB/s, 399 kWPS]
This was the output:
Are you sure you ran `setup.py install` and not `develop`, to be in release mode?
Do we want to change CharacterTokenizer so it uses this n-gram interface as the backend? I also noticed it uses char_indices to split the string into characters, would it be better to use grapheme_indices from the unicode segmentation package?
Good point. Let's keep it for now and consider that in a follow up PR as this one is already getting quite large.
Also, do we want to include the convenience methods that are included in NLTK in python and rust? (e.g. bigram, everygram ...)
Well, NLTK doesn't really have a very homogeneous and consistent API across modules, as far as I can tell, so I would be careful not to adapt it too closely. But we certainly need to think about how to make all these parts work smoothly.
vtext functions are currently considerably slower than the nltk equivalents.
Hmm, yes I can confirm with `setup.py install`. We would need to profile (e.g. with flamegraph) to understand what's going on. I would say it shouldn't be Rust->Python conversions, since we are doing that for character ngrams in CharacterTokenizer and that's working fine in Python. I can look into profiling as well in the next few days.
Another unrelated thing I was wondering about: wouldn't it be faster and easier to map tokens to an index at the beginning, pass those `usize` indices around, and finally recover the list of n-gram string tokens at the end? Technically `&str` should be equivalent I guess, but we would have fewer issues with lifetimes and potentially avoid the cost of making copies.
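The interning idea could look roughly like this (a hypothetical sketch, not vtext code; `intern` and its return shape are made up for illustration): tokens are mapped to `usize` ids once, grams are built over ids, and strings are recovered only at the end.

```rust
use std::collections::HashMap;

// Hypothetical sketch of token interning (not the vtext implementation).
// Returns (ids, vocab): ids[i] is the index of tokens[i] in vocab.
fn intern<'a>(tokens: &[&'a str]) -> (Vec<usize>, Vec<&'a str>) {
    let mut table: HashMap<&'a str, usize> = HashMap::new();
    let mut vocab: Vec<&'a str> = Vec::new();
    let ids = tokens
        .iter()
        .map(|&t| {
            *table.entry(t).or_insert_with(|| {
                vocab.push(t);
                vocab.len() - 1
            })
        })
        .collect();
    (ids, vocab)
}

fn main() {
    let (ids, vocab) = intern(&["One", "Two", "One"]);
    assert_eq!(ids, vec![0, 1, 0]);

    // Build bigrams over the cheap Copy ids, then recover strings at the end:
    let bigrams: Vec<Vec<&str>> = ids
        .windows(2)
        .map(|w| w.iter().map(|&i| vocab[i]).collect())
        .collect();
    assert_eq!(bigrams, vec![vec!["One", "Two"], vec!["Two", "One"]]);
}
```

Since `usize` is `Copy`, the gram-building stages would carry no lifetimes at all, which is the point of the suggestion.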
Implemented k-skip-n-grams. Has convenience functions for bigram, trigram, ngrams, everygrams and skipgrams.
Provides the same output as the equivalent nltk functions - although nltk does generate duplicates sometimes which are omitted here. The iterator consumes the input iterator only once and holds a window of items to generate the grams. The window is stepped forward as it consumes the input. It also correctly generates left or right padding if specified.
Currently the iterator outputs `Vec<Vec<&str>>`. I'm unsure if this is desirable or whether the function should join the items into a string to output `Vec<String>`. E.g. currently it does `vec![vec!["One", "Two"], vec!["Two", "Three"] ...]` but it might be desirable to have `vec!["One Two", "Two Three"]`.

The nltk implementations tend to consume the input sequence multiple times and concatenate the output (for example everygram). I wanted the struct to consume the input iterator only once and provide an output. This however considerably complicated the code. I have tried to refactor it so it is as readable as possible, but it would be good to get a second pair of eyes on it.
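If the joined form turns out to be preferable, the conversion could also be done once at the end rather than inside the iterator (a minimal sketch under that assumption):

```rust
fn main() {
    // Nested grams as the iterator currently produces them:
    let grams = vec![vec!["One", "Two"], vec!["Two", "Three"]];

    // Joining each gram with a space converts Vec<Vec<&str>> into the
    // flat Vec<String> alternative:
    let joined: Vec<String> = grams.iter().map(|g| g.join(" ")).collect();
    assert_eq!(joined, vec!["One Two".to_string(), "Two Three".to_string()]);
}
```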
There is also still the question of how to chain the different components together (which links to #21). Currently the `transform` method takes an input iterator and provides an iterator as output which can be consumed by the user.

Todo:
- Add function for character ngrams