joshlk opened 4 years ago
Some more information to provide some background...
Features:
An overview of how `KSkipNGramsIter` works:

`build_window` populates the first few items into a window of items.

`PadLeft` generates the left padding items from the pre-populated window. It essentially iterates through a number of parameters which generate different samples of items from the window:

- `n` - the number of items, i.e. 1, 2, 3, e.g. `[["One"], ["One", "Two"], ["One", "Two", "Three"] ...]`. The range is defined by `min_n` and `max_n`.
- `p` - the amount of padding. For example the gram `["<s>", "One", "Two"]` has `p=1`, `n=3`. The range is between 0 and `max_n - 1`.
- `sample_iter` - a separate iterator which is initialised for each `(n, p)` combination. When there is no skipping (`max_k=0`) it just produces the indexes of the first elements, i.e. `["One", "Two"]`. But when skipping is enabled (`k>0`) it iterates through all possible combinations. For example, if `k=2` (`n=2`, `p=0`) it would produce the indexes of `[["One", "Two"], ["One", "Three"], ["One", "Four"], ["Two", "Three"], ["Three", "Four"]]`.

`Main` - for each step of the window it iterates through all possible `n`s and gram skippings (`sample_iter`, i.e. `k>0`). Then the window is stepped one forward.

`MainEnd` - samples from the last window (once the input iterator has been exhausted). This is only necessary when `n` has a range (i.e. `min_n != max_n`). Here an offset parameter is incremented which simulates a smaller window.

`PadRight` - similar to `PadLeft`, as the name suggests.

Let me know if you require further clarification or information. Happy to help and take feedback 😃
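To make the `sample_iter` stage more concrete, here is an illustrative sketch (not the vtext code; `sample_indices` is a hypothetical helper): for one window position it enumerates index tuples of length `n` whose total number of skipped positions is at most `k`.

```rust
// Hypothetical sketch of per-window skip sampling, not the vtext implementation.
// Enumerates index tuples of length `n`, anchored at index 0, with at most `k`
// skipped positions in total.
fn sample_indices(n: usize, k: usize) -> Vec<Vec<usize>> {
    fn extend(n: usize, budget: usize, current: &mut Vec<usize>, out: &mut Vec<Vec<usize>>) {
        if current.len() == n {
            out.push(current.clone());
            return;
        }
        let prev = *current.last().unwrap();
        // Spend 0..=budget skips before choosing the next index.
        for skip in 0..=budget {
            current.push(prev + 1 + skip);
            extend(n, budget - skip, current, out);
            current.pop();
        }
    }
    let mut out = Vec::new();
    let mut current = vec![0]; // the first element of the gram is fixed
    extend(n, k, &mut current, &mut out);
    out
}

fn main() {
    let window = ["One", "Two", "Three", "Four"];
    let grams: Vec<Vec<&str>> = sample_indices(2, 2)
        .iter()
        .map(|idxs| idxs.iter().map(|&i| window[i]).collect())
        .collect();
    // The grams anchored at "One"; later window steps would contribute the rest.
    assert_eq!(grams, vec![vec!["One", "Two"], vec!["One", "Three"], vec!["One", "Four"]]);
}
```

This only covers the grams anchored at the first window position; stepping the window forward yields the grams starting at later tokens.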
It's very nice work, I'm just a bit concerned about the complexity of the implementation for future maintenance.
Fair point. I will implement a simpler version and then we can compare performance. You would also lose the ordering of tokens, but this isn't such a big problem.
I have changed the approach and it now chains ngram or skipgram iterators together. Overall it's much simpler to understand. The only issue is that you lose the ordering of the tokens, as said before, but that's not such a big issue.
I did some simple benchmarking, which isn't included in this pull request, and the performance is actually 2x faster, which was surprising as the input iterator is split using `tee` and multiple caches are used. So overall much better. Let me know what you think.
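The chained approach can be sketched as follows (illustrative only, not the vtext code; this version assumes the tokens are already collected into a slice, so no `tee`-style splitting is needed, and that `min_n >= 1`). It also shows the ordering trade-off: grams come out grouped by `n` rather than interleaved by position.

```rust
// Sketch of chaining per-n ngram iterators together (hypothetical, not vtext).
// For each n in min_n..=max_n, slide a window of size n over the tokens and
// chain the results. Grams are grouped by n, so positional ordering across
// gram sizes is lost. Assumes min_n >= 1 (windows(0) would panic).
fn everygrams<'a>(tokens: &[&'a str], min_n: usize, max_n: usize) -> Vec<Vec<&'a str>> {
    (min_n..=max_n)
        .flat_map(|n| tokens.windows(n).map(|w| w.to_vec()))
        .collect()
}

fn main() {
    let grams = everygrams(&["One", "Two", "Three"], 1, 2);
    assert_eq!(
        grams,
        vec![
            vec!["One"], vec!["Two"], vec!["Three"],
            vec!["One", "Two"], vec!["Two", "Three"],
        ]
    );
}
```

Note that `windows(n)` yields nothing when `n` exceeds the token count, so short inputs naturally produce fewer (or no) grams for the larger sizes.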
I have added the Python API. Currently it takes a list and outputs a list. Do we want the Python interface to take an iterator as input and produce an iterator?
Thanks a lot @joshlk ! I'll review in more detail tomorrow, but after a superficial glance this version looks significantly less complex.
Do we want the Python interface to take an iterator as input and produce an iterator?
I think a list is fine. That's what we are doing for tokenizers. If we change that later we should do it later everywhere, but I suspect that lists won't have much performance impact and are easier to manipulate.
There is just a linting issue in the CI.
I merged one minor suggestion in the test and then had to fix issues it introduced, sorry )
There are indeed a few clippy errors that it would be good to fix.
Otherwise we can merge, unless you wanted to add other things?
Though in terms of API it's maybe not ideal that if there are not enough tokens (here 2), e.g. for `KSkipNGrams(min_n=1, max_n=1, max_k=1)`, it would error. It's a very typical situation to have few or zero tokens for some documents and it might be better to return an empty list.
Also I'm actually somewhat confused as to why `KSkipNGrams(min_n=1, max_n=1, max_k=k)` returns the input as is (with an extra list container) for any `k`. I would have expected skip grams to only make sense for `max_n >= 2` and raise an error otherwise, or am I missing something?
Though in terms of API it's maybe not ideal that if there are not enough tokens (here 2), e.g. for `KSkipNGrams(min_n=1, max_n=1, max_k=1)`, it would error. It's a very typical situation to have few or zero tokens for some documents and it might be better to return an empty list.
Agreed. I have changed it so that empty input, or input that is smaller than `n`, gives an empty output. I have added tests for these cases. On a side note, strangely `"".split(" ")` does not produce an empty list in Rust but gives an output of `[""]`. This could be one to watch out for.
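The `split` quirk can be demonstrated directly with a small standalone snippet:

```rust
fn main() {
    // In Rust, splitting the empty string yields one empty item, not nothing:
    let parts: Vec<&str> = "".split(' ').collect();
    assert_eq!(parts, vec![""]);

    // split_whitespace (or filtering out empty items) gives the empty output
    // one might expect, which avoids feeding [""] into the n-gram iterator:
    let tokens: Vec<&str> = "".split_whitespace().collect();
    assert!(tokens.is_empty());
}
```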
Also I'm actually somewhat confused as to why `KSkipNGrams(min_n=1, max_n=1, max_k=k)` returns the input as is (with an extra list container) for any `k`. I would have expected skip grams to only make sense for `max_n >= 2` and raise an error otherwise, or am I missing something?
Because `k` can equal 0, which does not require `max_n >= 2`.
I went through the `cargo clippy` suggestions, which were very useful. Thanks for bringing my attention to that tool. It might be worth adding it to the CI toolchain. There were a few suggestions for other files in the package but I left those alone.
Otherwise we can merge, unless you wanted to add other things?
Do we want to change `CharacterTokenizer` so it uses this n-gram interface as the backend? I also noticed it uses `char_indices` to split the string into characters; would it be better to use `grapheme_indices` from the unicode-segmentation package?
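The difference matters for combining characters. A minimal std-only sketch of why `char_indices` can split inside what a user perceives as one character (the unicode-segmentation behaviour is described only in a comment, since that crate is not std):

```rust
fn main() {
    // "é" written as 'e' + U+0301 (combining acute accent): visually one
    // character, but two Rust chars, so char_indices-style splitting cuts
    // inside the grapheme cluster.
    let s = "e\u{301}";
    assert_eq!(s.chars().count(), 2);

    // grapheme_indices from the unicode-segmentation crate would instead
    // yield a single grapheme cluster for this string.
}
```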
Also, do we want to include the convenience methods that are included in NLTK in Python and Rust? (e.g. bigram, everygram ...)
benchmark
I have created some benchmarks in Python, and the vtext functions are currently considerably slower than the nltk equivalents. I'm not sure why this is the case at this point. Maybe it's to do with the conversion of strings from Rust to Python. This was the output:
# Testing 19924 documents
vtext: everygram: 133.95s [0.7 MB/s, 54 kWPS]
nltk: everygram: 5.06s [18.0 MB/s, 1419 kWPS]
vtext: skipgram: 498.58s [0.2 MB/s, 14 kWPS]
nltk: skipgram: 18.01s [5.1 MB/s, 399 kWPS]
This was the output:
Are you sure you ran `setup.py install` and not `develop`, to be in release mode?
Do we want to change CharacterTokenizer so it uses this n-gram interface as the backend? I also noticed it uses char_indices to split the string into characters, would it be better to use grapheme_indices from the unicode segmentation package?
Good point. Let's keep it for now and consider that in a follow up PR as this one is already getting quite large.
Also, do we want to include the convenience methods that are included in NLTK in python and rust? (e.g. bigram, everygram ...)
Well, NLTK doesn't really have a very homogeneous and consistent API across modules, as far as I can tell, so I would be careful not to adapt it too closely. But we certainly need to think about how to make all these parts work smoothly.
vtext functions are currently considerably slower than the nltk equivalents.
Hmm, yes I can confirm with `setup.py install`. We would need to profile (e.g. with flamegraph) to understand what's going on. I would say it shouldn't be Rust->Python conversions, since we are doing that for character ngrams in CharacterTokenizer and that's working fine in Python. I can look into profiling as well in the next few days.
Another unrelated thing I was wondering about: wouldn't it be faster and easier to map tokens to an index at the beginning, pass those `usize` indices around, and finally recover the list of n-gram string tokens at the end? Technically `&str` should be equivalent I guess, but we would have fewer issues with lifetimes and potentially avoid the cost of making copies.
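The interning idea could look roughly like this (a hypothetical sketch, not vtext code; `intern` and its return shape are made up for illustration): tokens are mapped to `usize` ids once, grams are built over ids, and strings are recovered only at the end.

```rust
use std::collections::HashMap;

// Hypothetical sketch of token interning (not the vtext implementation).
// Returns (ids, vocab): ids[i] is the index of tokens[i] in vocab.
fn intern<'a>(tokens: &[&'a str]) -> (Vec<usize>, Vec<&'a str>) {
    let mut table: HashMap<&'a str, usize> = HashMap::new();
    let mut vocab: Vec<&'a str> = Vec::new();
    let ids = tokens
        .iter()
        .map(|&t| {
            *table.entry(t).or_insert_with(|| {
                vocab.push(t);
                vocab.len() - 1
            })
        })
        .collect();
    (ids, vocab)
}

fn main() {
    let (ids, vocab) = intern(&["One", "Two", "One"]);
    assert_eq!(ids, vec![0, 1, 0]);

    // Build bigrams over the cheap Copy ids, then recover strings at the end:
    let bigrams: Vec<Vec<&str>> = ids
        .windows(2)
        .map(|w| w.iter().map(|&i| vocab[i]).collect())
        .collect();
    assert_eq!(bigrams, vec![vec!["One", "Two"], vec!["Two", "One"]]);
}
```

Since `usize` is `Copy`, the gram-building stages would carry no lifetimes at all, which is the point of the suggestion.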
Implemented k-skip-n-grams. Has convenience functions for bigram, trigram, ngrams, everygrams and skipgrams.
Provides the same output as the equivalent nltk functions - although nltk does generate duplicates sometimes which are omitted here. The iterator consumes the input iterator only once and holds a window of items to generate the grams. The window is stepped forward as it consumes the input. It also correctly generates left or right padding if specified.
Currently the iterator outputs `Vec<Vec<&str>>`. I'm unsure if this is desirable or whether the function should join the items into a string to output `Vec<String>`. E.g. currently it does `vec![vec!["One", "Two"], vec!["Two", "Three"] ...]` but it might be desirable to have `vec!["One Two", "Two Three"]`.

The nltk implementations tend to consume the input sequence multiple times and concatenate the output (for example everygram). I wanted the struct to consume the input iterator only once and provide an output. This however considerably complicated the code. I have tried to refactor it so it is as readable as possible, but it would be good to get a second pair of eyes on it.
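If the joined form turns out to be preferable, the conversion could also be done once at the end rather than inside the iterator (a minimal sketch under that assumption):

```rust
fn main() {
    // Nested grams as the iterator currently produces them:
    let grams = vec![vec!["One", "Two"], vec!["Two", "Three"]];

    // Joining each gram with a space converts Vec<Vec<&str>> into the
    // flat Vec<String> alternative:
    let joined: Vec<String> = grams.iter().map(|g| g.join(" ")).collect();
    assert_eq!(joined, vec!["One Two".to_string(), "Two Three".to_string()]);
}
```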
There is also still the question of how to chain the different components together (which links to #21). Currently the `transform` method takes an input iterator and provides an iterator as output which can be consumed by the user.

Todo:
- Add function for character ngrams