NLP pipeline design - Githubissues

Ideally, an NLP pipeline in Rust could look something like,

preprocessor = DefaultPreprocessor::new()
tokenizer = RegexpTokenizer::new(r"\b\w\w+\b")
stemmer = SnowballStemmer::new("en")
analyzer = NgramAnalyzer(range=(1, 1))

pipe = collection
          .map(preprocessor)
          .map(tokenizer)
          .map(|tokens| tokens.map(stemmer))
          .map(analyzer)

where collection is an iterator over documents.

There are several chalenges with it though,

[ ] It is better to avoid allocating strings for tokens in each pre-processing step and instead use a slice of the original document. Performance depends very strongly on this. The current implementation e.g. of RegexpTokenizer takes a reference to the document and return an Iterable of &str with the same lifetime as the input document, but then borrow checker doesn't appear to be happy when it is used in the pipeline. This may be related to using closures (cf next point) though.
[ ] Because structs are not callable, collection.map(tokenizer) doesn't work, nor does collection.map(tokenizer.tokenize) (i.e. using a method) for some reason. We can use collection.map(|document| tokenizer.tokenize(&document)) but then lifetime is not properly handled between input and output (described in the previous point).

More investigation would be necessary, and both points are likely related.

Having a clearly defined pipeline is also useful for facilitating counting tokens in parallel (https://github.com/rth/text-vectorize/pull/20).

The current implementation e.g. of RegexpTokenizer takes a reference to the document and return an Iterable of &str with the same lifetime as the input document, but then borrow checker doesn't appear to be happy when it is used in the pipeline.

One way around it might be to return the original string, along with tokens (instead of just tokens) from tokenizers and analyzers.

Data structure

It is better to avoid allocating strings for tokens in each pre-processing step and instead use a slice of the original document. Performance depends very strongly on this.

I ran some simple benchmarks to compare the read/write performance of Vec<&str> and Vec<String>. I also compared a JaggeredArray which stores all the strings in a continuous block in memory (in contrast to Vec<String>) and therefore in theory should minimise cache misses.

I ran 3 read performance tests: summing the length of tokens, determining unique vocab and applying the stemmer to each token. Overall if you take into consideration the variance (+/-) in the results there isn't much difference - which surprised me. I assumed that Vec<String> would have a large performance costs as the strings would be scattered all over memory.

I ran 2 write performance test: 1) tokenising the text and outputting either a Vec<&str> or Vec<String> and 2) creating a vector of unique tokens (maps type T -> T). Here, as suggested by @rth, Vec<&str> hugely outperforms Vec<String>.

Vec<&str> also has the advantage of saving memory and preventing coping of the data.

Pipeline

I have been experimenting with different ways the pipeline components (tokenizer, stemmer, ngramer...) could be implemented. Here are my thoughts so far...

Each component could have a standard method transform. So the pipeline would something look like:

let tokens = tokenizer.transform(text);
let stemed = stemmer.transform(tokens);
let filtered = filter.transform(stemed).collect::<Vec<_>>();

Each component is producing and consuming, where appropriate, an iterator and the results are collected at the end of the pipeline. This reduces intermittent memory use and doesn't have a performance cost when compared to raw for-loops.

In the example, tokenizer is outputting an string slice &str referencing the original text but stemmer produces new strings and so outputs String. There is therefore a need to design components so that they can consume and produce tokens as either &str or String depending on previous downstream components. One way to overcome this is by wrapping the tokens in an enum which either contains a &str or String. This way we can keep the performance benefits of &str while making each component compatible with each other.

Here is a full dummy example (rust playground):

#[allow(dead_code)]
#[allow(unused_variables)]

#[derive(Debug)]
enum StringWrap<'a> {
    Slice(&'a str),
    String(String),
}

struct DummyTokenizer {
    split_on: String,
}

impl DummyTokenizer {
    pub fn transform<'a>(&'a self, text: &'a str) -> impl Iterator<Item = StringWrap<'a>> + 'a {
        text.split(&self.split_on).map(|x| StringWrap::Slice(x))
    }
}

struct DummyFilter {
    word: String,
}

impl DummyFilter {
    //for  TokenToToken
    pub fn transform<'a>(
        &'a self,
        tokens: impl Iterator<Item = StringWrap<'a>> + 'a,
    ) -> impl Iterator<Item = StringWrap<'a>> + 'a {
        tokens.filter(move |x| match x {
            StringWrap::Slice(s) => s.clone() != self.word,
            StringWrap::String(s) => s.clone() != self.word,
        })
    }
}

struct DummyStemmer {}

impl DummyStemmer {
    pub fn transform<'a>(
        &'a self,
        tokens: impl Iterator<Item = StringWrap<'a>> + 'a,
    ) -> impl Iterator<Item = StringWrap<'a>> + 'a {
        // Outputs a StringWrap::string
        tokens.map(|x| match x {
            StringWrap::Slice(s) => StringWrap::String([s, "ing"].join("")),
            StringWrap::String(s) => StringWrap::String([s, "ing".to_string()].join("")),
        })
    }
}

fn main() {
    // API example
    let text = "Marry had a little lamb.";

    // Pipeline components
    let tokenizer = DummyTokenizer {
        split_on: " ".to_string(),
    };
    let filter = DummyFilter {
        word: "lamb".to_string(),
    };
    let stemmer = DummyStemmer {};

    // Pipeline
    let output = tokenizer.transform(text);
    let output = filter.transform(output);
    let output = stemmer.transform(output).collect::<Vec<_>>();

    println!("{:?}", output);
}

In above, DummyFilter's inputs and outputs match. While DummyStemmer always produces string even if the input is string slices.

In summary:

Each component can consume and produce either strings or string slices
Each component exposes a standard transform method with consumes or produces iterators

Some possible problems:

It's currently not possible to write a trait which includes impl Iterator<...>. There a few work arounds but none of them are ideal. It might not be necessary to write a trait. Im not sure...?
Some work will need to be done on how to expose each component in Python in a performant way. I haven't looked into that yet.

Thanks a lot for the analysis and investigation @joshlk !

Each component exposes a standard transform

+1 generally on that. The initial reason I didn't go with it initially, I think, was that the signature of the transform method wouldn't be exactly the same between different components. Using an enum for &str / String is a nice workaround, but there is still the fact that some take as input a string (e.g. tokenizers), while other take as input an iterator of Strings. So even if the method name is the same, we won't easily be able to implement a common trait for them?

Each component [..] produces iterators

Sounds good as well. I was just thinking we might need to have some container object to avoid doing this chaining of iterators manually. Something along the lines of the pipeline object in spacy. It could be something along the lines of,

    let pipe = Pipeline::new(tokenizer, filter, stemmer);
    let output = pipe.transform(text);

(full example here) but that means each component would need to be identified by some trait and designed in advance. One could then apply this e.g. on a set of documents with,

    let documents = vec!["The Moon is an astronomical body orbiting Earth as its only natural satellite.", "It is the fifth-largest satellite in the Solar System", "The Moon is, after Jupiter's satellite Io, the second-densest satellite in the Solar System"];

    let output = documents.iter().map(|doc| pipe.transform(doc).collect::<Vec<_>>()).collect::<Vec<_>>();

or in parallel,

use rayon::prelude::*;

let output = documents.par_iter().map(|doc| pipe.transform(doc).collect::<Vec<_>>()).collect::<Vec<_>>();

A bit similarly to what is currently done for vectorizers here. The double collect of nested iterators is maybe not ideal though.

Another alternative could be to define an arbitrary list of steps e.g.

let step = vec![tokenizer, filter, stemmer];
let pipe = Pipeline::new(steps);
let output = pipe.transform(text);

this would be more flexible, but again because the signatures of transform are not fully identically it might be difficult to make it work.

Another information to consider is that HuggingFace tokenizers crate are essentially solving this same problem. They have a pipeline struct (called "Tokenizer") that consists of 4 steps, where for instance the signature of a whitespace tokenizer can be found here. I wouldn't have minded re-using their API or even some models since clearly they have good traction and more resources at the moment, however the limitations I see are,

that the return of each step is a Vec<(String, Offsets)> which would have a negative performance impact with respect to the current approach. I need to run their benchmarks to compare more precisely.
their API require to to compute offsets which might be a good thing in general, and particularly useful with neural net model, but that excludes regexp and unicode segmentation tokenizers as far as I understand.

Studying their API more and possible ways to interact might be worthwhile though.

I ran some simple benchmarks to compare the read/write performance

It could be nice to add some of those benchmarks for tokenizers. It would allow measuring in particular the overhead of the Python wrapper by comparing with results in bechmarks/

It's currently not possible to write a trait which includes impl Iterator<...>. There a few work arounds but none of them are ideal. It might not be necessary to write a trait. Im not sure...?

Returning a Box<dyn Iterator<...>> as done in https://github.com/rth/vtext/pull/78 should work I think though I agree it's not the easiest to read.

+1 generally on that. The initial reason I didn't go with it initially, I think, was that the signature of the transform method wouldn't be exactly the same between different components. Using an enum for &str / String is a nice workaround, but there is still the fact that some take as input a string (e.g. tokenizers), while other take as input an iterator of Strings. So even if the method name is the same, we won't easily be able to implement a common trait for them?

We can construct a pipeline function using a macro that doesn't require a common trait. All that is required is a transform method. For example:

macro_rules! pipeline {
    ($input:expr, $($args:expr),*) => {{
        let output = $input;
        $(
            let output = $args.transform(output);
        )*
        output.collect()
    }}
}

let vecs: Vec<String> = pipeline!(
    text,
    tokenizer,
    filter,
    stemmer
);

If there isn't a common trait, is there any other downsides?

Another option could be that the tokenizer's transform method maps an iterator to an iterator and to use it you have to wrap the input in an iterator. Another standard method could be transform_single which does just that.

or in parallel

I don't have any experience with rayon or parallel processing in rust and so can't comment here.

HuggingFace: return of each step is a Vec<(String, Offsets)>

Vec<(String, Offsets)> is a similar idea to a jagged array.

I don't think they have a unifying trait or interface for all pipeline components from what I can see.

Returning a Box<dyn Iterator<...>>

I believe this has a performance impact as it uses dynamic dispatch. But I think it's necessary if you want all components to have a common trait.

As far as I can work out (correct me if I'm wrong) you have to make a trade off:

Components derive from a common trait and use Box<dyn Iterator<...>> in the method signatures
Components don't derive from a common trait and use impl Iterator<...> in the method signatures

So there are two considerations:

What is the performance impact of using dynamic dispatch?
What is lost if the components don't derive from a common trait?

I ran a benchmark comparing using Box<dyn Iterator<...>> and impl Iterator<...>. The results are:

test dyn_pipeline  ... bench: 206,121,099 ns/iter (+/- 53,332,452)
test impl_pipeline ... bench: 217,948,970 ns/iter (+/- 157,269,946)

So doesn't seem like much of a difference. Which is good as Box<dyn Iterator<...>> is a lot more simpler to use and we can create a common trait.

Here is an example of a shared trait using Box<dyn Iterator<...>>:

#[allow(dead_code)]
#[allow(unused_variables)]
#[derive(Debug)]
enum StringWrap<'a> {
    Slice(&'a str),
    String(String),
}

trait PipelineComponent<'a> {
    type InItem;
    type OutItem;

    fn transform(
        &'a self,
        items: Box<dyn Iterator<Item = Self::InItem> + 'a>,
    ) -> Box<dyn Iterator<Item = Self::OutItem> + 'a>;
}

struct DummyTokenizer {
    split_on: String,
}

impl<'a> DummyTokenizer {
    fn transform_text(&'a self, text: &'a str) -> Box<dyn Iterator<Item = StringWrap<'a>> + 'a> {
        let iter = text.split(&self.split_on).map(|x| StringWrap::Slice(x));
        Box::new(iter)
    }
}

impl<'a> PipelineComponent<'a> for DummyTokenizer {
    type InItem = &'a str;
    type OutItem = StringWrap<'a>;

    fn transform(
        &'a self,
        items: Box<dyn Iterator<Item = Self::InItem> + 'a>,
    ) -> Box<dyn Iterator<Item = Self::OutItem> + 'a> {
        let iter = items.flat_map(move |s| self.transform_text(s));
        Box::new(iter)
    }
}

struct DummyFilter {
    word: String,
}

impl<'a> PipelineComponent<'a> for DummyFilter {
    type InItem = StringWrap<'a>;
    type OutItem = Self::InItem;

    fn transform(
        &'a self,
        items: Box<dyn Iterator<Item = Self::InItem> + 'a>,
    ) -> Box<dyn Iterator<Item = Self::OutItem> + 'a> {
        let iter = items.filter(move |x| match x {
            StringWrap::Slice(s) => s.clone() != self.word,
            StringWrap::String(s) => s.clone() != self.word,
        });
        Box::new(iter)
    }
}

struct DummyStemmer {}

impl<'a> PipelineComponent<'a> for DummyStemmer {
    type InItem = StringWrap<'a>;
    type OutItem = Self::InItem;

    fn transform(
        &'a self,
        items: Box<dyn Iterator<Item = Self::InItem> + 'a>,
    ) -> Box<dyn Iterator<Item = Self::OutItem> + 'a> {
        // Outputs a StringWrap::string
        let iter = items.map(|x| match x {
            StringWrap::Slice(s) => StringWrap::String([s, "ing"].join("")),
            StringWrap::String(s) => StringWrap::String([s, "ing".to_string()].join("")),
        });
        Box::new(iter)
    }
}

struct DummyEmbedder {}

impl<'a> PipelineComponent<'a> for DummyEmbedder {
    type InItem = StringWrap<'a>;
    type OutItem = Vec<u8>;

    fn transform(
        &'a self,
        items: Box<dyn Iterator<Item = Self::InItem> + 'a>,
    ) -> Box<dyn Iterator<Item = Self::OutItem> + 'a> {
        let iter = items.map(|x| match x {
            StringWrap::Slice(s) => Vec::from(s.as_bytes()),
            StringWrap::String(s) => Vec::from(s.as_bytes()),
        });
        Box::new(iter)
    }
}

fn main() {
    // API example
    let text = "Marry had a little lamb.";

    // Pipeline components
    let tokenizer = DummyTokenizer {
        split_on: " ".to_string(),
    };
    let filter = DummyFilter {
        word: "lamb".to_string(),
    };
    let stemmer = DummyStemmer {};
    let embedder = DummyEmbedder {};

    // Pipeline
    let output = tokenizer.transform_text(text);
    let output = filter.transform(output);
    let output = stemmer.transform(output);
    let output = embedder.transform(output).collect::<Vec<u8>>();

    println!("{:?}", output);
}

Key features:

All components implement from PipelineComponent which has a transform method
DummyTokenizer has a transform_text method for a single pice of text and then implements transform for multiple items of text
The input and output types are annotatable. So for example DummyEmbedder has an input item type of StringWrap and outputs Vec<u8>

I haven't done so already but it should be relatively easy to create a Pipeline object.

Thanks for investigating @joshlk !

We can construct a pipeline function using a macro that doesn't require a common trait. [..] If there isn't a common trait, is there any other downsides?

The macro approach is interesting in that it would allow combining steps with more flexibility. The issue that if there is no pipeline object, it would be a bit harder to interface with code that is expected to take a pipeline as input (e.g. here) and it won't be possible to serialize pipelines.

I ran a benchmark comparing using Box<dyn Iterator<...>> and impl Iterator<...>. [..] So doesn't seem like much of a difference.

Thanks for doing that! Yes, I was hoping there wouldn't be that much difference. I also get very similar results when running the benchmark in your branch.

Here is an example of a shared trait using Box<dyn Iterator<...>>:

I'm getting the following error when compiling this code, probably a minor issue,

``` error[E0277]: a value of type `std::vec::Vec` cannot be built from an iterator over elements of type `std::vec::Vec` --> examples/example1.rs:118:45 | 118 | let output = embedder.transform(output).collect::>(); | ^^^^^^^ value of type `std::vec::Vec` cannot be built from `std::iter::Iterator>` | = help: the trait `std::iter::FromIterator>` is not implemented for `std::vec::Vec` error: aborting due to previous error For more information about this error, try `rustc --explain E0277`. error: could not compile `vtext`. ```

but removing the DummyEmbedder step works as expected.

Static dispatch for input could also work if we want to, though maybe the signatures are getting a bit more complex,

trait PipelineComponent<'a> {
    type InItem;
    type OutItem;

    fn transform<T>(&'a self, items: T) -> Box<dyn Iterator<Item = Self::OutItem> + 'a>
    where
        T: Iterator<Item = Self::InItem> + 'a;
}

(example in https://github.com/rth/vtext/blob/example/pipeline2/examples/example1.rs)

Maybe PipelineStep instead for the name?

DummyTokenizer has a transform_text method for a single piece of text and then implements transform for multiple items of text

To avoid this inconsistency I wonder if the following would be done,

impl<'a> PipelineComponent<'a> for DummyTokenizer {
    type In = Iterator<Item = &'a str>;
    type OutItem = StringWrap<'a>;

    fn transform<T: Self::In + 'a>(&'a self, items: T) -> Box<dyn Iterator<Item = Self::OutItem> + 'a>

    { ... }
}

(so I far I haven't managed to make this compile, but I'll try later).

Overall the parametrizable types for input/output sound good.

1) tokenising the text and outputting either a Vec<&str> or Vec

As an alternative we could also consider using smolstr or smartstring to return a more consistent type. Benchmarks can be found in https://fasterthanli.me/articles/small-strings-in-rust

Another thought regarding pipeline design.

I am experimenting with creating Rust functions that input and output iterators that can be linked together in Python. For example in the below code: the function string_to_iterator creates an iterator from a given input string (it splits the string) and the function upercase applies to_upercase to each item in a given input iterator and outputs an iterator.

So the python signatures of the two functions are (roughly):

def string_to_iterator(input: str) -> Iterator[str]
def upercase(input: Iterator[str]) -> Iterator[str]

Both string_to_iterator and upercase are functions written in Rust with a PyO3 wrapper. The key thing is that it is possible to feed the input of one to the output of the other dynamically in Python without converting the intermediate output to Python objects.

So in Python

from rustlib import string_to_iterator, upercase

iter_1 = string_to_iterator("a bb ccc 333")  # Creates an iterator
iter_2 = upercase(iter_1) # Creates an new iterator from the previous
print(iter_2.__next__()) # "A"
print(iter_2.__next__()) # "BB"

So iter_2 takes the input from iter_1 and this is done without the values being converted to Python objects and back to Rust.

This is achieved by wrapping iterators in a struct IterBox which holds a boxed iterator. The IterBox is what is passed between the Python functions.

Here is the Rust code:

extern crate pyo3;
use pyo3::prelude::*;
use pyo3::{wrap_pyfunction, PyIterProtocol};

#[pyfunction]
fn string_to_iterator<'py>(py: Python<'py>, input: String) -> PyResult<IterBox> {
    let input_vec: Vec<String> = input.split(" ").map(|s| s.to_string()).collect();
    let iter = input_vec.into_iter();

    Ok(IterBox {
        iter: Box::new(iter)
    })
}

#[pyfunction]
fn upercase<'py>(py: Python<'py>, iter: IterBox) -> PyResult<IterBox> {
    let iter_new = iter.iter.map(|s| s.to_uppercase());

    Ok(IterBox {
        iter: Box::new(iter_new)
    })
}

#[pyclass]
struct IterBox {
    iter: Box<dyn CloneIterator<Item=String> + Send>,
}

#[pyproto]
impl PyIterProtocol for IterBox {
    fn __iter__(slf: PyRefMut<Self>) -> Py<Self> {
        slf.into()
    }

    fn __next__(mut slf: PyRefMut<Self>) -> Option<String> {
        slf.iter.next()
    }
}

// --- Make IterBox cloneable
trait CloneIterator: Iterator + Send {
    fn clone_box(&self) -> Box<dyn CloneIterator<Item = Self::Item> + Send>;
}

impl<T> CloneIterator for T
    where
        T: 'static + Iterator + Clone + Send,
{
    fn clone_box(&self) -> Box<dyn CloneIterator<Item = Self::Item> + Send> {
        Box::new(self.clone())
    }
}

impl Clone for Box<dyn CloneIterator<Item=String> + Send> {
    fn clone(&self) -> Box<dyn CloneIterator<Item=String> + Send> {
        (**self).clone_box()
    }
}

impl Clone for IterBox {
    fn clone(&self) -> IterBox {
        IterBox {
            iter: self.iter.clone_box()
        }
    }
}
// ------

#[pymodule]
fn rustlib(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_wrapped(wrap_pyfunction!(string_to_iterator))?;
    m.add_class::<IterBox>()?;
    m.add_wrapped(wrap_pyfunction!(upercase))?;
    Ok(())
}

I think further work could be done to make the current implementation more ergonomic as it currently clones the iterator when passed as an argument value. But I hope I paint the general picture of whats achievable.

This setup could be used as a way of linking pipeline objects in Python. This might mitigate the need to have a pipeline object in Rust which is compatible with Python.

rth / vtext

NLP pipeline design #21

Data structure

Pipeline