rth / vtext

Simple NLP in Rust with Python bindings
Apache License 2.0

Make to_ascii_lowercase optional #63

Open technic opened 4 years ago

technic commented 4 years ago

Hi, thanks for the cool crate!

Could you remove to_ascii_lowercase or make it optional? I think such pre-processing should be done on the library client side, since it is simple (.map(|doc| doc.to_ascii_lowercase())) and is not required for the main heavy tokenization, fitting and transform logic. I would prefer to call it myself when needed.
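A minimal sketch of the client-side approach described above. The `count_tokens` function is a hypothetical stand-in for a vectorizer's fitting step, not the vtext API; it only shows that the caller can lowercase documents with a plain `map` before handing them over.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a vectorizer's fit step (NOT the vtext API).
/// It counts whitespace-separated tokens exactly as given, with no case folding.
fn count_tokens<I, S>(docs: I) -> HashMap<String, usize>
where
    I: IntoIterator<Item = S>,
    S: AsRef<str>,
{
    let mut counts = HashMap::new();
    for doc in docs {
        for token in doc.as_ref().split_whitespace() {
            *counts.entry(token.to_string()).or_insert(0) += 1;
        }
    }
    counts
}

fn main() {
    let docs = ["The cat", "the CAT sat"];

    // No pre-processing: "The"/"the" and "cat"/"CAT" are distinct tokens.
    let raw = count_tokens(docs.iter());

    // Lowercasing done by the caller, before the library sees the documents.
    let lowered = count_tokens(docs.iter().map(|doc| doc.to_ascii_lowercase()));

    println!("distinct tokens: raw = {}, lowered = {}", raw.len(), lowered.len());
}
```

With this split, the library never needs to know whether case folding happened.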

rth commented 4 years ago

I agree it should definitely be optional. CountVectorizerParams already has a parameter for it, but currently it's not used.

> I think such pre-processing should be done on the library client side

Well, the to_ascii_lowercase used for now is indeed fast, but proper Unicode lowercasing with str::to_lowercase is significantly slower (https://github.com/rust-lang/rust/issues/26244#issuecomment-344525748) and could benefit from running inside the parallel pipeline.
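To illustrate the difference between the two standard-library methods mentioned above: `to_ascii_lowercase` only maps the ASCII letters A-Z and leaves everything else untouched, while `to_lowercase` applies Unicode case mapping. The `lowercase` helper and its `unicode` flag are hypothetical names used only for this sketch.

```rust
/// Hypothetical helper: `unicode` selects full Unicode case mapping,
/// otherwise only ASCII letters are lowercased.
fn lowercase(doc: &str, unicode: bool) -> String {
    if unicode {
        doc.to_lowercase()
    } else {
        doc.to_ascii_lowercase()
    }
}

fn main() {
    let doc = "ÉCOLE Normale";

    // ASCII-only: the non-ASCII 'É' is left as-is.
    println!("{}", lowercase(doc, false)); // prints "École normale"

    // Unicode-aware: 'É' becomes 'é'.
    println!("{}", lowercase(doc, true)); // prints "école normale"
}
```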

technic commented 4 years ago

Maybe you can pass an optional lambda as an argument, because the tokenizer cannot do this due to its &str -> &str signature. Actually, it may be useful for the tokenizer to hold internal state, like some buffer which can be shared between two iterator instances. But for that, tokenize would have to take &mut self as its first argument, and a new tokenizer instance would have to be created in each thread.
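A rough sketch of the optional-lambda idea, assuming a hypothetical `tokenize_with` function (not part of vtext). Because a pre-processing step such as lowercasing produces a new `String`, the tokens can no longer borrow from the input, so this sketch returns owned tokens.

```rust
/// Hypothetical function: applies an optional pre-processing closure to the
/// document, then splits on whitespace. Tokens are owned because the
/// pre-processed text is a temporary that cannot be borrowed by the caller.
fn tokenize_with(doc: &str, preprocess: Option<&dyn Fn(&str) -> String>) -> Vec<String> {
    match preprocess {
        Some(f) => {
            let processed = f(doc);
            let tokens: Vec<String> =
                processed.split_whitespace().map(str::to_owned).collect();
            tokens
        }
        None => doc.split_whitespace().map(str::to_owned).collect(),
    }
}

fn main() {
    let lower = |doc: &str| doc.to_ascii_lowercase();

    // With the optional closure: tokens are lowercased.
    println!("{:?}", tokenize_with("Hello World", Some(&lower)));

    // Without it: tokens are passed through unchanged.
    println!("{:?}", tokenize_with("Hello World", None));
}
```

The cost of this flexibility is one allocation per token, which is the memory-copy trade-off the `&str` signature was meant to avoid.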

rth commented 4 years ago

> Maybe you can pass an optional lambda as an argument, because the tokenizer cannot do this due to its &str -> &str signature.

A PR would be welcome. Initially it was &str to avoid memory copies, but realistically there may be no way around copying with a somewhat generic API.

> Actually, it may be useful for the tokenizer to hold internal state, like some buffer which can be shared between two iterator instances. But for that, tokenize would have to take &mut self as its first argument, and a new tokenizer instance would have to be created in each thread.

What's the use case for internal state in tokenizers? RegexpTokenizer does have internal state inside Regex (and creating that is slow), but I would still think that Tokenizer.tokenize shouldn't change the internal state in general.
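For reference, the `&mut self` design under discussion might look roughly like the sketch below. `BufferedTokenizer` and `tokenize_chunks` are hypothetical names, not vtext types; the scratch buffer is just one conceivable kind of internal state, and each worker thread builds its own tokenizer because a `&mut` borrow cannot be shared across threads.

```rust
use std::thread;

/// Hypothetical tokenizer that reuses a scratch buffer between calls.
struct BufferedTokenizer {
    scratch: String,
}

impl BufferedTokenizer {
    fn new() -> Self {
        Self { scratch: String::new() }
    }

    /// `&mut self` lets each call reuse `scratch` for the lowercased copy
    /// instead of allocating a fresh String per document.
    fn tokenize(&mut self, doc: &str) -> Vec<String> {
        self.scratch.clear();
        self.scratch.push_str(doc);
        self.scratch.make_ascii_lowercase();
        self.scratch.split_whitespace().map(str::to_owned).collect()
    }
}

/// One tokenizer instance per thread; each thread processes one chunk.
fn tokenize_chunks(chunks: Vec<Vec<String>>) -> Vec<Vec<String>> {
    let handles: Vec<_> = chunks
        .into_iter()
        .map(|chunk| {
            thread::spawn(move || {
                let mut tokenizer = BufferedTokenizer::new();
                chunk
                    .iter()
                    .map(|doc| tokenizer.tokenize(doc))
                    .collect::<Vec<_>>()
            })
        })
        .collect();

    // Join in spawn order so the output order matches the input order.
    handles
        .into_iter()
        .flat_map(|handle| handle.join().expect("worker thread panicked"))
        .collect()
}

fn main() {
    let chunks = vec![
        vec!["The Cat".to_string(), "SAT down".to_string()],
        vec!["On The Mat".to_string()],
    ];
    println!("{:?}", tokenize_chunks(chunks));
}
```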

technic commented 4 years ago

> RegexpTokenizer does have an internal state inside Regex

I think Regex uses RefCell inside to maintain some cache and hide the interior mutability. Well, this is the other option.
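The interior-mutability option can be sketched as follows. This shows only the general `RefCell` pattern, not how the regex crate is actually implemented; `CachingTokenizer` and its methods are hypothetical. The tokenizer keeps a `&self` signature while still updating a cache. Note that `RefCell` is not `Sync`, so such a tokenizer could not be shared by reference between threads.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

/// Hypothetical tokenizer that hides a mutable cache behind `&self`.
struct CachingTokenizer {
    cache: RefCell<HashMap<String, Vec<String>>>,
}

impl CachingTokenizer {
    fn new() -> Self {
        Self { cache: RefCell::new(HashMap::new()) }
    }

    /// Takes `&self`, yet may insert into the cache via `RefCell`.
    fn tokenize(&self, doc: &str) -> Vec<String> {
        // The shared borrow ends at the end of this statement, so the
        // `borrow_mut` below cannot conflict with it at runtime.
        let cached = self.cache.borrow().get(doc).cloned();
        if let Some(tokens) = cached {
            return tokens;
        }

        let tokens: Vec<String> = doc.split_whitespace().map(str::to_owned).collect();
        self.cache.borrow_mut().insert(doc.to_owned(), tokens.clone());
        tokens
    }

    /// Number of distinct documents currently cached.
    fn cached_docs(&self) -> usize {
        self.cache.borrow().len()
    }
}

fn main() {
    let tokenizer = CachingTokenizer::new();
    println!("{:?}", tokenizer.tokenize("a b c"));
    println!("{:?}", tokenizer.tokenize("a b c")); // served from the cache
    println!("cached documents: {}", tokenizer.cached_docs());
}
```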