technic opened this issue 4 years ago
I agree it should definitely be optional. `CountVectorizerParams` already has a parameter for it, but currently it's not used.

> I think such pre-processing should be done on the library client side

Well, the `to_ascii_lowercase` used for now is indeed fast, but proper Unicode lowercasing with `str::to_lowercase` is significantly slower (https://github.com/rust-lang/rust/issues/26244#issuecomment-344525748) and could also benefit from being run in that parallel pipeline.
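To illustrate why the choice matters for correctness and not just speed, here is the behavioral difference between the two methods on non-ASCII input:

```rust
fn main() {
    let doc = "ÉCOLE Polytechnique";

    // ASCII-only lowercasing: a fast byte-level pass, but it leaves
    // non-ASCII letters such as 'É' untouched.
    assert_eq!(doc.to_ascii_lowercase(), "École polytechnique");

    // Full Unicode lowercasing: handles 'É' correctly, but is measurably
    // slower because it goes through the Unicode case-mapping tables.
    assert_eq!(doc.to_lowercase(), "école polytechnique");
}
```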
Maybe you can pass an optional lambda as an argument, because the tokenizer cannot do this due to its `&str -> &str` signature. Actually, it may be useful to hold internal state for the tokenizer, like some buffer which can be shared between two iterator instances. But for that you have to make `tokenize` take `&mut self` as its first argument, and create a new instance of the tokenizer in each thread.
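A rough sketch of the optional-lambda idea (the `VectorizerParams` builder and `with_preprocessor` name here are hypothetical, not the crate's actual API; the closure returns an owned `String` since lowercasing can change byte length):

```rust
// Hypothetical params builder holding an optional preprocessing closure.
struct VectorizerParams {
    preprocessor: Option<Box<dyn Fn(&str) -> String + Sync>>,
}

impl VectorizerParams {
    fn new() -> Self {
        VectorizerParams { preprocessor: None }
    }

    fn with_preprocessor<F>(mut self, f: F) -> Self
    where
        F: Fn(&str) -> String + Sync + 'static,
    {
        self.preprocessor = Some(Box::new(f));
        self
    }

    // Apply the preprocessor if one is set, otherwise pass the document through.
    fn preprocess(&self, doc: &str) -> String {
        match &self.preprocessor {
            Some(f) => f(doc),
            None => doc.to_string(),
        }
    }
}

fn main() {
    let params = VectorizerParams::new().with_preprocessor(|d: &str| d.to_lowercase());
    assert_eq!(params.preprocess("Hello WORLD"), "hello world");
}
```

Because the closure is `Sync`, the same params object could be shared across the threads of the parallel pipeline.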
> Maybe you can pass an optional lambda as an argument, because the tokenizer cannot do this due to its `&str -> &str` signature.

A PR would be welcome. Initially it was `&str` to avoid memory copies, but maybe realistically there is no way around it with a somewhat generic API.

> Actually, it may be useful to hold internal state for the tokenizer, like some buffer which can be shared between two iterator instances. But for that you have to make `tokenize` take `&mut self` as its first argument, and create a new instance of the tokenizer in each thread.

What's the use case for internal state in tokenizers? `RegexpTokenizer` does have internal state inside `Regex` (and creating that is slow), but I would still think that `Tokenizer.tokenize` shouldn't change the internal state in general.
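For reference, the stateful variant under discussion could look roughly like this (a sketch, not the crate's actual trait): `tokenize` takes `&mut self` so the token buffer's allocation is reused between documents, at the cost of needing one tokenizer instance per thread.

```rust
// Hypothetical stateful tokenizer that reuses an internal buffer.
struct BufferedTokenizer {
    buf: Vec<String>, // scratch space reused between calls
}

impl BufferedTokenizer {
    fn new() -> Self {
        BufferedTokenizer { buf: Vec::new() }
    }

    // &mut self is required because each call overwrites the shared buffer;
    // the returned slice borrows from it, so a previous result is invalidated
    // by the next call.
    fn tokenize(&mut self, doc: &str) -> &[String] {
        self.buf.clear(); // keeps the Vec's allocation from the last call
        self.buf.extend(doc.split_whitespace().map(str::to_string));
        &self.buf
    }
}

fn main() {
    let mut tok = BufferedTokenizer::new();
    assert_eq!(tok.tokenize("a b c").len(), 3);
    // The second call reuses the same Vec allocation.
    assert_eq!(tok.tokenize("hello world").join(" "), "hello world");
}
```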
> `RegexpTokenizer` does have internal state inside `Regex`

I think `Regex` uses `RefCell` inside to maintain some cache and hide the internal mutability. Well, this is the other option.
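A minimal sketch of that interior-mutability option: a cache hidden behind `RefCell` so the public API stays `&self`. (This is only to illustrate the pattern; `RefCell` is not thread-safe, which is why per-thread instances, or a sync pool as the regex crate itself uses for its scratch space, would still be needed in a parallel pipeline.)

```rust
use std::cell::RefCell;
use std::collections::HashMap;

// Tokenizer whose cache mutates behind a RefCell, keeping tokenize at &self.
struct CachingTokenizer {
    cache: RefCell<HashMap<String, Vec<String>>>,
}

impl CachingTokenizer {
    fn new() -> Self {
        CachingTokenizer { cache: RefCell::new(HashMap::new()) }
    }

    // &self, not &mut self: all mutation goes through the RefCell.
    fn tokenize(&self, doc: &str) -> Vec<String> {
        if let Some(hit) = self.cache.borrow().get(doc) {
            return hit.clone();
        }
        let toks: Vec<String> = doc.split_whitespace().map(str::to_string).collect();
        self.cache.borrow_mut().insert(doc.to_string(), toks.clone());
        toks
    }
}

fn main() {
    let tok = CachingTokenizer::new();
    let first = tok.tokenize("one two");
    let second = tok.tokenize("one two"); // served from the cache
    assert_eq!(first, second);
    assert_eq!(first, vec!["one".to_string(), "two".to_string()]);
}
```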
Hi, thanks for the cool crate!

Could you remove `to_ascii_lowercase` or make it optional? I think such pre-processing should be done on the library client side, since it is simple (`.map(|doc| doc.to_ascii_lowercase())`) and is not required for the main heavy tokenization, fitting, and transform logic. I would prefer to call it myself when needed.
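That client-side one-liner in full, for the sake of concreteness (assuming documents are held as an iterable of `String`s):

```rust
fn main() {
    let docs = vec!["The Quick FOX".to_string(), "Jumped OVER".to_string()];

    // Lowercase on the caller's side, before handing documents to the vectorizer.
    let lowered: Vec<String> = docs.iter().map(|doc| doc.to_ascii_lowercase()).collect();

    assert_eq!(lowered, vec!["the quick fox", "jumped over"]);
}
```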