rth / vtext

Simple NLP in Rust with Python bindings
Apache License 2.0

Support different hash functions in HashingVectorizer #10

Closed rth closed 5 years ago

rth commented 5 years ago

Currently, we use the MurmurHash3 hash function from the rust-fasthash crate (to stay close to the scikit-learn implementation). That crate also supports a number of other hash functions:

City Hash, Farm Hash, Metro Hash, Mum Hash, Sea Hash, Spooky Hash, T1 Hash, xx Hash

I'm not convinced hashing is currently the performance bottleneck, but in any case using a faster hash function such as xxhash would not hurt.

This would involve updating the text-vectorize crate and adding a hasher parameter to the HashingVectorizer Python estimator.
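To make the proposal concrete, here is a minimal sketch of what a `hasher` parameter could look like on the Python side. Everything in it is an assumption for illustration: `ToyHashingVectorizer`, `HASHERS` and `transform_one` are hypothetical names, not the vtext API, and the hash functions are standard-library stand-ins rather than MurmurHash3 or xxHash.

```python
import hashlib
import zlib
from collections import Counter

# Stand-in hash functions from the standard library. They are NOT the
# functions listed in this issue (MurmurHash3, xxHash, ...); they only
# show how a `hasher` name could select the implementation.
HASHERS = {
    "crc32": lambda data: zlib.crc32(data),
    "adler32": lambda data: zlib.adler32(data),
    "blake2b": lambda data: int.from_bytes(
        hashlib.blake2b(data, digest_size=8).digest(), "little"
    ),
}


class ToyHashingVectorizer:
    """Hypothetical sketch of an estimator with a `hasher` parameter."""

    def __init__(self, n_features=2**20, hasher="crc32"):
        if hasher not in HASHERS:
            raise ValueError(f"unknown hasher: {hasher!r}")
        self.n_features = n_features
        self.hasher = hasher

    def transform_one(self, text):
        """Map whitespace tokens to {bucket index: count}."""
        hash_fn = HASHERS[self.hasher]
        counts = Counter()
        for token in text.split():
            index = hash_fn(token.encode("utf-8")) % self.n_features
            counts[index] += 1
        return dict(counts)


vectorizer = ToyHashingVectorizer(n_features=16, hasher="blake2b")
row = vectorizer.transform_one("the quick brown fox jumps over the lazy dog")
```

Selecting the function by name keeps the estimator's constructor simple and lets unknown names fail early, which is one plausible design; the real implementation would dispatch inside the Rust crate instead.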

Another use case could be to use several different hash functions to reduce the effect of collisions (Svenstrup et al. 2017), discussed e.g. in https://stackoverflow.com/q/53767469/1791279
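A rough illustration of why several hash functions help: if each token gets one bucket index per hash function, two distinct tokens are indistinguishable only when all of their indices match. The sketch below is a generic illustration of that idea, not necessarily the method of the cited paper. `hashed_indices` and `count_full_collisions` are hypothetical helpers, and seeded BLAKE2b from the standard library stands in for independent hash functions.

```python
import hashlib
from itertools import combinations


def hashed_indices(token, n_features, n_hashes=2):
    """Return one bucket index per (seeded) hash function for `token`."""
    data = token.encode("utf-8")
    indices = []
    for seed in range(n_hashes):
        # A different salt per seed emulates a different hash function.
        digest = hashlib.blake2b(
            data, digest_size=8, salt=seed.to_bytes(8, "little")
        ).digest()
        indices.append(int.from_bytes(digest, "little") % n_features)
    return tuple(indices)


def count_full_collisions(tokens, n_features, n_hashes):
    """Count pairs of distinct tokens whose index tuples are identical."""
    signatures = [hashed_indices(t, n_features, n_hashes) for t in tokens]
    return sum(a == b for a, b in combinations(signatures, 2))


tokens = [f"token{i}" for i in range(500)]
single = count_full_collisions(tokens, n_features=64, n_hashes=1)
double = count_full_collisions(tokens, n_features=64, n_hashes=2)
print(f"pairs colliding with 1 hash: {single}, with 2 hashes: {double}")
```

Because the first index is the same in both runs, the count with two hash functions can never exceed the count with one. The trade-off is that each token now touches several buckets, so hashing does proportionally more work.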

rth commented 5 years ago

Just to confirm: the choice of hash function has almost no impact on performance, as hashing is not the bottleneck.

Closing.