nickspring / charset-normalizer-rs

Truly universal encoding detector in pure Rust - port of Python version
https://crates.io/crates/charset-normalizer-rs
MIT License

Improvements: Speed #3

Open chris-ha458 opened 1 year ago

chris-ha458 commented 1 year ago

As per our discussion in #2, the following has been suggested for speed improvements.

The paths I had in mind were these:

Many of these are low-hanging fruit related to refactoring the code into idiomatic Rust. For example, there are many for loops in this code. Iterator-based code is more idiomatic, easier to improve with rayon, and interacts better with allocation (pushing items from within a for loop can cause multiple allocations and copies, while collecting an iterator can allow fewer allocations).
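
For illustration, a minimal sketch of the pattern being described (hypothetical functions, not code from this repo):

    // For-loop style: each push may reallocate as the Vec grows.
    fn squares_loop(input: &[u32]) -> Vec<u32> {
        let mut out = Vec::new();
        for &x in input {
            out.push(x * x);
        }
        out
    }

    // Iterator style: collect() can use the iterator's size hint to
    // allocate once up front, and the chain is straightforward to
    // parallelize with rayon later.
    fn squares_iter(input: &[u32]) -> Vec<u32> {
        input.iter().map(|&x| x * x).collect()
    }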

nickspring commented 1 year ago

Thank you for the issue.

nickspring commented 1 year ago

I tried to replace sort() with sort_unstable() and sort_by() with sort_unstable_by(). There is almost no difference in speed, at least for this performance test and the large_datasets benchmark. But unstable sorting can produce different results, so I'm not sure it makes sense to use it.

chris-ha458 commented 1 year ago

> Do you mean the hashing algorithm for LRU caching in functions (there is a lot of hashing)?

In Rust, hashmaps themselves can use an alternative hashing method for speed or other reasons. The default hasher is SipHash 1-3, which is DoS-resistant but not fast.

> Why do you think that sort_unstable() will be faster than sort()?

This is what the documentation says:

> When applicable, unstable sorting is preferred because it is generally faster than stable sorting and it doesn’t allocate auxiliary memory. See sort_unstable.

Also, sort on Vec is timsort and sort_unstable is pdqsort.

Unstable sort is only relevant if there are other indexes to consider. At least for the sorted indexes it should yield the same result.

If different results are undesirable, I think it would be a good idea to have a unit or integration test to verify the expected behavior.
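
For example, a small sketch of such a test (with made-up values, not data from the repo): when the sort keys are fully ordered with no duplicates, stable and unstable sorts must agree.

    #[test]
    fn unstable_sort_matches_stable_sort() {
        let mut stable = vec![(0.9, "utf-8"), (0.3, "latin-1"), (0.7, "koi8-r")];
        let mut unstable = stable.clone();
        // Sort by the score; keys are unique, so the order is fully determined.
        stable.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap());
        unstable.sort_unstable_by(|a, b| a.0.partial_cmp(&b.0).unwrap());
        assert_eq!(stable, unstable);
    }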

nickspring commented 1 year ago

OK, I will push sort_unstable to the repo. Concerning the hashing function, we need to try changing it. There are a lot of hash operations because of the LRU cache, so it could help!

nickspring commented 1 year ago

Hm, it looks like I tested it incorrectly; unstable sort actually speeds things up by 10-20%.

chris-ha458 commented 1 year ago

> Yes, I understand. I plan to work on hash function replacement at this time. Rayon will be implemented later, not a goal right now.

Originally posted by @nickspring in https://github.com/nickspring/charset-normalizer-rs/issues/8#issuecomment-1735576667

Some recommendations for hash functions :

  1. FxHash: I don't particularly like it, but I thought I'd share it since this is often the first choice in this kind of situation.
  2. aHash: This is not stable (the hash values change from version to version) and only seeks to be the fastest.
  3. highway-rs: This is often overlooked, but it is very fast; not cryptographically tested, but it has gone through extensive testing and validation (original paper and SMHasher). It also has sample code showing how one might use it:
    
    use std::collections::HashMap;
    use std::hash::BuildHasherDefault;
    use highway::HighwayHasher;

    let mut map =
        HashMap::with_hasher(BuildHasherDefault::<HighwayHasher>::default());
    map.insert(1, 2);
    assert_eq!(map.get(&1), Some(&2));

If I had to choose, I would pick highwayhash. (No known hash vulnerabilities, unlike fxhash, ahash, xxhash, and wyhash. Much faster than any cryptographic hash and competitive with "good enough" non-cryptographic hashes. Pure, idiomatic Rust implementation with SIMD intrinsic pathways.)

nickspring commented 1 year ago

Actually, it's a good question. I tested the standard hasher on a big file and it was slow, so I added the blake3 hasher (I'm not sure we even need the fingerprint for comparison, as in the Python version; maybe we can just compare decoded_payload):
https://github.com/nickspring/charset-normalizer-rs/blob/4d9fd71bcfe456760cbcce6c0d1d82ab8e380b29/src/entity.rs#L178

chris-ha458 commented 1 year ago

#14 implements aHash. For pure speed with few other considerations, it is good enough.

It is an unstable hash, meaning there is no guarantee that hashes will be the same across platforms or versions. This could become a problem if/when serde support becomes a goal, but given how the repo is currently used, it won't be. We could implement highwayhash or another stable hasher if it does become a problem.

As such, in the short term, further hasher investigation is a lower priority. To improve from here, we would have to implement benchmarks comparing siphash, fxhash, ahash, etc.
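
A rough sketch of such a comparison (wall-clock only; a proper benchmark would use a harness like criterion, and the ahash dependency is assumed):

    use ahash::AHashMap;
    use std::collections::HashMap;
    use std::time::Instant;

    fn main() {
        let keys: Vec<String> = (0..100_000).map(|i| format!("key{i}")).collect();

        // Default hasher: SipHash 1-3, DoS-resistant but comparatively slow.
        let start = Instant::now();
        let mut sip: HashMap<&str, usize> = HashMap::new();
        for (i, k) in keys.iter().enumerate() {
            sip.insert(k, i);
        }
        println!("SipHash (default): {:?}", start.elapsed());

        // aHash: faster, but not stable across versions or platforms.
        let start = Instant::now();
        let mut fast: AHashMap<&str, usize> = AHashMap::new();
        for (i, k) in keys.iter().enumerate() {
            fast.insert(k, i);
        }
        println!("aHash: {:?}", start.elapsed());
    }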

nickspring commented 1 year ago

Hashing is used only for caching, so it doesn't matter that it's unstable. Also, we don't serialize hashes via serde. I chose ahash as it's already implemented as a feature in the cached library.
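
For reference, a minimal sketch of the memoization pattern the cached crate provides (the fib example mirrors the crate's documentation; with the ahash feature enabled, the cache's internal lookups use aHash instead of SipHash):

    use cached::proc_macro::cached;

    // Each call hashes the arguments to look up the memo store, which is
    // why the hasher choice shows up so heavily in profiles.
    #[cached]
    fn fib(n: u64) -> u64 {
        if n < 2 {
            return n;
        }
        fib(n - 1) + fib(n - 2)
    }

    fn main() {
        assert_eq!(fib(20), 6765);
    }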

nickspring commented 1 year ago

I did some experiments with Rayon. At this point it looks like it has high overhead, so instead I plan to rewrite the mess measurement code in a more efficient way (I have some ideas).
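
For context, a minimal sketch (hypothetical function, not repo code) of what a Rayon conversion of an iterator chain looks like; the per-item work has to outweigh the thread-pool dispatch overhead for it to pay off:

    use rayon::prelude::*;

    // Counts non-ASCII bytes across chunks in parallel. Worthwhile only
    // if each chunk is large enough to amortize Rayon's scheduling cost.
    fn non_ascii_count(chunks: &[Vec<u8>]) -> usize {
        chunks
            .par_iter()
            .map(|chunk| chunk.iter().filter(|&&b| b > 0x7f).count())
            .sum()
    }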

chris-ha458 commented 1 year ago

I was looking into the two main loops in lib.rs, especially the named loops 'iana_encodings_loop and 'chunks_loop.

Unfortunately, as they stand, it does not seem like they can easily be rewritten as functional Rayon code (if at all).

I agree that some other parts of the code are not good candidates for Rayon due to overhead. Hope you find a good approach!

chris-ha458 commented 1 year ago

I'm exploring integrating encoding_rs instead of rust-encoding.

rust-encoding seems to be abandoned and unmaintained. encoding_rs is better supported. However, they work in fundamentally different ways and encoding_rs does not support many current constructs.

It would obviously take some time.
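
For reference, a small sketch of the encoding_rs API style, which differs fundamentally from rust-encoding: a single decode call returns the decoded text, the encoding actually used, and a malformed-input flag.

    use encoding_rs::WINDOWS_1252;

    fn main() {
        let bytes = b"caf\xe9"; // "café" encoded as windows-1252
        let (text, encoding, had_errors) = WINDOWS_1252.decode(bytes);
        assert_eq!(text, "café");
        assert_eq!(encoding, WINDOWS_1252);
        assert!(!had_errors);
    }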

nickspring commented 1 year ago

> I'm exploring integrating encoding_rs instead of rust-encoding.
>
> rust-encoding seems to be abandoned and unmaintained. encoding_rs is better supported. However, they work in fundamentally different ways and encoding_rs does not support many current constructs.
>
> It would obviously take some time.

Yes, it doesn't fit the concept of this library.

nickspring commented 1 year ago

It looks like the new mess detector approach is faster (#31):

--> A) charset-normalizer-rs Conclusions
   --> Accuracy: 97.1%
   --> Total time: 1.147958363s
   --> Avg time: 2.806743ms
   --> 50th: 1.044498ms
   --> 95th: 8.228578ms
   --> 99th: 19.302748ms

------------------------------
--> B) chardet Conclusions
   --> Faster than charset-normalizer-rs by 0.6 times
   --> Accuracy: 82.6%
   --> Total time: 2.083252561s
   --> Avg time: 5.093526ms
   --> 50th: 331.699µs
   --> 95th: 2.515793ms
   --> 99th: 16.465157ms

------------------------------
--> C) chardetng Conclusions
   --> Faster than charset-normalizer-rs by 0.9 times
   --> Accuracy: 90.7%
   --> Total time: 1.253477517s
   --> Avg time: 3.064736ms
   --> 50th: 821.598µs
   --> 95th: 8.578877ms
   --> 99th: 24.627836ms

However, I broke a lot of idiomatic code in ms.rs. Tomorrow I will clean it up (remove unused functions, rewrite tests for the new mess detector, refactor the code).

chris-ha458 commented 1 year ago

From the list, only Rayon remains. Unfortunately, the current implementation is streamlined enough that naive parallelization is not helpful.

If any ideas come up, let me know. In the meantime, feel free to close this issue. I think it will become relevant again when there is a major change in this repo or in any upstream dependencies (most of which are also highly optimized).

nickspring commented 1 year ago

Yes, I have an idea: to refactor and speed up the cd.rs module. I think it's possible.