chris-ha458 opened 1 year ago
Thank you for the issue.

Rayon is undoubtedly a good idea for parallel iteration; it should probably be implemented behind a feature flag.

Do you mean the hashing algorithm for LRU caching in functions (there is a lot of hashing)?
Why do you think that sort_unstable() will be faster than sort()? According to the documentation it uses the quicksort algorithm (sort() uses mergesort). Mergesort has better time complexity than quicksort, doesn't it?
I tried replacing sort() with sort_unstable() and sort_by() with sort_unstable_by(). There is almost no difference in speed, at least for this performance test and the large_datasets benchmark. But unstable sorting can produce different results, so I'm not sure it makes sense to use it.
> Do you mean the hashing algorithm for LRU caching in functions (there is a lot of hashing)?
In Rust, hash maps themselves can use an alternative hashing algorithm for speed or other reasons. The default hasher is SipHash-1-3, which is DoS-resistant but not fast.
> Why do you think that sort_unstable() will be faster than sort()?

This is what the documentation says:

> When applicable, unstable sorting is preferred because it is generally faster than stable sorting and it doesn’t allocate auxiliary memory. See sort_unstable.
Also, sort() on Vec is timsort and sort_unstable() is pdqsort.

Unstable sort is only relevant if there are other indexes to consider. At least for the sorted keys themselves it should yield the same result.
If different results are undesirable, I think it would be a good idea to have a unit or integration test verifying the expected behavior.
OK, I will push sort_unstable to the repo. Concerning the hashing function: we need to try changing it. There are a lot of hash operations because of the LRU cache, so it could help!
Hm, it looks like I tested it wrongly; unstable sort actually speeds it up by 10-20%.
Yes, I understand. I plan to work on the hash function replacement at this time. Rayon will be implemented later; it's not a goal right now.

Originally posted by @nickspring in https://github.com/nickspring/charset-normalizer-rs/issues/8#issuecomment-1735576667
Some recommendations for hash functions:

```rust
use std::collections::HashMap;
use std::hash::BuildHasherDefault;
use highway::HighwayHasher;

let mut map = HashMap::with_hasher(BuildHasherDefault::<HighwayHasher>::default());
map.insert(1, 2);
assert_eq!(map.get(&1), Some(&2));
```
If I had to choose, I would choose HighwayHash. (No known hash vulnerabilities, unlike fxhash, ahash, xxhash, and wyhash. Much faster than any cryptographic hash and competitive with the "good enough" non-cryptographic hashes. Pure, idiomatic Rust implementation with SIMD intrinsic pathways.)
Actually, it's a good question. I tested the standard hasher on a big file and it was slow, so I added a blake3 hasher (I'm not sure we even need the fingerprint for comparison, as it is in the Python version; maybe we can just compare decoded_payload):
https://github.com/nickspring/charset-normalizer-rs/blob/4d9fd71bcfe456760cbcce6c0d1d82ab8e380b29/src/entity.rs#L178
It is an unstable hash, meaning there is no guarantee of identical hashes across platforms or versions. This might become a problem if/when serde support becomes a goal, but the way the repo is used currently it won't be a problem. We could implement HighwayHash or others when it does become a problem.

As such, in the short term, further hasher investigation would be lower priority. To improve from here we would have to implement benchmarks comparing siphash, fxhash, ahash, etc.
Hashing is used only for caching, so it doesn't matter that it's unstable. Also, we don't serialize hashes via serde. I chose ahash as it's already implemented in the cached library as a feature.
I did some experiments with Rayon. At this point it looks like it has high overhead. So I plan to rewrite the mess measurement code in a more efficient way (I have some ideas).
I was looking into lib.rs, at the two main loops, especially the named loops 'iana_encodings_loop and 'chunks_loop. Unfortunately, as written, it does not seem like they can easily be rewritten as functional Rayon code (if at all).
I agree that some other parts of the code are not good candidates for Rayon due to overhead. Hope you find a good approach!
I'm exploring integrating encoding_rs instead of rust-encoding.
rust-encoding seems to be abandoned and unmaintained. encoding_rs is better supported. However, they work in fundamentally different ways and encoding_rs does not support many current constructs.
It would obviously take some time.
> I'm exploring integrating encoding_rs instead of rust-encoding. rust-encoding seems to be abandoned and unmaintained. encoding_rs is better supported. However, they work in fundamentally different ways and encoding_rs does not support many current constructs. It would obviously take some time.
Yes, it doesn't fit the concept of this library.
It looks like the new mess detector approach is faster: #31
```
--> A) charset-normalizer-rs Conclusions
--> Accuracy: 97.1%
--> Total time: 1.147958363s
--> Avg time: 2.806743ms
--> 50th: 1.044498ms
--> 95th: 8.228578ms
--> 99th: 19.302748ms
------------------------------
--> B) chardet Conclusions
--> Faster than charset-normalizer-rs by 0.6 times
--> Accuracy: 82.6%
--> Total time: 2.083252561s
--> Avg time: 5.093526ms
--> 50th: 331.699µs
--> 95th: 2.515793ms
--> 99th: 16.465157ms
------------------------------
--> C) chardetng Conclusions
--> Faster than charset-normalizer-rs by 0.9 times
--> Accuracy: 90.7%
--> Total time: 1.253477517s
--> Avg time: 3.064736ms
--> 50th: 821.598µs
--> 95th: 8.578877ms
--> 99th: 24.627836ms
```
However, I broke a lot of idiomatic code in ms.rs. Tomorrow I will clean up the code (remove unused functions, rewrite tests for the new mess detector, refactor).
From the list, only Rayon remains. Unfortunately, the current implementation is streamlined enough that naive parallelization is not helpful.

If there are any ideas, let me know. In the meantime, feel free to close this issue. I think it would become relevant again when there is a major change in this repo or in any upstream dependencies (most of which are also highly optimized).
Yes, I have an idea: to refactor and speed up the cd.rs module. I think it's possible.
As per our discussion in #2, the following speed improvements have been suggested.

The paths I had in mind were these:

- Rayon
- DashMap (the current std HashMap implements rayon, so not strictly necessary, but might be useful to look into regardless)
- FxHash, AHash, HighwayHash
- sort() with sort_unstable() ~ #6
- Vec::new() with Vec::with_capacity()
Many of these are low-hanging fruit and relate to refactoring toward idiomatic Rust. For example, there are many for loops in this code. Iterator-based code is more idiomatic, easier to parallelize with rayon, and interacts better with allocation (pushing items from within a for loop can cause multiple allocations and copies, while collecting an iterator can allow fewer allocations).