rurban / smhasher

Hash function quality and speed tests
https://rurban.github.io/smhasher/
Other
1.81k stars 174 forks source link

Add interfaces to popular Rust hash libraries #276

Open tgross35 opened 1 year ago

tgross35 commented 1 year ago

Fixes #69, #94

This adds bindings to Rust implementations of the following algorithms:

C interfaces are in rust-hashes/rust_hashes.h. You can build librust_hashes.a with cargo run --release (output is in target/release). I assumed here that *out is at least 512 bits, keccak256full_rstest is a special case that needs 1600 bits if run.

I am not sure how to tie these in to the rest of the system. If you could do that it would be great, or just tell me how.

rurban commented 1 year ago

Looks great, but I'll need a few days to integrate and test these. Still at home with covid for a few more days

tgross35 commented 1 year ago

Thanks for the update, sounds good. Hope you feel better soon.

tgross35 commented 1 year ago

I think I got it all linked up, the tests run now. Few things I wasn't sure about:

I plan to expose Rust's HashMap API so that can be tested directly rather than going via unordered_map but haven't gotten to that yet

Non GHA CI probably does not work yet, may need some help there.

rurban commented 1 year ago

See the branch tgross35-add-rust with my cmake cleanups. but still I get a linker error: (fc38)

/usr/lib64/ccache/g++ -std=c++11 -march=native -DRUST_ENABLED -DLTO -O3 -DNDEBUG -flto=auto -fno-fat-lto-objects CMakeFiles/SMHasher.dir/main.cpp.o -o SMHasher libSMHasherSupport.a -lhighwayhash -lahash_c /home/rurban/Software/smhasher/rust-hashes/target/release/librust_hashes.a /usr/bin/ld: /home/rurban/Software/smhasher/rust-hashes/target/release/librust_hashes.a(rust_hashes-db6e9d89cc853d1d.rust_hashes.372886951253229d-cgu.0.rcgu.o): in function rust_eh_personality': /builddir/build/BUILD/rustc-1.72.0-src/library/std/src/personality/gcc.rs:251: multiple definition ofrust_eh_personality'; /usr/local/lib/libahash_c.a(ahash_c-11c0a330212a2bf2.ahash_c.6drpysfk-cgu.1.rcgu.o):/builddir/build/BUILD/rustc-1.50.0-src/library/panic_unwind/src/gcc.rs:282: first defined here

rurban commented 1 year ago

How does verification and quality in HashInfo need to be set?

I"ll do this manually. verification can be tested via --verbose --test=VerifyAll

Does anything need to be done to indicate hash algorithms that do not use the seed? Or hashes that use a smaller/bigger seed?

So far we went with 32bit seeds only, as 64bit seeds are too hard to test. self-seeded hashes get verification 0. Generally all crypto hashes don't support a seed per se, so I added seed support to them manually. (mixing it into the IV).

Some of these tests have very large hashes, e.g. Keccak-256 CryptoNight, and are probably pretty slow. Is there a way to adjust the maximum timing of these hashes so as not to exceed limits, or disable them by default?

I adjusted some test values for these overly slow crypto hashes, or disable them completely when they bring nothing to the table. in our case Rust vs C++ would be interesting though, esp. when parallized.

Are MSAN/LeakSanitizer/ControlFlowIntegrity run at any point? Rust supports these (adding -Zsanitizer=msan to RUSTFLAGS during build) but only one at a time

Not yet, but should be done somewhen. I've only added asan, ubsan. I don't expect them to fail msan/cfi, but who knows. Quality was surprisingly mixed.

tgross35 commented 1 year ago

Thanks for the fixes, I just enabled LTO and fixed RUSTFLAGS since I it seemed like that was not getting set properly (diff).

Are you able to disable the ahash_c interface and see if that works? I think there are probably some Rust symbol conflicts between that staticlib and this one since they're currently getting linked at the same time

rurban commented 1 year ago

Yes it clashes only with the ahash_c. Maybe just add ahash to librust_hashes

tgross35 commented 1 year ago

Thanks for these detailed responses, I just saw them

Does anything need to be done to indicate hash algorithms that do not use the seed? Or hashes that use a smaller/bigger seed?

So far we went with 32bit seeds only, as 64bit seeds are too hard to test. self-seeded hashes get verification 0. Generally all crypto hashes don't support a seed per se, so I added seed support to them manually. (mixing it into the IV).

This makes it sound like there are tests that need to sweep the entire seed space? I am not too familiar with the test algorithms. All the hashes that take a u64 or larger seed currently use the provided seed and zero-extend it to the needed size, but I can change them to just ignore the seed if that is preferred.

Some of these tests have very large hashes, e.g. Keccak-256 CryptoNight, and are probably pretty slow. Is there a way to adjust the maximum timing of these hashes so as not to exceed limits, or disable them by default?

I adjusted some test values for these overly slow crypto hashes, or disable them completely when they bring nothing to the table. in our case Rust vs C++ would be interesting though, esp. when parallized.

I suspect the C++ implementations will win out much of the time :) Many of these Rust implementations come from the rustcrypto org which generally forbids handwritten asm - meaning it's trivial to get them to compile on obscure targets and is generally significantly less UB, but also more difficult to get that last 10% of possible performance.

Rust code does tend to optimize a bit better than C when neither have handwrittern instructions (near-ideal restrict keyword use definitely helps the compiler) so I am curious to see the results.

Are MSAN/LeakSanitizer/ControlFlowIntegrity run at any point? Rust supports these (adding -Zsanitizer=msan to RUSTFLAGS during build) but only one at a time

Not yet, but should be done somewhen. I've only added asan, ubsan. I don't expect them to fail msan/cfi, but who knows. Quality was surprisingly mixed.

Rust also has an interpreter miri that is used for UB detection and can detect some things that asan/msan can't. An experiment that runs smhasher's input suite on the rust algorithms via miri would be interesting.

Yes it clashes only with the ahash_c. Maybe just add ahash to librust_hashes

It is there currently, under the symbol ahash_rs. I see now that they use a different implementation for the seeds https://github.com/tkaitchuck/aHash/blob/master/smhasher/ahash-cbindings/src/lib.rs#L10, this implementation can be updated to match

rurban commented 1 year ago

The old ahash has much better seeds. The new one has tons of bad seeds. So either remove it, or match it.

rurban commented 1 year ago

Ad rust restrict: rust depends on llvm, and this still has a broken restrict implementation with small inlinable loops.

rurban commented 1 year ago

All the hashes that take a u64 or larger seed currently use the provided seed and zero-extend it to the needed size, but I can change them to just ignore the seed if that is preferred.

zero-extension is fine. no seed makes them almost unusable for us.

tgross35 commented 1 year ago

I adjusted this implementation to use the same invocation as the current ahash. Is repeating the seed preferred to zero extending? I don't understand the exact vector.

Seems to be some linking errors on Windows. 64 bit github CI can't find some .libs and 32 bit on appveyor can't find any symbols - think I just might need to have it regenerate the bindings on that target. There is also a duplicate symbol error on aarch64, seems like blake3 uses the C implementation only on that platform. Think maybe the easiest thing there is just to rename those symbols in rust's output staticlib.

rurban commented 1 year ago

Have a look at the results in the tgross35-add-rust branch. Most crypto branches have lots of failures. So far I haven't found anything good to keep

tgross35 commented 1 year ago

Great to have results, thank you for the update.

What does the "invalid hash bit width" failure mean? Is this the failure you were referring to, or did you mean algorithmic failures? I wonder if I could have hooked something up wrong, or if some of this would be different if I repeated the keys rather than zero-extending.

Also, that branch does not have the recent change to ahash that I added in 5c3057938cab20233b3f4a810b6380ab801b1727 to make it match the previous implementation.

tgross35 commented 1 year ago

The blake hashes also take some optional input variables, I just used defaults since I was not entirely sure how to use them. I can update the usage if you have suggestions

rurban commented 1 year ago

Great to have results, thank you for the update.

What does the "invalid hash bit width" failure mean? Is this the failure you were referring to, or did you mean algorithmic failures?

This is just the 123 typo. Will be tested in the next round. Testing is extremely slow on my home laptop.

I wonder if I could have hooked something up wrong

Also, that branch does not have the recent change to ahash that I added in 5c30579 to make it match the previous implementation.

Yes, that's missing. Should be the same as the standalone ahash. More worrying are the crypto troubles. I'll look into them later

rurban commented 1 year ago

I've tested the new ahash_rs, it"s even worse than the old. worse momentchi2, and 2x bad seeds. let's keep the old, please

tgross35 commented 12 months ago

I've tested the new ahash_rs, it"s even worse than the old. worse momentchi2, and 2x bad seeds. let's keep the old, please

That is weird, it should be doing the exact same thing. Do you know what version of the library it was testing against before?

I can run some of these on my home computer which should be faster, just haven't gotten the chance.

ogxd commented 10 months ago

Hello! Any progress on this? I have this PR pending for adding gxhash but I think it might be better to simply add it as a rust dependency as I've put a lot more effort into this rust version, and is available on crates.io

tgross35 commented 10 months ago

Hey @ogxd,

I haven't yet taken another look at this so no, there are no updates. I am happy to add gxhash to this list before it gets closer to being mergeable, but I don't think you need to hold off on adding the C implementation as well.