rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org

Review which hash algorithm to use for incremental compilation hashes. #41215

Closed michaelwoerister closed 7 years ago

michaelwoerister commented 7 years ago

For incremental compilation we are hashing lots of things: HIR, crate metadata, and soon also type checking tables. These hashes are used purely as fingerprints, that is, we compare them in order to find out whether the things that were hashed are equal. Consequently, the primary qualities we require from the hash algorithm are:

  1. That it acts as a pseudo-random function, i.e. whatever we feed in, the resulting hashes should have a collision probability that is roughly the same as that of random numbers.
  2. That it produces wide enough hash values so the collision probabilities are low enough (128 bits should be fine).
  3. That it is fast.
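As a back-of-the-envelope check on requirement (2), the birthday bound says that hashing `n` distinct values with an ideal hash of width `bits` collides with probability of roughly `n^2 / 2^(bits+1)`. The sketch below is illustrative only, not rustc code:

```rust
// Rough birthday-bound estimate for an ideal hash of width `bits`:
// hashing `n` distinct values collides with probability ~ n^2 / 2^(bits+1).
fn approx_collision_probability(n: f64, bits: i32) -> f64 {
    n * n / 2f64.powi(bits + 1)
}

fn main() {
    // Even a billion fingerprints stay far below any worrying
    // collision probability at 128 bits:
    let p128 = approx_collision_probability(1e9, 128);
    assert!(p128 < 1e-18);

    // At 64 bits, a billion fingerprints would already be risky:
    let p64 = approx_collision_probability(1e9, 64);
    assert!(p64 > 1e-2);

    println!("p128 = {:e}, p64 = {:e}", p128, p64);
}
```

At 128 bits, even a billion fingerprints give a collision probability around 10^-21, which is why 128 bits is treated here as comfortably wide, while 64 bits would be marginal.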

It remains to be clarified whether it makes sense to use a cryptographic hash function, i.e. one for which it is hard to find or construct collisions. My guess is no.

I'd be very interested though if someone could come up with an attack scenario. But even then the solution would probably be to implement some general code signing scheme for these caches.

At the moment we are using BLAKE2 as our hashing algorithm, which is a cryptographic hash function and thus fulfills requirements (1) and (2) very well. However, compared to non-cryptographic hash functions it is rather slow. For raw throughput, SipHash would be about three times as fast, and fast hash functions like MetroHash can be 20 times as fast. Before you get too excited though: most of the time spent computing hashes goes into gathering the data that is fed into the hash function, not into actually running the function on it. But there are still some gains to be had. Here are some preliminary tests of using BLAKE2 -- which is our "worst case" -- and MetroHash -- which is among the fastest hash functions providing 128 bit hashes:

| | BLAKE2 | MetroHash | SipHash 1-3 | SipHash 2-4 |
|---|---|---|---|---|
| libcore | 0.258s | 0.193s (-25%) | 0.189s (-26%) | 0.201s (-22%) |
| librustc | 0.435s | 0.345s (-20%) | 0.320s (-26%) | 0.327s (-25%) |
| syntex_syntax (HIR) | 0.221s | 0.166s (-25%) | 0.168s (-24%) | 0.176s (-20%) |
| syntex_syntax (Metadata) | 0.456s | 0.326s (-28%) | 0.312s (-31%) | 0.347s (-24%) |

So it seems that a fast hash function allows us to save 20-30% of the time spent during ICH computation. If we could get some confidence that such a function satisfies requirements (1) and (2), then we could have some free speedups here.

UPDATE: I added the 128 bit version of SipHash to the table and the timings are as good as the ones for MetroHash.

est31 commented 7 years ago

As another idea, if non-cryptographic hashes are allowed, we could use SHA-1 on platforms where it's hardware-accelerated (apparently Ryzen and newer, and Goldmont and newer). SHA-1 is slightly slower than BLAKE2, but I guess with hardware acceleration it will be much faster, maybe faster than a MetroHash implemented with generic instructions. It would be interesting to have benchmarks on this.

michaelwoerister commented 7 years ago

That's an interesting idea. However, I'd need to see actual numbers for that and they would have to be very compelling in order to justify adding platform-specific code. Also, once we go down the platform-specific road: MetroHash can also make use of some special instructions, almost doubling its speed.

michaelwoerister commented 7 years ago

A 128 bit SipHash13 might also be a good contender. It's pretty fast and it's designed to be a good PRF.

arthurprs commented 7 years ago

Siphash is probably a sensible choice here if the numbers look good. Metrohash is great but I think we should try to be conservative to some extent.

michaelwoerister commented 7 years ago

@rust-lang/libs: There are no plans of making the 128-bit version of SipHash available in the standard library, right?

sfackler commented 7 years ago

Not in a stable manner, no, but that shouldn't matter for compiler internals.

est31 commented 7 years ago

Also, afaik it's possible to simply use crates from crates.io now. Just add them to the Cargo.toml and commit both that change and the change to Cargo.lock into git.

michaelwoerister commented 7 years ago

I added the 128 bit version of SipHash to the table and the timings are as good as the ones for MetroHash.

arthurprs commented 7 years ago

Sounds like a clear winner to me. Is that the 1-3 or the 2-4 variant?

michaelwoerister commented 7 years ago

It's 1-3 ...

michaelwoerister commented 7 years ago

I added the SipHash 2-4 timings too. As expected they are slightly worse but still pretty good.

briansmith commented 7 years ago

> That it produces wide enough hash values so the collision probabilities are low enough (128 bits should be fine).

NIT: It doesn't make sense to measure hashes by the number of bits this way, because the number of bits doesn't correlate strongly enough with the desired collision probability. For example, `SHA-1(x) || "123456789012"` is a 256-bit hash but it isn't nearly as strong as SHA-256.

Instead, it would be good to mention the desired collision probability. Also, what's the threat model? I.e. is there any way that something could go wrong if there's a collision? AFAICT a collision could easily be disastrous.

Also, I don't think the use of non-Rust implementations should be rejected, since a huge amount of the compiler isn't written in Rust already. I would rather the compiler use a safer algorithm written in assembly language than a weaker algorithm that's implemented in Rust.
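The nit about nominal width can be illustrated with a toy experiment (illustrative code only; `core8` is a made-up 8-bit mixer, not a real hash function): padding a narrow hash out to more bits does not change how often it collides.

```rust
use std::collections::HashSet;

// Toy 8-bit "core" hash (made up for illustration; not a real hash function).
fn core8(x: u64) -> u8 {
    (x.wrapping_mul(0x9E37_79B9_7F4A_7C15) >> 56) as u8
}

// Nominally 16 bits wide, but the extra 8 bits are a constant, so the
// collision behaviour is exactly that of the 8-bit core.
fn padded16(x: u64) -> u16 {
    ((core8(x) as u16) << 8) | 0x12
}

// Count how many of the inputs 0..n collide with an earlier input.
fn count_collisions(h: impl Fn(u64) -> u64, n: u64) -> u64 {
    let mut seen = HashSet::new();
    (0..n).filter(|&x| !seen.insert(h(x))).count() as u64
}

fn main() {
    let n = 1000;
    let core = count_collisions(|x| core8(x) as u64, n);
    let padded = count_collisions(|x| padded16(x) as u64, n);
    // Widening the output with constant bits adds no collision resistance:
    assert_eq!(core, padded);
    // With only 256 possible core values, at least 744 of 1000 inputs collide:
    assert!(core >= 744);
    println!("collisions out of {}: core8 = {}, padded16 = {}", n, core, padded);
}
```

The nominal width only matters if the hash function actually mixes its input into all of those bits, which is the assumption discussed below.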

michaelwoerister commented 7 years ago

> NIT: It doesn't make sense to measure hashes by the number of bits this way, because the number of bits doesn't correlate strongly enough with the desired collision probability. For example, `SHA-1(x) || "123456789012"` is a 256-bit hash but it isn't nearly as strong as SHA-256.

I'm working under the assumption that a high-quality hash function will use all bits available.

> Also, what's the threat model? I.e. is there any way that something could go wrong if there's a collision? AFAICT a collision could easily be disastrous.

A collision would result in some piece of data not being detected as having changed. I would not talk about a "threat model" though. Cache files are not meant to be distributed. An attacker would be better off just manipulating rlibs directly.

> Also, I don't think the use of non-Rust implementations should be rejected, since a huge amount of the compiler isn't written in Rust already. I would rather the compiler use a safer algorithm written in assembly language than a weaker algorithm that's implemented in Rust.

Nobody said it has to be written in Rust. But maintenance cost is a factor too.

nagisa commented 7 years ago

SHA-256 based on the dedicated SHA instructions can do 1.7GB/s on Ryzen. Probably more for SHA-1, but I didn’t benchmark it yet. AVX2 implementation of Fletcher4 that ZFS uses for checksumming does 15GB/s on the same machine. That same Fletcher4 is approximately 2GB/s for plain-old C version.

(EDIT: of course Fletcher4 is good for checksumming and does not really work well for avoiding collisions)

arthurprs commented 7 years ago

Any reason NOT to use siphash2-4 on this?

It's gonna give rustc 90% of the speed benefit, and it'll leave little to no questions regarding maintaining platform-specific code, hashing quality, or security (the last is a weak argument, but it exists anyway).

michaelwoerister commented 7 years ago

> Any reason NOT to use siphash2-4 on this?

It seems like an improvement over what we have now (which does not preclude that there are still better options). Leaving security aside, is there any reason NOT to use siphash1-3? Does hashing quality degrade with fewer rounds?

arthurprs commented 7 years ago

Apparently there's a slight loss of quality on smhasher (the avalanche test's worst bias goes from 0.72% (2-4) to 0.9% (1-3)); not sure if it's significant.

source: https://github.com/rurban/smhasher

michaelwoerister commented 7 years ago

@alexcrichton What's the state of using crates.io crates in the compiler? Is that something we can do easily now? I guess that we have to at least vendor each crate we are using, right?

Mark-Simulacrum commented 7 years ago

As far as I know, crates within the compiler itself can just be depended upon and everything should just work. I think crates that are depended upon by libstd or its dependencies can't do that yet, but I'm not sure.

alexcrichton commented 7 years ago

@michaelwoerister as of https://github.com/rust-lang/rust/pull/41847 we should be able to just depend on crates.io crates and have things like stability, vendoring, license checking, etc, all just fall out. Unfortunately really leveraging https://github.com/rust-lang/rust/pull/41847 will require a new stage0 compiler which supports that flag.

michaelwoerister commented 7 years ago

Thanks for the info @Mark-Simulacrum and @alexcrichton!

> Unfortunately really leveraging #41847 will require a new stage0 compiler which supports that flag.

So this will be available when the next beta comes out in a couple of weeks? That's soon enough, there's no rush with this issue.

alexcrichton commented 7 years ago

June 8!

joshtriplett commented 7 years ago

Has anyone tested seahash? It's several times faster than SipHash and FNV, and has a native Rust implementation. Trying it should be easy given crates.io crate support.

This doesn't need a cryptographically strong hash, and seahash should beat even a hardware-accelerated SHA-256. (Seahash itself appears designed to use hardware-accelerated vectorization where available, while remaining portable.)

michaelwoerister commented 7 years ago

Looks promising. @ticki, is there also a version of seahash that provides 128 bit outputs (or more)?

est31 commented 7 years ago

One way to get 128 bits of output would be to do something like this in the finalize step instead of the "xor them all together":

```rust
a ^= c;
c ^= d;
b = a;
d = c;
a ^= written;
a = helper::diffuse(a);
c = helper::diffuse(c);
a ^= c;
c ^= b;
a = helper::diffuse(a);
c = helper::diffuse(c);
return (a, c);
```

But maybe @ticki has a better idea.

michaelwoerister commented 7 years ago

It's time to test again because the freshly merged red/green tracking system does quite a bit more hashing than the previous system. @Mark-Simulacrum was so nice as to run some try-branches through perf.rlo. Here are the results (numbers are "instructions executed" compared to the BLAKE2 implementation):

| TEST CASE | SipHash128 (2-4) | MetroHash128 |
|---|---|---|
| futures-rs-test-all | -0.5% | -0.5% |
| helloworld | -1.0% | -1.0% |
| html5ever-2016-08-25 | -0.4% | -0.6% |
| hyper.0.5.0 | -0.6% | -0.7% |
| inflate-0.1.0 | -0.1% | -0.2% |
| issue-20936-deep-vector | -1.1% | -1.6% |
| issue-31157-compilation-time-r... | -0.8% | -1.0% |
| issue-32062-equality-relations... | -0.6% | -0.5% |
| issue-32278-big-array-of-strin... | -1.1% | -1.8% |
| issue-43572-unused-uses | -5.9% | -6.5% |
| jld-day15-parser | -0.3% | -0.3% |
| piston-image-0.10.3 | -0.5% | -0.6% |
| regex-0.1.80@010-baseline | -0.2% | -0.2% |
| regex-0.1.80@020-incr-from-scr... | -4.9% | -6.1% |
| regex-0.1.80@030-compile_one | -5.3% | -6.5% |
| regex-0.1.80@040-is_validcap... | -5.8% | -7.1% |
| regex-0.1.80@050-expand | -4.5% | -5.5% |
| regex-0.1.80@060-Compiler-new | -4.2% | -5.3% |
| regex-0.1.80@070-BYTE_FREQUENC... | -5.6% | -6.9% |
| regex-0.1.80@080-SparseSet | -2.4% | -3.1% |
| regex-0.1.80@090-Job | -2.1% | -2.4% |
| regex-0.1.80@100-incr-no-chang... | -4.9% | -6.0% |
| regex-opt-0.1.80@010-baseline | -0.1% | -0.2% |
| regex-opt-0.1.80@020-incr-from... | -2.0% | -2.5% |
| regex-opt-0.1.80@030-compile_o... | -3.7% | -4.5% |
| regex-opt-0.1.80@040-isvalid... | -5.5% | -6.7% |
| regex-opt-0.1.80@050-expand | -5.5% | -6.8% |
| regex-opt-0.1.80@060-Compiler-... | -3.7% | -4.6% |
| regex-opt-0.1.80@070-BYTE_FREQ... | -5.2% | -6.3% |
| regex-opt-0.1.80@080-SparseSet | -1.0% | -1.3% |
| regex-opt-0.1.80@090-Job | -1.0% | -1.2% |
| regex-opt-0.1.80@100-incr-no-c... | -4.6% | -5.6% |
| regex-opt.0.1.30 | -0.1% | -0.0% |
| regex.0.1.30 | -0.6% | -0.8% |
| rust-encoding-0.3.0 | -0.5% | -0.5% |
| syntex-0.42.2@000-base | -0.8% | -1.0% |
| syntex-0.42.2@010-incr | -12.4% | -15.9% |
| syntex-0.42.2@020-clean | -6.5% | -8.1% |
| tokio-webpush-simple@000-base | -0.7% | -0.7% |
| tokio-webpush-simple@010-incr | -6.9% | -8.6% |
| tokio-webpush-simple@020-clean | -5.3% | -6.5% |
| tokio-webpush-simple@030-minor... | -4.4% | -5.6% |
| tuple-stress | -1.4% | -2.2% |

(see https://github.com/rust-lang/rust/pull/45028#issuecomment-334396030 and https://github.com/rust-lang/rust/pull/45026#issuecomment-334396595 for links to the raw data)

MetroHash is always faster than SipHash, and both are always faster than BLAKE2. My conclusion is that we should just use MetroHash: it's fast, it seems to provide an excellent collision probability, and it cannot be misinterpreted as being cryptographically secure.

Also note that I've tested various hash algorithm implementations from crates.io in microbenchmarks, and @arthurprs' MetroHash was the fastest, beating SpookyHash, FarmHash, and even SeaHash, XXHash, and t1ha (even though it only provides 64-bit hashes out of the box).

arielb1 commented 7 years ago

If this is really correctness-critical, I like the cryptographic guarantees a cryptographic hash gives us and would hate to have correctness possibly be compromised by hash misbehavior.

michaelwoerister commented 7 years ago

@arielb1 Would you be comfortable with using SipHash? It is performing very well too, and since it is supposed to combat exploits that are based on hash collisions, the bar for its output quality is pretty high. Also, I assume that it has had lots of eyes on it, since it is so widely used.

arthurprs commented 7 years ago

I'm sure this was answered somewhere already, but what happens if hashes collide?

michaelwoerister commented 7 years ago

> I'm sure this was answered somewhere already, but what happens if hashes collide?

Then the compiler would potentially not re-run some computation although it should. The consequences of that depend on the computation/hash in question. The absolute worst case would be that an object file gets re-used although it shouldn't be, and you silently end up with the previous machine-code version of the functions in there instead of the current one.
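The failure mode can be sketched roughly like this (hypothetical names; this is not rustc's actual query machinery, and `DefaultHasher` merely stands in for the 128-bit fingerprint):

```rust
use std::collections::HashMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in fingerprint; rustc would use a 128-bit hash here.
fn fingerprint<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

// Re-run `compute` only when the input's fingerprint changed.
fn get_or_recompute(
    cache: &mut HashMap<String, (u64, String)>,
    key: &str,
    input: &str,
    compute: impl Fn(&str) -> String,
) -> String {
    let fp = fingerprint(&input);
    if let Some((stored_fp, cached)) = cache.get(key) {
        if *stored_fp == fp {
            // Fingerprints match: reuse the cached result. If a *different*
            // input ever hashed to the same fingerprint (a collision), this
            // branch would silently serve a stale result.
            return cached.clone();
        }
    }
    let result = compute(input);
    cache.insert(key.to_string(), (fp, result.clone()));
    result
}

fn main() {
    let mut cache = HashMap::new();

    // First run computes and caches.
    assert_eq!(get_or_recompute(&mut cache, "k", "input", |s| s.to_uppercase()), "INPUT");

    // Same input, same fingerprint: the closure is not re-run.
    assert_eq!(get_or_recompute(&mut cache, "k", "input", |_| "stale?".to_string()), "INPUT");

    // Changed input, changed fingerprint: recomputed.
    assert_eq!(get_or_recompute(&mut cache, "k", "other", |s| s.to_uppercase()), "OTHER");
}
```

The point is that there is no fallback check: fingerprint equality is the only thing standing between the cache and reuse, which is why collision probability matters so much here.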

est31 commented 7 years ago

> If this is really correctness-critical, I like the cryptographic guarantees a cryptographic hash gives us and would hate to have correctness possibly be compromised by hash misbehavior.

There are two cases here where collisions could occur theoretically:

One is security related: e.g. someone creates some code with a given hash, and it gets put into a different function that has the same hash, and this happens on purpose. It seems very hard to pull this off. I guess many other ways of attack are way easier (like creating a compiler plugin and then using unsafe to get access to compiler-managed memory).

The second is as a general source of bugs. These collisions might in theory also happen if you are very unlucky. However, I doubt that you'd get collisions "by chance" given the 128 bits.

So generally I don't think the hash is correctness-critical.

michaelwoerister commented 7 years ago

> One is security related where e.g. someone creates some code with a given hash and it now gets put into a different function as well that has the same hash.

It's hard to imagine how that would work in practice. A potential attacker could try to inject some code that has the same hash as some other piece of code that is already in your codebase. But the effect of the hash collision would be that the injected piece of code would not get used since the difference between the two pieces of code is not detected :)

nagisa commented 7 years ago

Could SHA256 be considered if the host machine can hardware-accelerate it? Or is it necessary to only ever use one hash algorithm, for some interop reason?


michaelwoerister commented 7 years ago

@nagisa I don't know of a use case where it makes sense to move an incr. comp. cache from one machine to another. I personally would prefer to have just one implementation to maintain but if a hardware accelerated SHA256 was significantly faster than SipHash, then in theory it could be done. Do you have numbers on how fast SHA256 can get?

nagisa commented 7 years ago

@michaelwoerister I posted https://github.com/rust-lang/rust/issues/41215#issuecomment-302394303 earlier in the thread. The comparison there is against Fletcher4, which is not very useful here, so I ran a similar test using std::hash::SipHasher24 in an environment as close as possible to the one I remember using for the initial test. SipHasher manages approximately 2.33GB/s, so SHA doesn't seem to be faster.

arthurprs commented 7 years ago

I think 128-bit SipHash is the way to go here; there are surely faster hashers, but the returns are very small for this workload (the MetroHash results sort of prove that). No amount of bits can guarantee no collisions :woman_shrugging: and as far as I understand there's no need for crypto-hash security anyway.

Also, I don't feel comfortable using MetroHash or (insert your fav. fast hash here) in places other than hashtables.

arielb1 commented 7 years ago

@michaelwoerister

My worry was more that a specially-crafted piece of "code" (e.g. string data included from some outside source) could cause MetroHash to get into a bad state in which "particular" collisions are "easy". I can't think of a simple way to extend this into an exploit, but stranger things have happened.

I think that SipHash with a per-incremental-directory random key should be, if not something I'm 100% comfortable with, harder to exploit - exploiting it would require finding the SipHash key (which is not that easy) and crafting data based on that. While I can imagine some online situations in which that could be exploitable, this decreases the attack surface.
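A minimal sketch of the "random key" idea using only the standard library (`RandomState` draws a fresh random SipHash key per instance; a per-incremental-directory key would be stored alongside the cache instead, but the mechanics are analogous):

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

// Fingerprint `value` under the random key carried by `state`.
fn keyed_fingerprint<T: Hash>(state: &RandomState, value: &T) -> u64 {
    let mut hasher = state.build_hasher();
    value.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let state = RandomState::new();

    // Deterministic under one key: the same cache directory can compare
    // fingerprints across compilation sessions as long as the key is kept.
    assert_eq!(
        keyed_fingerprint(&state, &"some hir"),
        keyed_fingerprint(&state, &"some hir")
    );

    // A different key yields (almost certainly) different fingerprints,
    // so collisions crafted against one key don't transfer to another.
    let other = RandomState::new();
    assert_ne!(
        keyed_fingerprint(&state, &"some hir"),
        keyed_fingerprint(&other, &"some hir")
    );
}
```

An attacker who cannot read the key cannot precompute colliding inputs offline, which is exactly the attack-surface reduction described above.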

arielb1 commented 7 years ago

From an academic cryptographer POV, I can hardly call this scheme cryptographically secure. I would consider it a nice opportunity to try and find a contrived way to "break this" if you control all the neighboring degrees of freedom.

Actually, all 128-bit schemes expose an attack with complexity about 2^64 - if there's a "data blob" an attacker can manipulate, he can "birthday attack" his way to a collision and cause some specific target to not be updated. I weakly suspect that sip2-4 has known-key collision attacks with better complexity (it wasn't that well analyzed for collision resistance). 2^64 is already "way too close to be comfortable" (e.g. the bitcoin network computes about that amount of hashes every second), especially 10 years from now.

From a practical standpoint, I think keeping the SIP key reasonably secret would make "generic" attacks more complicated while I still can't see a good way to get a "theoretical" level of security.

joshtriplett commented 7 years ago

Would seahash make sense here? It's quite fast, and collision-resistant.

I'm curious about the microbenchmarks used to evaluate it relative to MetroHash, given that seahash's own benchmarks seem to disagree with that result. Might be worth double-checking that with both the seahash and metrohash developers.

michaelwoerister commented 7 years ago

> From an academic cryptographer POV, I can hardly call this scheme cryptographically secure.

That's what I've been saying all along: none of this is supposed to be cryptographically secure. Using SipHash or even BLAKE2 would not change that.

michaelwoerister commented 7 years ago

@joshtriplett This is the benchmark I used:

```rust
#![allow(non_snake_case)]
#![feature(test)]

extern crate test;
extern crate rand;

extern crate metrohash;
extern crate fasthash;

use std::hash::{Hash, Hasher};
use rand::Rng;
use metrohash::MetroHash128;
use fasthash::sea::SeaHasher64 as fasthash_SeaHasher64;

fn create_unique_values(count: usize) -> Vec<u8> {
    let mut rng = rand::thread_rng();
    let mut values = Vec::with_capacity(count);
    for _ in 0..count {
        values.push(rng.gen())
    }

    values
}

fn run_bench<H: Hasher + Default>(bytes: usize, bh: &mut test::Bencher) {
    let values = create_unique_values(bytes);
    bh.iter(|| {
        let mut hasher = H::default();
        values.hash(&mut hasher);
        test::black_box(hasher.finish());
    })
}

// This is roughly the average amount of data that we hash at a time in the Rust compiler
const BYTE_COUNT: usize = 400;

#[bench]
fn metrohash128(bh: &mut test::Bencher) {
    run_bench::<MetroHash128>(BYTE_COUNT, bh);
}

#[bench]
fn fasthash_SeaHasher64(bh: &mut test::Bencher) {
    run_bench::<fasthash_SeaHasher64>(BYTE_COUNT, bh);
}
```

I get these results:

```
test bench::fasthash_SeaHasher64        ... bench:          67 ns/iter (+/- 3)
test bench::metrohash128                ... bench:          58 ns/iter (+/- 2)
```

TBH, I would be less comfortable with using SeaHash than with using MetroHash since SeaHash is even newer. There's no doubt that SeaHash is a great hash function but for this use case I prefer something more boring :)

michaelwoerister commented 7 years ago

@nagisa Thanks for those numbers! Looks like SipHash is the better choice then.