Language vs. implementation threat models and implications for TypeId collision resistance

rust-lang / rust

Empowering everyone to build reliable and efficient software.

https://www.rust-lang.org

Other

96.8k stars 12.5k forks source link

Language vs. implementation threat models and implications for TypeId collision resistance #129030

Open the8472 opened 1 month ago

the8472 commented 1 month ago

From #129014 and #10389

If you instead want to argue that the lang team should change its mind, please open a new issue and gather arguments in favor of that position, so that a summary of all the arguments for either option can be brought to the lang team for discussion.

I think we should take a step back and take a look at the bigger picture. I have tried to get people to spell out their threat model so that we can apply it as a razor to close/postpone all bug reports that are outside that model rather than making incoherent attempts to fix some things but leaving similar or more easily exploitable things open for the foreseeable future.

Previously the lang-team decided that it would be sufficient to use a 256bit cryptographic hash. If we assume – as #129014 does – that that statement was also meant as a lower bound required to satisfy T-lang's threat model then this has implications about what kind of things a language implementation should defend against. But afaict the rationale was not documented, so it remains unclear exactly against which threats an implementation should defend.

On the other hand it has been said in several places – e.g. in the context of build script sandboxing ¹ ², I-unsound issues, LLVM bugs and possibly incremental compilation – that rustc currently does not and cannot promise to be robust against malicious inputs.

So it seems like things are underspecified and there's tension between what the language should ideally offer and what implementations can offer. This in turn raises the question how the gap between the upper bound of what the spec allows/wants vs. what implementations offer should be handled.

It's also unclear to me to which extent this is a lang vs. a compiler decision.

Now to the concrete case of TypeId:

At first blush it seems that as long as the compiler does not try to resist malicious inputs on many fronts (proc macros, build.rs, compiler bugs, language bugs, llvm bugs) it does not make much sense that an actual TypeId implementation should spend much effort defending against that. And given the de-facto security levels that the compiler can offer a 128bit hash and only accounting for non-malicious inputs should remain sufficient for the moment, which would also be consistent with previous T-compiler decisions.

There might be some arguments that TypeId is different from other holes, e.g. as @briansmith tried here to gesture at some difficulty to detect exploits but there were counterpoints regarding the reliability and the possibility of such a scenario and we failed to reach agreement in subsequent discussion.

Imo the question What makes TypeId sit on one side of the security fence while other things sit on the other side? remains unanswered.

Perhaps it would be better if on the language level it is specified as globally unique value/exact comparison without mandating a particular implementation. That would leave the probabilistic reasoning, choice of comparison method etc. to the compiler. It could then decide to switch to a stronger implementation when it makes sense along with other hardening efforts.

RalfJung commented 1 month ago

In https://github.com/rust-lang/rust/issues/129016, we ask the concrete question of what the soundness expectations are for incremental compilation. It sounds like this issue tries to do the same but for at least half a dozen complicated discussions all at once. I find this much easier to discuss in concrete cases than in the super-abstract. Security is a trade-off, and specifically for TypeId if there are no noticeable downsides to achieving better collision resistance I see no reason not to do it.

For this issue here, I am not entirely sure what the question is. "T-lang, please define a threat model" is too open-ended to be a useful question to nominate -- when exploring wide-open design decisions, the usual process would be to schedule a t-lang design meeting, and the meeting document should sketch out the design space with at least one concrete proposal for how to answer the question. Given that you are asking for some very over-arching decision that includes everything from cargo implementation issues to compiler bugs, the document should spell out how a concrete threat model answers all these questions. Speaking of cargo, some of these decisions aren't even for t-lang to make, so this sounds like it would need a lot of cross-team coordination.

Personally I don't think we gain much from tying together the discussion around hash collisions in the compiler with build script sandboxing. That just makes it even harder than it already is to make progress on any one of these issues.

You are conflating a whole bunch of IMO unrelated things under the umbrella of "tension between what the language should ideally offer and what implementations can offer". However, "what the language should ideally offer" is extremely vague, and the sense in which the language "should" (possibly) offer build script sandboxing is very different from the sense in which rustc "should" not have I-unsound issues. There's a reason the lack of sandboxing is not labeled I-unsound: no language promise is broken by a build script that does odd things on your system, the Rust Abstract Machine remains entirely intact. (That won't help a developer whose system gets attacked that way, of course. But it is how we categorize these kinds of bugs in Rust and it has served us well, so I see no reason for suddenly fundamentally changing gears here.)

So as a first step to equip this issue with something like a reasonable scope I think it should focus on language-level promises and not drag in all potential security issues ever -- that kind of unscoped discussion has a very low chance of leading anywhere.

Now to the concrete case of TypeId:

I am still waiting to hear a single argument against just fixing the underlying problem here. In which world is it better to keep the current scheme instead of a collision-resistant hash function, no matter the threat model? If the proposed fix had significant downsides, that would be a different matter, but it doesn't. We can have philosophical discussions about security fences all day long, or we can just fix the issue...

For other cases there's much more of a trade-off, i.e. it seems clear that we can't "just" switch incremental compilation to use a collision-resistant hash. Similarly, for the questions around dlopen there are yet different tradeoffs. But the concrete costs and benefits depend heavily on the individual questions, and it seems rather hopeless to try to define a single general principle that will magically answer all of these questions. A case-by-case view (with suitable regard for related issues to avoid an incoherent result) works very well for us for other language design questions, I think it can work here as well.

Noratrieb commented 1 month ago

People like storing type IDs, so increasing the size again from 128 but to 256 bit would be an increase in memory usage and decrease in performance, which is a disadvantage

the8472 commented 1 month ago

A case-by-case view (with suitable regard for related issues to avoid an incoherent result) works very well for us for other language design questions, I think it can work here as well.

I think this suitable regard has not been applied here. Saying that cryptographic security is necessary to defend against malicious inputs while at the same time exploitable LLVM bugs are not a security issues seems inconsistent to me. There might be some tradeoff that implies a nuanced threat model underpinning that decision, but it was not communicated.

It just means the hypothetical well-resourced attacker would shift his attention from looking for hash collisions to fuzzing to find something he can use and spending resources on fixing one thing does not improve our overall security situation.

I am still waiting to hear a single argument against just fixing the underlying problem here. In which world is it better to keep the current scheme instead of a collision-resistant hash function, no matter the threat model? If the proposed fix had significant downsides, that would be a different matter, but it doesn't. We can have philosophical discussions about security fences all day long, or we can just fix the issue...

AIUI one of the proposed fixes is storing all the type names in a binary. This will increase binary sizes which is a real issue for embedded systems. If there's a fix that's free in terms of the produced output then I agree that it's not productive to argue further.

Personally I don't think we gain much from tying together the discussion around hash collisions in the compiler with build script sandboxing. That just makes it even harder than it already is to make progress on any one of these issues.

If they're outside the threat model then there's no point in making progress on any of them for the sake of security. They can still be valuable for e.g. reproducible builds and downloading prebuilt binary libraries, but if the only argument for them is to defend against threats we don't intend to defend against then that's wasted effort until there's a project goal to expand the threat model in a coherent manner.

For example consider the incremental compilation case. If we say that we don't consider malicious inputs there for performance reasons but do consider malicious inputs in release builds then what does that tell us against which threats we want and don't want to defend? Is the argument that compromising developer machines is worth the tradeoff while for release builds distributed to many people isn't? Should we then warn against using incremental = true in release profiles?

There's a reason the lack of sandboxing is not labeled I-unsound: no language promise is broken by a build script that does odd things on your system, the Rust Abstract Machine remains entirely intact. (That won't help a developer whose system gets attacked that way, of course. But it is how we categorize these kinds of bugs in Rust and it has served us well, so I see no reason for suddenly fundamentally changing gears here.)

I don't think soundness issues fall under the security policy either. Ok perhaps this is an angle we haven't discussed explicitly before: If T-lang says that TypeId must have at least cryptographic strength, then does that not imply we're trying to defend against malicious inputs here, and therefore it becomes a security policy issue rather than being only an I-unsound issue?

Or is it a project goal (Project Goal?) to become secure against malicious inputs in the future and so work should be done in that direction even though we don't promise that yet? But if that were the case then shouldn't it be discussed explicitly, especially considering some of those things are seemingly unfixable?

RalfJung commented 1 month ago

People like storing type IDs, so increasing the size again from 128 but to 256 bit would be an increase in memory usage and decrease in performance, which is a disadvantage

There are no plans to increase the size of TypeId. See https://github.com/rust-lang/rust/issues/129014 for details.

Saying that cryptographic security is necessary to defend against malicious inputs while at the same time exploitable LLVM bugs are not a security issues seems inconsistent to me.

It seems rather odd to me to say "you can't fix that soundness issue unless you also fix that other soundness issue".

We're constantly fixing non-soundness issues even though critical LLVM issues exist. There's nothing inconsistent about that. The same applies to https://github.com/rust-lang/rust/issues/129014 which is a soundness issue or just a bug, depending on whom you ask.

It just means the hypothetical well-resourced attacker would shift his attention from looking for hash collisions to fuzzing to find something he can use and spending resources on fixing one thing does not improve our overall security situation.

Not fixing issues we can fix certainly does not improve our overall security either.

This isn't a zero-sum game, we'll fix the issues we know we can fix now and slowly work towards fixing the ones that can't be fixed yet -- and we hope at some point LLVM will fix the ones that we can't fix in Rust. (Except that going in endless circles debating whether we need a "threat model" and what it should be surely takes away resources that could also be used to fix soundness issues...)

AIUI one of the proposed fixes is storing all the type names in a binary. This will increase binary sizes which is a real issue for embedded systems. If there's a fix that's free in terms of the produced output then I agree that it's not productive to argue further.

First of all, it'd only be the names of types that actually have their TypeId generated. The binary size impact is to be determined. Second, type names can be compressed, either losslessly or lossy (via a hash function).

But anyway, that's what #129014 is about, not this issue.

If they're outside the threat model then there's no point in making progress on any of them for the sake of security.

It seems odd to me that we wouldn't fix an issue we know how to fix, no mater what our "security policy" or "threat model" or any other document like that would say. Obviously it's a tradeoff, if the cost is too high we'll have to decide whether it's worth it, but no "security policy" will help us make that tradeoff -- the costs and benefits of both options need to be weighed in each individual case.

I don't think the "threat model" approach with a black-and-white distinction between "in scope" and "out of scope" is obviously applicable to our setting, nor do I think it is nearly as useful here as it is in other situations.

the8472 commented 1 month ago

It seems rather odd to me to say "you can't fix that soundness issue unless you also fix that other soundness issue".

The only difference between a 128bit and 256bit hash is getting the probability from "doesn't happen naturally" to "doesn't happen even with deliberate effort". So I see this not as a soundness fix but as an attempt to fend off attackers.

Or put differently, what exactly are we trying to fix if it's not attackers? A natural sort of... threat?

RalfJung commented 1 month ago

It's not just the hash size that changes, which you know of course.

I see this as a soundness fix, but it seems we have not only very different views but even different frameworks to base our arguments on. :shrug: It seems we have exhausted the amount of new arguments that will surface between the two of us. I'll wait and see what other people think, maybe someone will bring in new arguments.

the8472 commented 1 month ago

It's not just the hash size that changes, which you know of course.

If you're implying here that the essence of the soundness fix is not just changing the collision probability but also of the quality of the hash then I have not noticed any argument that siphash is inadequate for non-cryptographic use / for all the properties we would need for soundness barring malicious inputs.

Noratrieb commented 1 month ago

I think the point of this issue is that as-is, I think downcast is sound today. If you write safe code, no matter how bad, you will not hit any issues. The only way to exploit it is to write not only safe code but actively try really hard to ht this, with efforts on a scale unlike any other soundness issue we've had. It is impossible for any user to hit this in practice. Even with Rusts very strict interpretation of soundness, I see no point in saying that it's unsound. It would be way easier to just flip a few bits in the binary to cause unsoundness or edit the compiler to miscompile the binary, if I really wanted to cause unsoundness. The Rust language guarantees that TypeId is unique, and the compiler uses siphash to ensure this, following the usual compiler policy of not guarding against anyone actively malicious. (I hope this is at least in some ways a novel representation of the argument instead of going around in circles^^). I wouldn't oppose any improvement to TypeId as long as the downsides are small enough, but I also really don't think it's necessary under our current "thread models".

RalfJung commented 1 month ago

The only way to exploit it is to write not only safe code but actively try really hard to ht this, with efforts on a scale unlike any other soundness issue we've had.

That's not really true, the effort to construct collisions in our current hash function is not very high. By standards for cryptographic hash functions, SipHash-2-4 with an all-zero-key is weaker than MD5, and generating a collision for that is pretty easy these days. SipHash-1-3 is even weaker.

I think we can trust the cryptographic experts when they tell us that constructing a collision is not a significant computational effort, it's just a bunch of annoying work nobody has bothered to do yet.

RalfJung commented 1 month ago

I've nominated this for @rust-lang/lang discussion. I still think the issue is written in a much too broad way, making it hard to say anything concrete, but the discussion seems to focus on a reasonably concrete question: should the current TypeId hashing and incremental compilation scheme be considered sound, despite the fact that hash collisions can cause UB? Currently we are using a 128-bit hash, specifically SipHash-1-3 with an all-zero key.

The relevant facts and arguments seem to be:

SipHash even in its default configuration (SipHash-2-4) is not a cryptographic hash, as clarified by its author -- it is a pseudo-random function (PRF); that means it assumes a secret key, but Rust hard-codes an all-zero key. By standards for cryptographic hash functions, SipHash-2-4 with an all-zero-key is weaker than MD5, and generating a collision for that is pretty easy these days. SipHash-1-3 is even weaker. Previously, the lang team asked for the use of a "full (non-truncated) cryptographic hash"; the hash we currently use clearly does not qualify.
However, the currently used hash is a pseudo-random function if the key is randomly chosen. (Well, the function we use is weakened compared to its standard form of SipHash-2-4. But that's a separate question.) An argument could be made that if the code is "not malicious", it was picked without knowledge of the key, and therefore we can treat the hash function as a random function and use the birthday bound for estimating collision probabilities. I don't know and cannot judge the cryptographic validity of this argument.
Making TypeId use a stronger hash seems feasible without too much overhead (but we'll only know once we try), see https://github.com/rust-lang/rust/issues/129014.
Making incremental compilation use a stronger hash would incur significant performance cost, but there might be other options that could make exploiting hash collisions a lot harder, see https://github.com/rust-lang/rust/issues/129016.
(Programs loading Rust dylibs with dlopen are relying on yet another hash being unique, and that's just a 64bit hash, though it seems to only be critical to tell apart different crates with the same name. See https://github.com/rust-lang/unsafe-code-guidelines/issues/525 for more discussion. I think the other two soundness-critical uses of hashes are sufficient for this discussion, but wanted to list this for completeness.)

@briansmith and @tarcieri argue that the current situation is unsound since it is entirely computationally feasible to construct an unsound program, and therefore this is a soundness bug and should be fixed. Furthermore, even if we restrict ourselves to "naturally" occurring programs, a 128 bit hash is too small: "We shouldn't accept 1/2^64 rate of failure of memory safety by design as acceptable. (If we're operating on the assumption that an intentional 1/2^64 memory safety failure rate is OK, then let's put that in writing in the language reference specification; I believe such a change won't be accepted.)"

@the8472 and @Noratrieb say that we should not consider something a soundness issue if it takes specifically constructed malicious input to exploit the bug. "To phrase it differently. In our model we pretend that the inputs are generated by an actor that does neither know nor bother to optimize its input-generation process over the hash function or its key. Which means we can treat the situation as-if a random key was chosen." "At first blush it seems that as long as the compiler does not try to resist malicious inputs on many fronts (proc macros, build.rs, compiler bugs, language bugs, llvm bugs) it does not make much sense that an actual TypeId implementation should spend much effort defending against that. And given the de-facto security levels that the compiler can offer, a 128bit hash and only accounting for non-malicious inputs should remain sufficient for the moment, which would also be consistent with previous T-compiler decisions."

So, we'd like to know from the lang team: should the current implementation be considered sound, or should we strive for soundness even against examples constructed specifically to exploit the hash function? Is there an overarching threat model that we uniformly apply to all such questions (that could cover not only TypeId and incremental but also build scripts and would interact with LLVM's stance on malicious inputs), or are we doing case-by-case decisions?

If using the current approach is fine, a bunch of sub-questions arise:

How good of a PRF does it have to be -- if cryptographers tell us SipHash-1-3 is not a good PRF, should we use SipHash-2-4 instead? Can this be changed in the future to a hash that is not a cryptographic primitive in any sense, not even as a PRF?
Should we explore making the hash change over time, i.e. by incorporating the compiler version into the key or the initial state or so (I don't know SipHash well enough to judge what is possible here)? That would mean even if someone specifically constructs a collision, they would have to re-do that for each compiler version. This seems meaningful as "normal" soundness bugs stick around during compiler updates. We could hence say that TypeId collisions are not a soundness bug because they go away on each compiler update, providing a somewhat objective (if also somewhat shallow) way to distinguish this from "normal" soundness issues.
What kind of probabilities are acceptable for collisions? It seems hard to fix a concrete number here, but for the two main soundness-critical uses of hashes it boils down to "are 128bits enough". The current hash gives a 2^−64 chance of collision and hence potential UB when 6×10^9 things are being hashed ("things" being types for the TypeId issue, counted over the entire program; and query results for the incremental compilation issue, counted for each crate [I am not sure about the last point, someone please correct me if this is wrong]). I don't have numbers about the sizes of typical Rust programs to translate this into probabilities for actual code. It has also been suggested by @michaelwoerister that incremental builds should not be expected to live up to the same standards as release builds (with incremental turned off), so the answer here could be different for TypeId and incremental. Maybe t-compiler should have some lee-way here where a higher (but still low, if we assume non-malicious input) chance of soundness issues is acceptable in exchange for faster builds.

wesleywiser commented 1 month ago

Can we separate out the questions regarding TypeId and incremental compilation? TypeId is an interesting intersection of compiler, language and library design but incremental compilation is purely a compiler feature. I think the concerns and even the high-level threat models are different between these features.

RalfJung commented 1 month ago

The issue originally asked a question spanning TypeId and incremental compilation and build scripts and a ton of other things, so I already tried hard to narrow down its scope.^^ I didn't want the summary to be completely non-representative of the OP here.

But we can point out to t-lang that they might want to further subdivide this question, as you have just done.

traviscross commented 1 month ago

From above:

By standards for cryptographic hash functions, SipHash-2-4 with an all-zero-key is weaker than MD5.

Following the link, the claim is:

Siphash128 with a fixed key is a weaker hash function than MD5.

So, I assume you mean that finding collisions in SipHash-2-4-128 with an all-zero key is easier than finding collisions in MD5. Can you substantiate that?

The best attack on MD5 finds collisions in about 2^18 calls to the compression function.

RalfJung commented 1 month ago

So, I assume you mean to claim that finding collisions in SipHash-2-4-128 with an all-zero key is easier than finding collisions in MD5. Can you substantiate that claim?

I am relying on domain experts here, I am not a cryptographer. So that would be a question for @briansmith.

But given that even the author of SipHash explicitly states that SipHash is not a cryptographic hash function and, with a fixed key, therefore not even designed to be collision-resistant, I would say the burden of proof is on the side of anyone claiming that it is hard to find collisions in SipHash with a fixed key. MD5 was at least designed to be collision-resistant. We shouldn't by default assume that a function has properties that it was never designed to have.

traviscross commented 1 month ago

SipHash-2-4-128 is not collision resistant for an obvious reason. Its output size is only 128 bits. So by the birthday bound, it can't do better than approximately 2^64 security against collision attacks, which is not considered collision resistant these days.

(MD5 is fundamentally broken and falls far below the birthday bound.)

RalfJung commented 1 month ago

MD5 is fundamentally broken wrt a criterion that SipHash was never even intended to have. Without evidence to the contrary, SipHash should be treated like a non-cryptographic hash function.

EDIT: I am told that that's too naive of a view and there are features of SipHash that make it "obviously" a lot better than a non-cryptographic hash. I'll stay out of this sub-discussion and leave it to the crypto experts. :)

the8472 commented 1 month ago

I still think the issue is written in a much too broad way, making it hard to say anything concrete

Can we separate out the questions regarding TypeId and incremental compilation?

Not really. It isn't really about either of those specifically It's about classes of things we want to fix and which we don't or shouldn't because it doesn't make sense to spend effort on them. I have intentionally structured the issue as discussing threat models first and then the concrete case of TypeId. Threat models impact more than one issue because giving up on defending on one kind of threat makes it mostly pointless to defend against other ones with similar impact and difficulty. A simplifying analogy (don't pick too hard on it): You don't buy a bolted steel door but leave a ground floor window open and unguarded 24/7. You secure both or neither.

If we say

incremental compilation can't be considered secure for the moment
we do not loudly warn against profile.release.incremental=true
but release mode should be secured against malicious inputs (as in the case of TypeId)

then we have a lopsided defense, inconsistent decisionmaking or something along those lines. Similar lines can be constructed between other security-related decisions, e.g. LLVM's security stance is a big one here.

I'm trying to encourage looking at the bigger picture here to check if individual decisions make sense in context.

SipHash vs. MD5

I think this is somewhat of a distraction. For several reasons:

The choice of hashing function can be changed. It's an implementation detail. If siphash turns out weak we could replace it with another fast 128bit hash function and get the same collision resistance that I'm simply assuming for now.
Even the attacks against MD5 require dedicated searches targeting the hash function, not just random inputs. Which makes it a targeted attack, not a programmer happening to compile a lot of code.
You're talking about collision attacks. Those tend to be more difficult to exploit in meaningful ways than preimage attacks because the attacker needs to control both inputs for that and then get their victim to use both in a way that's exploitable.

So first we should decide what our security goals are, then what properties a hash function must have to achieve that, then whether the chosen hash function provides that desired level.

Should we explore making the hash change over time, i.e. by incorporating the compiler version into the key or the initial state or so

It's already fed in as input as part of the metadata hash. That probably has slightly different implications wrt. exploitability than feeding it in as key, but it is changing the result.

Furthermore, even if we restrict ourselves to "naturally" occurring programs, a 128 bit hash is too small: "We shouldn't accept 1/2^64 rate of failure of memory safety by design as acceptable. [...]"

Hrrm, I haven't checked the math. Isn't that only correct if you're generating a single huge program containing around 2^64 inputs, which would exhaust memory of any modern computer? If you're generating many smaller individual programs then they're only getting chances to collide within themselves which not only lowers the individual collision probability but also in aggregate.

Additionally there are hardware failure rates (such as memory bit flips) to consider which might even exceed those.

[...](If we're operating on the assumption that an intentional 1/2^64 memory safety failure rate is OK, then let's put that in writing in the language reference specification; I believe such a change won't be accepted

Natural failures are part of a threat model and the acceptable failure rate should be defined and then we should consider if that's achievable, yes.

RalfJung commented 1 month ago

It's already fed in as input as part of the metadata hash. That probably has slightly different implications wrt. exploitability than feeding it in as key, but it is changing the result.

Bjorn was talking about the crate version, I was talking about the compiler version. AFAIK currently our hash stays stable across compiler versions, but I may be wrong about that. (Also, I was talking all soundness-relevant hashes, not just TypeId.)

the8472 commented 1 month ago

https://github.com/rust-lang/rust/issues/10389#issuecomment-2245998853

[...] -Cmetadata, which in turn affects the TypeId hashes.

https://github.com/rust-lang/rust/issues/10389#issuecomment-1867351685

[...] Note that we do already include a hash of the full rustc version string though through StableCrateId. This includes the commit hash for the official builds and thus isn't predictable for future versions.

https://github.com/rust-lang/rust/blob/c416a6fc73657648b81670e5feadf5c7810aa31d/compiler/rustc_span/src/def_id.rs#L132-L134

bjorn3 commented 1 month ago

The compiler version is always hashed in by the compiler. The crate version is hashed in only when using cargo as build system, but any build system which allows multiple crates with the same name needs to hash in something unique between those crates through -Cmetadata.

RalfJung commented 1 month ago

Ah nice. (One of these comments you are quoting is 8 months older than the other, so it wasn't clear at all from your earlier message.)

I especially like the part about "includes the commit hash for the official builds and thus isn't predictable for future versions". :)

That said, this does not affect the hashes used for incremental builds, or does it?

bjorn3 commented 1 month ago

That said, this does not affect the hashes used for incremental builds, or does it?

It doesn't affect all hashes, but many queries directly or indirectly use DefId, which when getting StableHashed will hash the StableCrateId.

traviscross commented 1 month ago

Furthermore, even if we restrict ourselves to "naturally" occurring programs, a 128 bit hash is too small: "We shouldn't accept 1/2^64 rate of failure of memory safety by design as acceptable. [...]"

Hrrm, I haven't checked the math. Isn't that only correct if you're generating a single huge program containing around 2^64 inputs...

Here's the math:

#![feature(f128)]
pub fn est_log2_collision_prob2(m: u16, n: u16) -> f128 {
  let (m, n) = ((m as f128).exp2(), (n as f128).exp2());
  (1. - (-(n * (n - 1.)) / (2. * m)).exp()).log2()
}

Playground link

Assuming a 128-bit random oracle, the probability that we'll see a collision given N queries is no greater than:

N	Collision probability
`2^10` (~1k)	`2^-109`
`2^20` (~1M)	`2^-89`
`2^30` (~1B)	`2^-69`
`2^40` (~1T)	`2^-49`

That is, in our use case, a program that contains one million types has one chance in 2^89 of an accidental collision (given this random oracle).[^1]

[^1]: In terms of the non-accidental, after 2^64 queries we'll have a ~39% chance of having found a collision, and while you're not going to do 2^64 queries on your laptop, it's not considered cryptographically hard.

briansmith commented 3 weeks ago

Perhaps it would be better if on the language level it is specified as globally unique value/exact comparison without mandating a particular implementation. That would leave the probabilistic reasoning, choice of comparison method etc. to the compiler.

(That's what I suggested in point 1 in https://github.com/rust-lang/rust/issues/10389#issuecomment-2245924622.)

It is also what the rustdoc documentation for TypeId says already, as of today:

A TypeId represents a globally unique identifier for a type. Each TypeId is an opaque object which does not allow inspection of what’s inside but does allow basic operations such as cloning, comparison [...]

So the issue regarding TypeId is (and has always been) that Rust code may be (already) making safety and/or correctness arguments based the logic that TypeId::of::<T>() == TypeId::of::<U>() implies T == U, but the implementation doesn't seem to do what is documented.

But given that even the author of SipHash explicitly states that SipHash is not a cryptographic hash function and, with a fixed key, therefore not even designed to be collision-resistant, I would say the burden of proof is on the side of anyone claiming that it is hard to find collisions in SipHash with a fixed key.

Yes, this is my position.

When I mentioned MD5 before, here's what I meant: Look at the number of rounds in MD5, and compare that to the number of rounds in SipHash. Also, look at the complexity of each round in MD5, and compare that to the complexity of each round in SipHash. Clearly MD5 is doing MUCH more work than SipHash. Now take a look at the for SipHash; if SipHash had MD5-level security then it would be an inappropriate design for its performance/security goals, so it would be reduced in strength in its design until it is much weaker than MD5. So, it is extremely unlikely anybody is going to share a well-reasoned argument that SipHash-with-a-fixed-key is stronger than MD5 because it very likely isn't even close (indeed, such an argument could literally be somebody's entire PhD). Would we be happy with MD5-level collision resistance? I doubt it. So, don't waste time chasing this. In other words, the argument is intended to help people avoid wasting time going down the wrong path.

traviscross commented 3 weeks ago

When I mentioned MD5 before, here's what I meant: Look at the number of rounds.... Clearly MD5 is doing MUCH more work than SipHash.... So, it is extremely unlikely anybody is going to share a well-reasoned argument that SipHash-with-a-fixed-key is stronger than MD5 because it very likely isn't even close...

This argument is too simplistic. Even keyed MD5 is broken. The preimage security of MD5 is broken. Using it in HMAC (where MD5 is applied twice, with a key) is broken.[^1]

Analysis in 2014 found a chosen related key collision[^2] (i.e., not a "real" collision) in SipHash-2-d. If the authors could have easily found a fixed-key collision attack on SipHash-2-d below the birthday bound, I'm sure they would have reported that much more impressive result instead.[^3]

Meanwhile, we know from 2013 work that full MD5 collisions can be found on a graphing calculator.

Observe that SipHash has a 256-bit internal state (to 128-bit for MD5)[^4] and the message injection offers much less of the kind of control over that state that you want when trying to find internal collisions.[^5]

My feeling is that it would be difficult to support with cryptanalysis the claim that SipHash is weaker than MD5, and so we shouldn't be making confident-sounding statements to that effect, setting aside whether this even matters for the threat model here.

[^1]: That is, a blanket statement that SipHash is weaker than MD5 implies that SipHash does not uphold its PRF or MAC security claims. But those claims are unbroken in the literature.

[^2]: This means they were free to pick two arbitrary keys for the two arbitrary messages. This gives much more freedom and is therefore a lot easier than having to commit to a fixed key, especially since key bits touch all internal state bits directly whereas message bits do not.

[^3]: This isn't to say it's not possible, obviously. Just that it doesn't seem trivial.

[^4]: Collisions are generally found by finding internal collisions -- that is, collisions of the internal state. All else being equal, a larger internal state makes this harder. A 256-bit internal state moves the birthday bound for internal collisions to 2^128.

[^5]: Note also, when counting operations, that SipHash acts on 64-bit words (to 32-bit for MD5) and that it injects only 8 bytes per compression (to 64 bytes for MD5).

Noratrieb commented 3 weeks ago

Why exactly are we spending precious time arguing about whether SipHash is more secure than MD5? It's clear that neither of the two are cryptographically secure, and that's not the point of the issue. The point is whether we want cryptographic security in the first place. The precise algorithm is an implementation detail that doesn't matter. It's just asking if it's good enough if there are no collisions on normal input, or if it should be collision resistant against adversarial input too. And this question is answered by spelling out the threat model here.

traviscross commented 3 weeks ago

I tend to agree. However, you had earlier mentioned:

The only way to exploit it is to write not only safe code but actively try really hard to ht this, with efforts on a scale unlike any other soundness issue we've had.

As reflected in @RalfJung's comment here, the difference between 2^64 (the birthday bound) and ~2^18 (MD5 security) matters to some people for this line of argument, even if 2^64 is not itself cryptographically significant.

That's not an argument I would make. But if this does matter to people, then it's worth questioning the claim.

Noratrieb commented 3 weeks ago

I think even if TypeId was MD5 (which it obviously will never be), my point still stands. 2^18 is, as the name implies, security. Against adversarial inputs. Which I did explicitly not consider to be in scope. For people not explicitly trying to attack it, the bound is much higher (maybe the birthday bound? maybe something a bit lower? I'm not sure about that). If you think that our use of SipHash is not secure against Rust programmers not actively trying to crack it (and we know that sufficiently bad code can often get close to adversarial), then we should it course change it.

Personally, I would argue that threat models against the compiler are a T-compiler issue, so I think this shouldn't involve T-lang at all, other than saying that TypeId must obviously be unique. How and against what the compiler ensures this should be up to T-compiler.

traviscross commented 3 weeks ago

For people not explicitly trying to attack it, the bound is much higher (maybe the birthday bound? maybe something a bit lower? I'm not sure about that).

The math on that is here. Assuming SipHash-c-d-128 is indistinguishable from a PRF, the probability of a random program with an absurd 1M types hitting a natural collision is one in 2^89 (i.e. "never").

carbotaniuman commented 3 weeks ago

I would just like to mention that C++ uses more than a simple non-cryptographic hash for ensuring the safety of things like it's dynamic_cast. And C++ is not known for it's safety. Using SipHash, a hash not intended for collision avoidance and with arguable properties simply is a rash decision.

I would also like to point out that we in the past have taken steps to let pathological (massive, autogenerated code based on human would ever deign to write) code compile more easily, so we should not be intentionally needing our security and making it potentially unsound.

We do not argue against miscompilations (no matter how rare or tortured) as being outside of our "threat model", whatever that means. TypeId collisions are unquestionably valid Rust programs that misbehave.

If I were to come up with a "threat model", I would say that Rust should not miscompile (weird but no malicious) Rust programs and crates. A build script that deleted your hard drive and installs a rootkit is the Rust compiler exacting running the code you told it to, even if you would rather it didn't. A valid program with some weird TypeIds is just that a valid program that we miscompile, and hence unsound.

traviscross commented 2 weeks ago

If I were to come up with a "threat model", I would say that Rust should not miscompile (weird but no malicious) Rust programs and crates.

Note that when we talk about collision resistance, we mean specifically resistance against the intentional (i.e. malicious).

If you're not worried about malicious programs and crates, you don't need to worry about colliding TypeIds[^1]. The probability of these naturally colliding, even in weird or otherwise pathological programs, can be safely rounded down to zero.

[^1]: Note that this does not imply the inverse. If you are worried about malicious programs and crates, you may still not need to worry about colliding TypeIds. It seems difficult to work out a plausible attack.

carbotaniuman commented 2 weeks ago

That would depend on whether our hash of choice has any pathologies that may cause certain patterns to exhibit more collisions no?

For example, I would be completely fine with a truncated Blake hash, because I am fairly confident in my belief that are no pathological cases there.

traviscross commented 2 weeks ago

There are no known such pathologies in SipHash[^1], and there are good reasons to believe that no such significant pathologies may exist.

SipHash is a cryptographically-strong pseudo-random function (PRF). That is, it claims (with recommended parameter choices) that, given some random fixed secret key, its behavior is indistinguishable (for a bounded adversary) from that of an actually random function.

This claim is unbroken in the literature and has comfortable security margins for the recommended parameter choices.

If the output of SipHash exhibited any patterns with higher-than-chance probability, that would break the claim of PRF security.[^2] That is, finding any such pattern would be an enormous advance compared to the current state-of-the-art cryptanalysis.

[^1]: Again, setting aside intentional attacks, e.g. collision-finding, using knowledge of the fixed key.

[^2]: Note that this claim does go beyond the accidental. The attacker isn't allowed to know the key, but is allowed to pick arbitrary inputs and get the output for each, and is allowed to dynamically alter these inputs to try to find some pattern.

RalfJung commented 2 weeks ago

FWIW we're currently not using the recommended parameter choices. We're using SipHash-1-3 instead of SipHash-2-4.

the8472 commented 2 weeks ago

So I would suggest that for the purpose of threat model discussion we round SipHash's property's off to "a statistically ideal 128bit hash function, not robust against adversarial inputs".

The choice of hash function is an implementation detail and can be changed later if it's found to fall short. Or if a better alternative is found.

SipHash is interesting as an informative example around speed-quality tradeoffs. It sits between "obviously too weak" and "obviously correct". Practically it's of course also relevant because it's the current implementation, but in principle we could replace it with something in the same category, not just with something in a more secure category.

This could be the order of decisions to make:

Decide how much T-lang wants to delegate to T-compiler / how much of a gap between spec and implementation is allowable. Previously the question was brought to T-lang, that's why I tagged it as such. Edit: Actually, it was originally brought to both teams. (and github is terrible at linking to it):
The relevant team then settles on a threat model. Considering the existing open barn doors and the costs of treating things as security issues, does it make sense to defend against all or some types of adversarial inputs? If not today, should it be made official as a long-term goal? Are 128bits enough (see collision probabilities for various input sizes) for non-malicious inputs?
Check if SipHash still makes sense under that model

Independently there can be quality of implementation work, finding alternative approaches that do better than the required minimum as long as we don't have to pay significant penalties for them.

traviscross commented 2 weeks ago

FWIW we're currently not using the recommended parameter choices. We're using SipHash-1-3 instead of SipHash-2-4.

Yes. And if that worried us enough, that would be trivial to change.

Probably I don't find it that worrying, but I'd also probably be interested in a representative benchmark with e.g. SipHash-1-5-128, SipHash-2-4-128, etc. to see what it would cost to remove some caveats. Having numbers on the statistical distribution of the length of the inputs used to compute TypeIds would be helpful.

Some analysis:

In SipHash-1-3-128 (what we use), the last 7 bytes of input go through only 4 rounds to emit the first 64 bits of output. Then 3 more rounds happen to emit the last 64 bits of output.

To give some intuition, here's what a one-bit change looks like propagating through each of 6 SipHash rounds (this visualizes the internal 256-bit state):

siphash

We can see immediately that 3 rounds doesn't achieve full diffusion. At four rounds it's hard to eyeball, but we know from cryptanalysis that there are practical distinguishers. These rely on carefully crafted inputs.

But things get an awful lot harder with each additional round. There are no reported distinguishers on five rounds.

So, as it applies to us with SipHash-1-3-128, I'd say the first 64 bits of the TypeId are perhaps a bit weak cryptographically with respect to the last 7 bytes of input. But note that there are only 2^56 possible last 7 bytes (mapping to 2^64 possible outputs), and we're not making adversarial queries.

Since 3 rounds obviously isn't enough for a secure pseudo-random permutation, I'd also suspect there may exist some theoretically detectable correlations between the first 64 bits of output and the second 64 bits of output, though these could still be pretty hard to find in context.

The most cost-effective way to increase confidence would be increasing the number of finalization rounds. SipHash-1-5-128 would have many fewer caveats, at a cost of 4 total additional rounds per message.

Still, given the amount of headroom here, it would surprise me to ever see a SipHash-1-3-128 TypeId collision in a program that was not the result of intentional collision-finding.