Collisions in type_id - Githubissues

DaGenix commented 10 years ago

The implementation of type_id from #10182 uses SipHash on various parameters depending on the type. The output size of SipHash is only 64-bits, however, making it feasible to find collisions via a Birthday Attack. I believe the code below demonstrates a collision in the type_id value of two different ty_structs:

use std::hash;
use std::hash::Streaming;

// I believe that this pretty much the same thing as hash_crate_independent() in ty.rs
// for a ty_struct on a 64-bit platform
fn hash_struct(local_hash: &str, node: u64) -> u64 {
    let mut state = hash::default_state();
    state.input([18]);
    state.input(local_hash.as_bytes());
    do node.iter_bytes(true) |bytes| { state.input(bytes); true };
    state.result_u64()
}

fn main() {
    // This represents two structures with different node values from a crate with a 
    // local_hash value of "local" that end up getting the same hash and thus, 
    // I think, the same type_id
    assert!(hash_struct("local", 0x9e02c8943c336302) == hash_struct("local", 0x366a806b1d5f1b2));
}

alexcrichton commented 10 years ago

I'm not entirely sure how feasible it is for a program to have 0x366a806b1d5f1b2 node ids (2.5 trillion), but this is still concerning.

We could in theory have very cheap inequality among types, and then have an expensive equality check. Something which may walk the respective TyDesc structures in parallel to make sure that they're the same. We could also bump up the hash size to using something like sha1/sha2 and have the type_id() intrinsic return [u8, ..N] to reduce the possibility of a collision.

Either way, I don't think that this is a super-pressing issue for now, but I'm nominating to discuss whether we want to get this done for 1.0. This could in theory have serious implications depending on how frequently Any is used.

alexcrichton commented 10 years ago

Ah, it was already nominated!

bill-myers commented 10 years ago

Why not compare an interned version of the type data string? (i.e. what is currently passed as data to be hashed, possibly SHA-256 hashed first)

The linker can be used for interning by emitting a common symbol with the type data string as name and taking its address, and otherwise the same thing can be done manually in a global constructor.

This way it's always a pointer comparison, and there are no collisions.

DaGenix commented 10 years ago

I don't know how node id values are generated, but assuming that they are generated sequentially, this particular collision is not realistic. However, its not hard to find collisions for more realistic node id values by picking particular values for the crate hashes:

assert!(hash_struct("a2c55ca1a1f68", 4080) == hash_struct("138b8278caab5", 2804));

The key thing to consider isn't the number of node id values, though: its the total number of type id values. Some quick (hopefully correct) math shows that there is a 0.01% chance of a collision once there are around 60 million type id values. That's still a pretty large number of type id values for a somewhat low probability of a collision, thought. So, its unclear to me how big a deal this is for the Rust 1.0 timeframe. It all depends on what the acceptable probability of a collision is.

nikomatsakis commented 10 years ago

When I saw that @alexcrichton proposed using a hash, my first reaction was "collision!" but then I thought "...but exceedingly unlikely to occur in practice". I think this is not a matter of imminent destruction but if we can leverage the linker or some other scheme to avoid this danger, we should -- and perhaps we should just go ahead and mark the current scheme as deprecated and just plan on finding a replacement scheme.

thestinger commented 10 years ago

A cryptographic hash designed for this purpose (larger output) would be enough. Although, a larger output would be more expensive to compare (four u64 comparisons for SHA2).

pnkfelix commented 10 years ago

We don't need to deal with this right now. P-low.

steveklabnik commented 9 years ago

How relevant is this issue today? I think that it's all the same, but am not sure.

thestinger commented 9 years ago

It's 64-bit so collisions are likely with enough types (consider recursive type metaprogramming) and it doesn't have any check to bail out if one occurs. Bailing out is not a very good solution anyway, because it pretty much means that there's no way to compile the program, beyond using a different random seed and hoping for the best. It's a crappy situation.

vks commented 9 years ago

Note that "hoping for the best" by iteratively changing the seed might work with overwhelmingly large probability after very few iterations.

sorear commented 8 years ago

use std::any::Any;

fn main() {
    let weird : [([u8; 188250],[u8; 1381155],[u8; 558782]); 0] = [];
    let whoops = Any::downcast_ref::<[([u8; 1990233],[u8; 798602],[u8; 2074279]); 1]>(&weird);
    println!("{}",whoops.unwrap()[0].0[333333]);
}

Actually a soundness issue. playground: http://is.gd/TwBayX

pnkfelix commented 8 years ago

I'd like the lang team to devote a little time to this now that we are post 1.0. Nominating

nikomatsakis commented 8 years ago

OK, lang team discussed it, and our conclusion was that:

This issue ought to be fixed, it's silly not to.
This is an implementation detail that we could change whenever we want (right?)
Nonetheless, we probably ought to open an RFC or at least a discuss thread, with a proposal to do better, since probably people will have some clever ideas.
Probably the runtime overhead of the virtual call in the Any trait is way more than a strcmp anyhow for all realistic types.

nikomatsakis commented 8 years ago

I was wondering about a design where we do something like:

generate a static string representing the full type; in static builds, at least, this will be interned by the linker;
generate a hash

compare the string pointers for equality (to give a fast equality check). If that fails, compare the hashes for inequality (to give a fast inequality check). If THAT fails, compare the strings for content (to handle dynamic linking).

Although re-reading the thread I see @bill-myers may have had an even more clever solution.

pnkfelix commented 8 years ago

@nikomatsakis putting the hash of the data at the start is a good idea, to increase the probability that we catch unequal things quickly. It seems to me like @bill-myers' approach composes fine with that strategy.

sorear commented 8 years ago

I doubt the "problem" is limited to Any. You can probably confuse the compiler just as effectively by colliding hashes for symbol mangling, or many other things. What is the objective here? Since Rust is not a sandbox language, I don't think "protect memory from malicious programmers" should be one of our goals (we should document the types of undefined behavior that can be hit in safe code, and fix the ones that are possible to hit by accident; if someone is determined to break the type system, they can already write an unsafe block, or use std::process to launch a subprocess that ptraces its parent and corrupts memory).

hmvp commented 7 years ago

Thanks to: https://www.reddit.com/r/rust/comments/5pfwjr/mitigating_underhandedness_clippy/dcrew0k/ https://is.gd/Xb7L5r

This example works on Beta and Nightly.

Mark-Simulacrum commented 7 years ago

@nikomatsakis Should this be marked as I-unsound? I've done so for now, since that seems to be the general conclusion a couple of times by different people, but please unmark if I'm wrong.

withoutboats commented 6 years ago

You can probably confuse the compiler just as effectively by colliding hashes for symbol mangling, or many other things.

Would any of these result in incorrect runtime behavior, or just bizarre compiler errors?

EDIT: The comment I'm replying to is over 2 years old.

nikomatsakis commented 6 years ago

Would any of these result in incorrect runtime behavior, or just bizarre compiler errors?

It could be unsound, if it fooled Any into an incorrect downcast.

withoutboats commented 6 years ago

@nikomatsakis Right, I meant sorear's other examples (name mangling and such). The fact that this can cause a soundness bug separates it from other mishashes which would just result in a compiler error.

SoniEx2 commented 5 years ago

eventbus crate uses a lot of Any, as it encourages you to use a lot of different event types and emulated inheritance. this could become an issue...

SoniEx2 commented 5 years ago

What if, instead of making TypeId bigger, we add another implementation detail that works like eventbus?

Calls to Any, TypeId, etc are converted into a set of static usize that get initialized at runtime.

https://internals.rust-lang.org/t/static-generics/8734?u=soni

we can then use an expensive check (string lookup/comparison) once and use a cheap check every other time.

strega-nil commented 5 years ago

Why not use a pointer comparison of an inline symbol? VTables should work well, if we remove the thing where vtables can be distinct across TUs.

(in C++)

template <typename T>
struct VtableForAny {
  std::size_t size;
  std::size_t align;

  void (*destroy)(void *);
};
template <typename T>
inline VtableForAny<T> VTABLE_FOR_ANY = {
  sizeof(T),
  alignof(T),
  T::~T(),
};

struct Foo {
};

template <typename T>
T *downcast(AnyRef any) {
  if (any.vtable == &VTABLE_FOR_ANY<T>) {
    return static_cast<T*>(any.ptr);
  } else {
    return nullptr;
  }
}

We could do a similar translation for Rust, and this is guaranteed to work by LLVM.

RalfJung commented 5 years ago

if we remove the thing where vtables can be distinct across TUs.

I think that would not be easy. It's not just TUs; if you create a trait object in a const then it will have a vtable generated by miri, and I am not sure if that will get deduplicated with the one generated for run-time use.

hanna-kruppe commented 5 years ago

@ubsan

if we remove the thing where vtables can be distinct across TUs.

We can't, for the same reason generic statics couldn't be unique across TUs if we added them (dynamic linking on Windows, possibly other platforms).

strega-nil commented 5 years ago

@rkruppe ah, yeah, forgot about dynamic linking. It's likely we'd have to do a strcmp then, in the case where addresses are distinct, so you'd likely get a fast path and a slow path.

@ralfjung it's absolutely possible and, in fact, quite easy. It just requires a simple LLVM attribute.

hanna-kruppe commented 5 years ago

What strings do you want to compare? Lacking a global namespace, the only way I know of to get a "type name" that won't be the same across different crates is using the crate disambiguator, which is a hash, which is therefore susceptible to the same sort of collision. Or can we arrange for such collisions always resulting in a linker error, even when the crates are only linked together dynamically and indirectly?

strega-nil commented 5 years ago

@rkruppe the name of the type, I think? That should be globally unique.

crate_name::foo::bar::Baz

maybe crate_name:v0.1.0::foo::bar::Baz

hanna-kruppe commented 5 years ago

Crate names are not globally unique, and neither are (crate name, version) tuples.

strega-nil commented 5 years ago

@rkruppe huh! then I guess whatever information we pass to the hash function :P

hanna-kruppe commented 5 years ago

The disambiguator is passed in using the -C metadata argument, rustc doesn't know where it comes from. Arguably that places it outside of the domain of rust and thus collisions of it aren't our soundness problem, but let's consider Rust workflows more broadly. Typically it's computed as a hash by Cargo, and AFAIK that hashing includes a whole ton of things that are impractical to put into the binary.

SoniEx2 commented 5 years ago

You can work around the generic statics issue the same way eventbus does it currently.

Language support would be an improvement over using a macro, tho.

SoniEx2 commented 5 years ago

So, first, for fixing the collisions:

struct TypeId(&'static [&'static [&'static str]]);

Why this specific layout? Three reasons:

We want an identifier that's consistent across builds, so a string.
We want an identifier that's relatively lightweight, so we want to reuse substrings, or a slice of strings.
We want to handle generics, which contain identifiers, so a slice of slices of strings.

So, for example, we can end up with some of the following TypeIds:

TypeId(&[&["rustc-1.30", "cargo-1.30", "eventbus-0.4.0", "Handlers<T>"], &["rustc-1.30", "cargo-1.30", "ircclient-0.1.0", "IrcLineEvent"]])
TypeId(&[&["rustc-1.30", "cargo-1.30", "std-1.30", "result", "Result<T, E>"], &["rustc-1.30", "cargo-1.30", "eventbus-0.4.0", "Handlers<T>"], &["rustc-1.30", "cargo-1.30", "ircclient-0.1.0", "IrcMessageEvent"], &["rustc-1.30", "cargo-1.30", "eventbus-0.4.0", "MyError"]])

Note that these are only examples and the real thing probably wouldn't look like this, but this shows the basic idea.

It's still possible to intentionally create collisions, ofc, but that's different from accidental collisions (what we're trying to prevent here).

Also note that these are still opaque, so the contents need not be unambiguous to humans, only to the compiler (or the debugger, if we want that...).

thepowersgang commented 5 years ago

@rkruppe With your comment about the metadata string, did you mean that it's "safe" to use it as a disambiguator between different versions of the same crate (be it versions or feature-sets), or that we shouldn't use it (because with non-cargo workflows it may not be passed).

Storing the type "name" and comparing that (potentially with a short hash for a fail-fast option) and trusting the -C metadata argument to be sufficient for disambiguation sounds like it could be the best option for reducing the chance of collisions.

hanna-kruppe commented 5 years ago

I didn't really mean to recommend any course of action, just list constraints. One of them is that the metadata string is itself a hash in the typical workflow and thus could have collision.

But thinking about the options you gave, it seems to me that: if we are OK with a neglegible chance for collisions, we can achieve that (in principle, I don't know how large the collision risk is today) with a single hash. Well, assuming collisions in the crate disambiguator can't be detected -- I am not sure whether there is perhaps some clever use of the linker to detect them.

SoniEx2 commented 5 years ago

I have tricks to improve performance, but they need to come after this

(am I even being understood? I have no idea if ppl can understand me but they often seem to ignore me)

eduardosm commented 5 years ago

What about keeping somewhere a "cache" of calculated TypeIds? That cache would store a TypeId-type pair for each type for which TypeId::of::<T> has been called. If a collision is detected, another TypeId is calculated using a hash table collision resolution algorithm. So, TypeIds would be the index in a hash table with 2^64 buckets. With this approach, we can continue to use a 64-bit integer as TypeId (or maybe it could be reduced to 32-bit). However, this means that, if the cache is deleted, all crates should be compiled again.

We could also forbid TypeId zero, so size_of::<Option<TypeId>>() == size_of::<TypeId>().

icefoxen commented 5 years ago

A... well, not resolution, but partial mitigation might be achieved just by calculating the TypeId of all types in a compilation run and making sure they're unique. This doesn't solve the problem but can at least make it noisy for an end-user instead of possibly silently failing. Doesn't help the networked/IPC use case like eventbus much, but it's something.

SoniEx2 commented 5 years ago

eventbus is not networked/IPC. the idea was to use multiple dll's/so's.

dlight commented 5 years ago

Why doesn't type_id use a perfect hash function like rust-phf?

In computer science, a perfect hash function for a set S is a hash function that maps distinct elements in S to a set of integers, with no collisions.

bjorn3 commented 5 years ago

Why doesn't type_id use a perfect hash function like rust-phf?

Because it already has to decide on type ids during codegen of one crate, at which point type ids from upstream crates are not yet known.

dlight commented 5 years ago

Because it already has to decide on type ids during codegen of one crate, at which point type ids from upstream crates are not yet known.

Perhaps type_id should have some bits for a crate id, and some bits for the per-crate phf. This way you ensure there is no collisions.

tarcieri commented 5 years ago

Depending how nuts you want to go with it, you could use the package checksum (truncated to 128-bits) as e.g. a SipHash key, with the crate name and the rest of the type information as the SipHash input.

No idea how hard it'd be to make rustc aware of package checksums if it isn't already, but it should suffice to eliminate collisions so long as the names of crates within the same package are unique.

The output size of SipHash is only 64-bits

Note that there's a variant of SipHash with a 128-bit output as well (although I'm guessing an implementation thereof is not available in-tree)

tema3210 commented 4 years ago

Depending how nuts you want to go with it, you could use the package checksum (truncated to 128-bits) as e.g. a SipHash key, with the crate name and the rest of the type information as the SipHash input.

No idea how hard it'd be to make rustc aware of package checksums if it isn't already, but it should suffice to eliminate collisions so long as the names of crates within the same package are unique.

The output size of SipHash is only 64-bits

Note that there's a variant of SipHash with a 128-bit output as well (although I'm guessing an implementation thereof is not available in-tree)

But usage of more than 64 bit will change size of 'TypeId' struct. That's the worse. I like idea of making crate id bits a part of TypeId. If we change the size of TypeId to 128 bits, then we can make first 64 bit be a crate id and second to crate local typeid which,in that case, can be built buy perfect hash function.

Y0ba commented 4 years ago

I hit this bug when tried to create type map (HashMap<TypeId, _>) to store global data for some types. Since static variables with generic parameters from outer functions are not allowed (this pattern was used in previous version of app written in C#) type map was the only option. Types were auto-generated from existing Protobuf-files and there were few hundreds of them. Thankfully I was able to catch collision in test where I put all types inside map and checked if they exist.

And I think this behavior (potential collisions) should be documented in TypeId and Any docs.

withoutboats commented 4 years ago

Probably the best approach to your problem would have been to generate a separate static for each type, rather than generating a type map. A type map only really makes sense when the set of types in it can only be determined at runtime.

Do you know how many types were generated as part of this process? You say "few hundred" but I assume you mean protobuf files; the chance of a collision in a map with only a few hundred types should be very low.

Y0ba commented 4 years ago

Probably the best approach to your problem would have been to generate a separate static for each type

That's what I had in C#, and at first tried in Rust something like this:

fn get_data<T>() -> &'static mut Data<T> {
    static mut DATA: Data<T> = Data::new();
    unsafe { &mut DATA }
}

but it doesn't work. Type map was a nice solution without additional codegen. But at the end I did just like you said - forked protobuf compiler and added separate static for each type in generated files.

Do you know how many types were generated as part of this process?

Probably around 400. In protobuf-file there were way more, but in type map I used only a small subset of them.

the chance of a collision in a map with only a few hundred types should be very low.

Indeed. I hit it accidentally and didn't even expect that.

withoutboats commented 4 years ago

You can't have generic statics but you can generate a static of a different type for each type, and then associate the static with the type using a static method on a trait.

So I believe the upper limit on a 64 bit hash would give a probability of this occurring at less than 1 in 8.3 million (512 out of 2^32). Obviously SipHash is not designed to be collision resistant and the probability is greater than this, but I can't find any good number as to how much greater. If the collision resistance is so low that a user has a reasonable chance of encountering a collision among less than 500 TypeIds, we should prioritize a resolution to this higher than we have - mainly because its an awful footgun as you've mentioned.

Does anyone have better numbers on the lack of collision resistance in SipHash than the birthday bound?

tarcieri commented 4 years ago

From the SipHash paper, the major concerns around collision resistance are the small output size and correspondingly small birthday bound as you noted:

https://www.131002.net/siphash/siphash.pdf

Security is also limited by the output size (64 bits). In particular, when SipHash is used as a MAC, an attacker who blindly tries 2^s tags will succeed with probability 2^(s−64)

The paper further notes:

We comment that SipHash is not meant to be, and (obviously) is not, collision resistant.

(this is due to the small output size)

rust-lang / rust

Collisions in type_id #10389