vrakeshreddy / mb-unit

Automatically exported from code.google.com/p/mb-unit

New contract verifier to verify resistance to hash code collision. #651

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Implementing a good hash code generator for a given type might be tricky. Basically, the idea is to feed a dedicated contract verifier with a large number of values and to verify that the hash codes do not collide too often.

[VerifyContract]
public readonly IContract HashCodeCollisionTest = new HashCodeCollisionContract<Foo>
{
    Tolerance = 0.01, // = 1% of collisions tolerated. If higher, the test will fail.
    Instances = new DistinctInstanceCollection<Foo>(GetManyFooInstances()),
};

private IEnumerable<Foo> GetManyFooInstances()
{
  // Provide as many distinct Foo instances as possible.
}
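
Purely as an illustration, the generator could be filled in along these lines (a Foo constructor taking an int key is an assumption, not part of the proposal):

private IEnumerable<Foo> GetManyFooInstances()
{
    // Hypothetical sketch: produce a few thousand distinct Foo instances, either
    // pulled from a real catalog or, as here, built from a simple counter.
    for (int key = 0; key < 5000; key++)
        yield return new Foo(key);
}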

Does it make sense?
What do you think?
Any suggestion?

Original issue reported on code.google.com by Yann.Tre...@gmail.com on 9 Apr 2010 at 8:37

GoogleCodeExporter commented 9 years ago
I like it.  It's a kind of stochastic criterion.  Testers use these all the 
time but
they don't always get the statistics right so the tests can be brittle or pass 
for
the wrong reasons.

I think one important consideration will be efficiency.  We'll want to generate and test lots of instances really fast.  That means we probably don't want to create a new test step for each one.

Interestingly, I think this problem could be solved quite well using dataflow analysis.  Part of defining a good hash code is to ensure that all variations in the input correspond to a wide spread of variations in the hash code.  So if we could track where each bit in the hash code came from, we could compute the probability of a given bit of the hash code being perturbed by a one-bit change in an input.  Ideally all bits of the hash code would be equally likely to be perturbed in this way.  Of course that's kind of an abstract definition (and I'm not sure it's accurate).  Sounds like something Pex could help with...
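
A minimal sketch of what such a bit-perturbation measurement could look like, assuming the input is a simple int key rather than an arbitrary object (so no dataflow analysis is needed); none of this is existing MbUnit or Pex API:

using System;
using System.Collections.Generic;
using System.Linq;

static class AvalancheSketch
{
    // Rough avalanche check for a hash function over a simple int key: flip each
    // input bit and record how often each output bit of the hash code changes.
    // For a well-mixed hash, every rate should be close to 0.5. The Func<int, int>
    // delegate stands in for the hash function under test; real inputs would be
    // arbitrary objects, which is where dataflow analysis would come in.
    public static double[] BitPerturbationRates(Func<int, int> hash, IEnumerable<int> keys)
    {
        var flips = new long[32];
        long trials = 0;
        foreach (int key in keys)
        {
            int baseline = hash(key);
            for (int inputBit = 0; inputBit < 32; inputBit++)
            {
                int changed = baseline ^ hash(key ^ (1 << inputBit));
                for (int outputBit = 0; outputBit < 32; outputBit++)
                    if ((changed & (1 << outputBit)) != 0)
                        flips[outputBit]++;
                trials++;
            }
        }
        return flips.Select(f => (double)f / trials).ToArray();
    }
}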

Original comment by jeff.br...@gmail.com on 9 Apr 2010 at 9:10

GoogleCodeExporter commented 9 years ago
Can't wait for v3.3 :)

Original comment by Yann.Tre...@gmail.com on 9 Apr 2010 at 10:24

GoogleCodeExporter commented 9 years ago
Just to formalize my thoughts.

I think that most developers (including myself) implement GetHashCode in a very subjective way: shifting the bits left, shifting them right, xoring them together, or shaking them a bit. Then they assume that the result "must" be OK without any objective proof, other than "I remember having seen Oren Skeet, Jon Hanselman, or another random Scott Eini doing that, so it must be good enough!"

The truth is that we know it actually depends on the data. A hash algorithm might be good in some scenarios and bad in other cases.

So the goal is to provide a convenient and easy way to verify the "efficiency" of an implementation against real-world values. A simple approach is to let the user provide a reasonably large set of distinct real-world values to a dedicated contract verifier (either from an exhaustive catalog/database, or created with randomly generated keys). The contract verifier should then extract the hash code from every instance, store them all, and compute useful statistical metrics. Finally, it would run a few tests to verify that the metrics are in the expected range. The limits and thresholds could be changed by modifying the properties of the contract verifier, but we could provide reasonable default values.

Some interesting metrics are:

Uniform Distribution - A simple approach could be to verify that each bit of the 32-bit hash code is set in at least N% of the cases. It is probably necessary to be able to exclude the highest bits from the verification when the user knows that the number of possible instances is low. (A rough sketch of this check and of the collision-rate check follows the list below.)

Collision Probability - A bit tricky, because a value of 100% is perfectly acceptable (and unavoidable) when the possible number of instances is greater than 2^32. However, for a small set of possible values, we may want to reach 0%. We should perhaps not provide a default value and force the user to specify his own limit.

I/O bit perturbation - Difficult to implement, because there is no easy way to know the inner details of the significant input data. The input objects are possibly complex. It would be necessary to explore the instances through reflection and find a relation between the fields/properties and the resulting hash code. Pex could indeed be of some help, but are we then still in the scope of MbUnit?
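
A minimal sketch of how the first two metrics might be computed from the collected hash codes (the class and method names here are made up for illustration and are not a proposed MbUnit API):

using System.Collections.Generic;
using System.Linq;

static class HashCodeMetrics
{
    // Fraction of samples in which each of the 32 hash code bits is set.
    public static double[] BitSetFrequencies(ICollection<int> hashCodes)
    {
        var counts = new long[32];
        foreach (int hash in hashCodes)
            for (int bit = 0; bit < 32; bit++)
                if ((hash & (1 << bit)) != 0)
                    counts[bit]++;
        return counts.Select(c => (double)c / hashCodes.Count).ToArray();
    }

    // Observed collision rate: 1 - (distinct hash codes / total samples).
    public static double CollisionRate(ICollection<int> hashCodes)
    {
        return 1.0 - (double)hashCodes.Distinct().Count() / hashCodes.Count;
    }
}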

Any other ideas?

Original comment by Yann.Tre...@gmail.com on 9 Apr 2010 at 9:03

GoogleCodeExporter commented 9 years ago
Uniform Distribution - On the surface this sounds good.  Given a sufficiently large number of uniformly distributed random input samples, a good hash algorithm should probably yield uniformly distributed output values too.  Testing at the bit level won't work though, particularly if the input set is relatively small.  For example, a hash code for a class with 5000 possible values is not going to cover all bit patterns evenly.  Some bits will be more represented than others and some bits might never be used, even in a good hash algorithm.  Instead, just take a histogram of hash values and test for uniformity directly, perhaps using a Chi-square test or something.  (Rough sketches of this and of the collision calculation follow below.)

Collision Probability - This is pretty straightforward.  Suppose we test N uniformly distributed, randomly chosen inputs and find M collisions.  If the hash algorithm is good, then it should produce a uniform distribution of hash values.  Calculate the probability that we would have found M collisions among a set of N uniformly distributed random numbers chosen from a domain of 2^32 values.  Set a threshold.

Bit Perturbation - Obviously testing this by dataflow analysis is not going to fly.  Totally out of scope for MbUnit.  :)
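
Rough sketches of the two calculations above, purely to illustrate the direction (the class name, the bucket handling, and the birthday-problem approximation are my own assumptions, not existing MbUnit API; a real implementation would compare the chi-square statistic against a critical value from a statistics library and would compute an exact tail probability for the collision count):

using System.Collections.Generic;

static class HashCodeStatisticsSketch
{
    // Histogram uniformity: bucket the hash codes into bucketCount bins and compute
    // the chi-square statistic against a uniform expectation. The caller would
    // compare the result to the critical value for (bucketCount - 1) degrees of
    // freedom at the chosen significance level; that lookup is omitted here.
    public static double ChiSquareStatistic(IEnumerable<int> hashCodes, int bucketCount)
    {
        var observed = new long[bucketCount];
        long total = 0;
        foreach (int hash in hashCodes)
        {
            uint value = unchecked((uint)hash);           // treat as unsigned 32-bit
            observed[(int)(value % (uint)bucketCount)]++; // map onto the buckets
            total++;
        }
        double expected = (double)total / bucketCount;
        double chiSquare = 0.0;
        for (int i = 0; i < bucketCount; i++)
        {
            double diff = observed[i] - expected;
            chiSquare += diff * diff / expected;
        }
        return chiSquare;
    }

    // Birthday-problem approximation: among n uniformly random 32-bit hash values,
    // the expected number of colliding pairs is roughly n * (n - 1) / (2 * 2^32).
    // For example, 100,000 samples give about 1.16 expected collisions, so finding
    // dozens of collisions would strongly suggest a poor hash function.
    public static double ExpectedCollisions(long sampleCount)
    {
        const double domain = 4294967296.0; // 2^32
        return sampleCount * (sampleCount - 1) / (2.0 * domain);
    }
}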

Original comment by jeff.br...@gmail.com on 11 Apr 2010 at 8:55

GoogleCodeExporter commented 9 years ago

Original comment by Yann.Tre...@gmail.com on 2 Jul 2010 at 9:08