I like it. It's a kind of stochastic criterion. Testers use these all the time, but they don't always get the statistics right, so the tests can be brittle or pass for the wrong reasons.
I think one important consideration will be efficiency. We'll want to generate and test lots of instances really fast. That means we probably don't want to create a new test step for each one.
Interestingly, I think this problem could be solved quite well using dataflow analysis. Part of defining a good hash code is ensuring that all variations in the input correspond to a wide spread of variations in the hash code. So if we could track where each bit in the hash code came from, we could compute the probability of a given bit of the hash code being perturbed by a one-bit change in an input. Ideally all bits of the hash code would be equally likely to be perturbed in this way. Of course that's kind of an abstract definition (and I'm not sure it's accurate). Sounds like something Pex could help with...
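For what it's worth, the perturbation idea could also be approximated empirically rather than via dataflow analysis: flip one input bit at a time and count how often each hash bit changes. A minimal sketch, assuming the input can be treated as raw bytes (every name here is hypothetical, not proposed API):

```csharp
using System;

static class AvalancheSketch
{
    // For each of the 32 hash bits, estimates the probability that it
    // flips when a single input bit is flipped. Ideally each entry
    // approaches 0.5 (the "avalanche" property).
    static double[] MeasureAvalanche(Func<byte[], int> hash, int inputBytes, int samples)
    {
        var flips = new long[32];
        long trials = 0;
        var rng = new Random(42);

        for (int s = 0; s < samples; s++)
        {
            var input = new byte[inputBytes];
            rng.NextBytes(input);
            int baseline = hash(input);

            for (int bit = 0; bit < inputBytes * 8; bit++)
            {
                input[bit / 8] ^= (byte)(1 << (bit % 8));   // flip one input bit
                int diff = baseline ^ hash(input);
                input[bit / 8] ^= (byte)(1 << (bit % 8));   // restore it

                for (int h = 0; h < 32; h++)
                    if ((diff & (1 << h)) != 0)
                        flips[h]++;
                trials++;
            }
        }

        var probabilities = new double[32];
        for (int h = 0; h < 32; h++)
            probabilities[h] = (double)flips[h] / trials;
        return probabilities;
    }
}
```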
Original comment by jeff.br...@gmail.com on 9 Apr 2010 at 9:10
Can't wait for v3.3 :)
Original comment by Yann.Tre...@gmail.com on 9 Apr 2010 at 10:24
Just to formalize my thoughts.
I think that most developers (including myself) implement GetHashCode in a very subjective way: shifting the bits left, shifting them right, XORing them together, or just shaking them a bit. Then they assume that the result "must" be OK without any objective proof, other than "I remember having seen Oren Skeet, Jon Hanselman, or another random Scott Eini doing that, so it must be good enough!"
The truth is that we know it actually depends on the data. A hash algorithm might be good in some scenarios and bad in others.
So the goal is to provide a convenient and easy way to verify the "efficiency" of an implementation against real-world values. A simple approach is to let the user provide a reasonably large set of distinct real-world values to a dedicated contract verifier (either from an exhaustive catalog/database, or created with a randomly generated key). The contract verifier would then extract the hash code from every instance, store the results, and compute useful statistical metrics. Finally it would run a few tests to verify that the metrics are in the expected range. The limits and thresholds could be changed by modifying the properties of the contract verifier, but we could provide reasonable default values.
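To make the shape of that concrete, here is a rough sketch of what such a verifier might look like. The class and property names are made up for illustration, not actual MbUnit API; the two metric checks are sketched under the corresponding metrics below.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape of the proposed contract verifier; the names are
// illustrative only, not an actual MbUnit API.
class HashCodeAcceptanceContract<T>
{
    // The reasonably large set of distinct real-world values.
    public IEnumerable<T> DistinctInstances { get; set; }

    public void Verify()
    {
        int[] hashes = DistinctInstances.Select(x => x.GetHashCode()).ToArray();

        // Each metric becomes a separate test with its own configurable
        // threshold, exposed as a property with a reasonable default.
        VerifyUniformDistribution(hashes);
        VerifyCollisionProbability(hashes);
    }

    void VerifyUniformDistribution(int[] hashes) { /* sketched below */ }
    void VerifyCollisionProbability(int[] hashes) { /* sketched below */ }
}
```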
Some interesting metrics are:
Uniform Distribution - A simple approach could be to verify that each bit of the 32-bit hash code is set in at least N% of the cases. It is probably necessary to be able to exclude the highest bits from the verification when the user knows that the number of possible instances is low.
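A naive version of the VerifyUniformDistribution method from the sketch above might look like this, with a hypothetical excludedHighBits parameter for the small-population case just mentioned:

```csharp
// Verifies that each included bit of the hash code is set in at least
// minRatio of the samples; excludedHighBits skips the top bits when the
// number of possible instances is known to be low.
static void VerifyBitDistribution(int[] hashes, double minRatio, int excludedHighBits)
{
    for (int bit = 0; bit < 32 - excludedHighBits; bit++)
    {
        int setCount = 0;
        foreach (int h in hashes)
            if ((h & (1 << bit)) != 0)
                setCount++;

        double ratio = (double)setCount / hashes.Length;
        if (ratio < minRatio)
            throw new InvalidOperationException(
                $"Bit {bit} is set in only {ratio:P1} of {hashes.Length} hash codes.");
    }
}
```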
Collision Probability - A bit tricky, because a collision probability of 100% is perfectly acceptable (and unavoidable) when the number of possible instances is greater than 2^32. However, for a small set of possible values, we may want to require 0%. We should perhaps not provide a default value and instead force the user to specify their own limit.
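The collision metric from the same sketch could be as simple as counting duplicate hash codes among values that are known to be distinct (assuming LINQ is available):

```csharp
using System.Linq;

// Fraction of samples whose hash code collides with another sample.
// The inputs are assumed to be distinct objects, so any duplicate
// hash code is a genuine collision.
static double CollisionRate(int[] hashes)
{
    int distinctHashes = hashes.Distinct().Count();
    return (double)(hashes.Length - distinctHashes) / hashes.Length;
}
```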
I/O bit perturbation - Difficult to implement, because there is no easy way to know the inner details of the significant input data. The input objects are possibly complex. It would be necessary to explore the instances through reflection and find a relation between fields/properties and the resulting hash code. Pex could indeed be of some help, but are we then still within the scope of MbUnit?
Any other ideas?
Original comment by Yann.Tre...@gmail.com on 9 Apr 2010 at 9:03
Uniform Distribution - On the surface this sounds good. Given a sufficiently large number of uniformly distributed random input samples, a good hash algorithm should probably yield uniformly distributed output values too. Testing at the bit level won't work though, particularly if the input set is relatively small. For example, a hash code for a class with 5000 possible values is not going to cover all bit patterns evenly. Some bits will be more represented than others, and some bits might never be used even in a good hash algorithm. Instead, just take a histogram of hash values and test for uniformity directly, perhaps using a chi-square test or something.
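A sketch of that histogram approach, computing the chi-square statistic against a uniform expectation; the caller would compare the result to the critical value for buckets - 1 degrees of freedom at the chosen significance level:

```csharp
using System;

static class UniformityCheck
{
    // Chi-square statistic of the observed hash values against a uniform
    // distribution over `buckets` equal-width bins of the 32-bit range.
    static double ChiSquare(int[] hashes, int buckets)
    {
        var observed = new int[buckets];
        foreach (int h in hashes)
        {
            // Map the full uint range onto bucket indices 0..buckets-1.
            int bucket = (int)((uint)h / (uint.MaxValue / (uint)buckets + 1));
            observed[bucket]++;
        }

        double expected = (double)hashes.Length / buckets;
        double chi2 = 0.0;
        foreach (int o in observed)
            chi2 += (o - expected) * (o - expected) / expected;
        return chi2;
    }
}
```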
Collision Probability - This is pretty straightforward. Suppose we test N uniformly distributed, randomly chosen inputs and find M collisions. If the hash algorithm is good then it should produce a uniform distribution of hash values. Calculate the probability of finding M collisions among a set of N uniformly distributed random numbers chosen from a domain of 2^32 values. Set a threshold.
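For reference, under that uniformity assumption the expected number of colliding pairs follows the birthday problem, roughly E[M] ≈ N(N - 1) / 2^33 when N is small relative to 2^32. A quick sketch:

```csharp
using System;

static class CollisionExpectation
{
    // Expected number of colliding pairs among n samples drawn uniformly
    // at random from the 2^32 possible hash values (birthday-problem
    // approximation, valid for n much smaller than 2^32).
    static double ExpectedCollisions(long n)
    {
        return (double)n * (n - 1) / Math.Pow(2, 33);
    }
}
```

For example, with N = 100,000 samples the expectation is about 100000 × 99999 / 2^33 ≈ 1.2 collisions, so observing dozens of them would be strong evidence of a poor hash.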
Bit Perturbation - Obviously testing this by dataflow analysis is not going to fly. Totally out of scope for MbUnit. :)
Original comment by jeff.br...@gmail.com on 11 Apr 2010 at 8:55
Original comment by Yann.Tre...@gmail.com on 2 Jul 2010 at 9:08
Original issue reported on code.google.com by Yann.Tre...@gmail.com on 9 Apr 2010 at 8:37