scala / bug

Scala 2 bug reports only. Please, no questions — proper bug reports only.
https://scala-lang.org
230 stars 21 forks source link

MurmurHash3 produces the same hash for lists of repeated values, regardless of list length #12826

Open boeseaw opened 11 months ago

boeseaw commented 11 months ago

Reproduction steps

Scala version: 2.13.2 (UPDATE: also 2.13.11)

scala> println(Seq(17, 17).hashCode)
-1314960578
scala> println(Seq(17, 17, 17).hashCode)
-1314960578

Problem

The implementation of MurmurHash3.listHash attempts to make the hashes of lists that are equivalent to a range the same as that of the range. It does this by recording the difference between the first and second item in the list, and then testing whether the difference between subsequent consecutive pairs in the list is the same or not. When all differences between consecutive pairs are the same, it recognizes this as a sequence and falls back on rangeHash, which only accounts for the first and last values in the "range", and the difference, which it treats as the step size of the range.

The current implementation does not correctly account for the case when the elements in the list are all the same. The difference between hashes of consecutive elements in this case is zero, and thus an invalid step size for a range. But since the difference is consistent between all pairs, it is still treated as a range for hashing purposes. With the first and last elements being the same regardless of the size of this list, this means that the hash will be the same regardless of the list size.

This seems like incorrect behavior as it violates the expectation that the algorithm produces well-distributed hashes, and the logical assumption that two lists with different lengths are unlikely to produce the same hash, regardless of specific elements they contain. The algorithm should be modified to differentiate sequences of the same element so that they are not treated as potential ranges. I believe that simply checking that rangeDiff is not 0 may be sufficient.

Although the implementations vary, arrayHash, orderedHash, and indexedSeqHash all have similar optimizations, and appear to likely have the same issue.

Ichoran commented 11 months ago

I think you're right that the rangeDiff == 0 check would be sufficient. I guess in listHash the easiest fix would be on line 293 to change rangeState = 2 to rangeState = if (rangeDiff == 0) 3 else 2.