twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0
3.5k stars 706 forks

Use element's hash code in `RandomNextInt` generator #1914

ttim closed 5 years ago

ttim commented 5 years ago

At Twitter we use `.shard` to reshuffle data. Sometimes most of the input files for `.shard` are extremely small (around 100 items), and since all mappers reuse the same random generator for shard, all of these items end up on the same reducers, which leads to skew in the output files.
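To make the failure mode concrete, here is a minimal sketch of it in plain Scala. It does not use Scalding: `ShardSkewDemo`, `assign`, the seed value, and the reducer count are all hypothetical, and it assumes that "reuse the random generator" means every mapper starts from the same seed.

```scala
import scala.util.Random

object ShardSkewDemo {
  // Hypothetical stand-in for one mapper: picks a reducer for each item using
  // only a random generator that starts from a fixed seed.
  def assign(items: Seq[String], reducers: Int, seed: Long): Seq[Int] = {
    val rng = new Random(seed)
    items.map(_ => rng.nextInt(reducers))
  }

  def main(args: Array[String]): Unit = {
    val reducers = 1000
    // Three small input files with different contents, 100 items each.
    val files = Seq("a", "b", "c").map(p => (1 to 100).map(i => s"$p-$i"))
    // Assumption being illustrated: every mapper starts from the same seed.
    val perFile = files.map(f => assign(f, reducers, seed = 42L))
    // All mappers produce the identical reducer sequence...
    println(perFile.distinct.size == 1)
    // ...so the 300 items land on at most 100 of the 1000 reducers.
    println(perFile.flatten.distinct.size <= 100)
  }
}
```

Because the item contents never influence the choice, adding more small files does not spread the load any further.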

In this review I changed `RandomNextInt` to XOR its value with the element's hashCode, so it handles these cases better while still being more or less deterministic.
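The idea can be sketched roughly as follows. This is not the actual `RandomNextInt` code from the review: `HashMixedNextInt` and its constructor are hypothetical names, and the only thing taken from the description is the XOR of a generated value with the element's `hashCode`.

```scala
import scala.util.Random

object HashMixedDemo {
  // Hypothetical helper, not Scalding's RandomNextInt: XORs the generator's
  // next value with the element's hashCode.
  final class HashMixedNextInt(seed: Long) {
    private val rng = new Random(seed)
    def apply(elem: Any): Int = rng.nextInt() ^ elem.hashCode
  }

  // Runs one simulated mapper over its items with a fresh generator.
  def run(items: Seq[String], seed: Long): Seq[Int] = {
    val next = new HashMixedNextInt(seed)
    items.map(next(_))
  }

  def main(args: Array[String]): Unit = {
    // Same seed and same elements give the same output, so reruns agree.
    println(run(Seq("a-1", "a-2"), 42L) == run(Seq("a-1", "a-2"), 42L))
    // Same seed but different elements no longer collide by construction.
    println(run(Seq("a-1"), 42L) != run(Seq("b-1"), 42L))
  }
}
```

With the same seed, two mappers now diverge whenever their elements have different hash codes, while a rerun over the same input reproduces the same values.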

athuras commented 5 years ago

the pain that led to this was unspeakable.

ttim commented 5 years ago

@johnynek, changed it to return a value from 0 to modulus.
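As a side note on the range: a value produced by XOR can be negative, so mapping it into a shard range needs a non-negative remainder. The sketch below is one way to do that, with a hypothetical `toShard` helper. The comment does not say whether the upper bound is inclusive; this sketch assumes it is exclusive, i.e. `[0, modulus)`.

```scala
object ShardIndex {
  // Hypothetical helper: maps any Int, including negative values, into
  // [0, modulus). The exclusive upper bound is an assumption.
  def toShard(value: Int, modulus: Int): Int = {
    require(modulus > 0, "modulus must be positive")
    Math.floorMod(value, modulus)
  }

  def main(args: Array[String]): Unit = {
    println(-7 % 5)          // -2: plain % keeps the sign of the dividend
    println(toShard(-7, 5))  // 3: floorMod always yields a non-negative result here
  }
}
```

`Math.floorMod` is used instead of `Math.abs(value) % modulus` because `Math.abs(Int.MinValue)` is still negative.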