At Twitter we use .shard to reshuffle data. Sometimes most of the input files for .shard are extremely small (around 100 items each), and since every mapper reuses the same random generator for shard, all of these items end up on the same reducers, which leads to skew in the output files.
In this review I changed RandomNextInt to xor the random value with the element's hashCode, so it handles this kind of case better while still being more or less deterministic.
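
To illustrate the skew and the effect of mixing in the hash, here is a minimal standalone sketch in Java. It is not the code from this review: the class and method names, the seed, and the use of Math.floorMod are assumptions made for the example, and it assumes "reusing the random generator" means every mapper starts from the same seed.

```java
import java.util.Random;

// Hypothetical simulation, not the actual RandomNextInt implementation.
final class ShardSkewSketch {

    // Behaviour described as the problem: the partition depends only on the
    // generator, so identically seeded mappers produce the same sequence.
    static int randomOnly(Random rng, int partitions) {
        return rng.nextInt(partitions);
    }

    // Sketch of the change: xor the random draw with the element's hashCode.
    // floorMod keeps the result in [0, partitions) even when the xor is negative.
    static int randomXorHash(Random rng, Object element, int partitions) {
        int mixed = rng.nextInt() ^ element.hashCode();
        return Math.floorMod(mixed, partitions);
    }

    // Counts how many distinct partitions receive at least one item when
    // `mappers` small inputs of `itemsPerMapper` items each are sharded.
    static int distinctPartitions(boolean useHash, int mappers, int itemsPerMapper,
                                  int partitions, long seed) {
        boolean[] used = new boolean[partitions];
        int count = 0;
        for (int m = 0; m < mappers; m++) {
            Random rng = new Random(seed); // assumption: same seed in every mapper
            for (int i = 0; i < itemsPerMapper; i++) {
                String element = "mapper-" + m + "-item-" + i;
                int p = useHash
                        ? randomXorHash(rng, element, partitions)
                        : randomOnly(rng, partitions);
                if (!used[p]) {
                    used[p] = true;
                    count++;
                }
            }
        }
        return count;
    }
}
```

With the random-only variant, 50 mappers of 100 items each can never touch more than 100 of 1000 partitions, because every mapper replays the same 100 draws. With the xor variant the partition also depends on the element, so items from different mappers spread out, and the result is still reproducible for a given seed and element.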