rheem-ecosystem / rheem

Rheem - a cross-platform data processing system
https://rheem-ecosystem.github.io

SparkShufflePartitionSampleOperator may choose same partitions #8

Closed zkaoudi closed 7 years ago

zkaoudi commented 7 years ago

Currently, when the operator is used inside a loop, it picks one partition at random and reads it sequentially. When it reaches the end of that partition, it again picks a random one from all the partitions, which means it may pick a partition that was already read. It would be better to ensure that it picks a different partition each time, so that the entire dataset can eventually be sampled.
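
A minimal sketch of the proposed behaviour, written in Python for illustration only and not taken from the operator's code: keep a shuffled list of the partition ids that have not been read yet, and reshuffle only once every partition has been visited. `PartitionPicker` and `next_partition` are hypothetical names, and the sketch assumes the number of partitions is known up front.

```python
import random


class PartitionPicker:
    """Hypothetical helper: returns partition ids in random order without repeats."""

    def __init__(self, num_partitions, seed=None):
        if num_partitions <= 0:
            raise ValueError("num_partitions must be positive")
        self._num_partitions = num_partitions
        self._rng = random.Random(seed)
        self._remaining = []  # ids not yet handed out in the current pass

    def next_partition(self):
        # Start a new pass only after all partitions have been visited once.
        if not self._remaining:
            self._remaining = list(range(self._num_partitions))
            self._rng.shuffle(self._remaining)
        return self._remaining.pop()
```

With this approach, any `num_partitions` consecutive calls that start on a pass boundary cover every partition exactly once, so repeated iterations eventually see the entire data.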

sekruse commented 7 years ago

Is that a bug, though? Or rather an enhancement? It's not violating any specification after all. :wink:

zkaoudi commented 7 years ago

Yes, I totally agree that it is not exactly a bug. I was just playing with the various labels :P I will change it to enhancement 👍