rheem-ecosystem / rheem

Rheem - a cross-platform data processing system
https://rheem-ecosystem.github.io

Question about PartitionSampleFunction #82

Closed runzhiwang closed 5 years ago

runzhiwang commented 5 years ago

@zkaoudi Hi, sorry to bother you. I don't understand why "list.add(element)" is necessary: https://github.com/rheem-ecosystem/rheem/blob/master/rheem-platforms/rheem-spark/src/main/java/org/qcri/rheem/spark/operators/SparkRandomPartitionSampleOperator.java#L225. Also, when does the case described in the code comment, "there are cases were the list will be smaller because of miscalculation of the partition size for mini-batch", actually occur? Thank you very much.

zkaoudi commented 5 years ago

@runzhiwang We may miscalculate the partition size when, for example, the number of tuples cannot be divided exactly by the number of partitions (lines 98-99), so that the last partition contains fewer tuples than we calculated. In this case, a tid that we randomly pick may not exist, or there may not be enough tuples to fill up the list. As a workaround, we add at least the last element of the partition to the list, so that the list is never empty. This means that the operator may return a smaller sample than requested when the partition size is miscalculated, and this still needs to be fixed.
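
The situation described above can be illustrated with a minimal, self-contained sketch. It is not the code of SparkRandomPartitionSampleOperator: the class name, the helper methods, and the ceiling-division estimate are all assumptions made for illustration. It assumes 10 tuples over 3 partitions, so the estimated partition size is 4 while the last partition actually holds only 2 tuples.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration; none of these names come from the Rheem code base.
class PartitionSampleSketch {

    // Assumed estimate: ceiling of totalTuples / numPartitions.
    // The real operator may compute the partition size differently.
    static int estimatePartitionSize(long totalTuples, int numPartitions) {
        return (int) ((totalTuples + numPartitions - 1) / numPartitions);
    }

    // Collects the elements whose position in the partition is one of the
    // randomly chosen tids. If fewer elements than requested were found,
    // the last element seen is appended so the result is never empty.
    static <T> List<T> samplePartition(Iterator<T> partition, Set<Integer> tids) {
        List<T> sample = new ArrayList<>();
        T element = null;
        int position = 0;
        while (partition.hasNext()) {
            element = partition.next();
            if (tids.contains(position)) {
                sample.add(element);
            }
            position++;
        }
        // Fallback for a miscalculated partition size: some tids did not exist.
        if (sample.size() < tids.size() && element != null) {
            sample.add(element);
        }
        return sample;
    }

    public static void main(String[] args) {
        int estimated = estimatePartitionSize(10, 3);
        System.out.println("estimated partition size: " + estimated);

        // The last partition holds only 2 tuples, but the tids were drawn
        // from the range [0, estimated), so positions 2 and 3 do not exist.
        List<String> lastPartition = Arrays.asList("i", "j");
        Set<Integer> tids = new TreeSet<>(Arrays.asList(2, 3));
        List<String> sample = samplePartition(lastPartition.iterator(), tids);
        System.out.println("requested " + tids.size() + ", got " + sample);
    }
}
```

In this sketch the last partition returns a single element even though two were requested, which is the "smaller sample" case. Note that this simple fallback can also duplicate the last element when it was already picked by one of the tids.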

runzhiwang commented 5 years ago

@zkaoudi I see, thank you very much.