rheem-ecosystem / rheem

Rheem - a cross-platform data processing system
https://rheem-ecosystem.github.io
5 stars 0 forks source link

Create MapPartitionsOperator #14

Closed luckyasser closed 7 years ago

luckyasser commented 7 years ago

From @sekruse on July 13, 2016 9:20

Although not being a necessity, the "map partitions" operation is commonly offered across various data processing platforms - often for performance reasons. It takes a UDF partition -> partition and is eligible to mix a transformation with a combine step, amongst others. For that reason, it seems desirable to offer an according counterpart in Rheem. So far, there is no notion of partitions Rheem, but I suspect that most platforms have compatible notions that we can re-use in Rheem. A partition could be defined as any "locally sequentially accessible, non-repeating series of data quanta, where the partitions of a data set are pairwise disjoint and their union forms the complete dataset". This applies to both streams as well as distributed collections. Some corner cases need more thorough thought, though. For instance, can an empty dataset have no partition? If so, can there be more than one empty partition? Maybe one per machine?

Copied from original issue: daqcri/rheem#6