rheem-ecosystem / rheem

Rheem - a cross-platform data processing system
https://rheem-ecosystem.github.io
5 stars 0 forks source link

Restrict operators to certain workload sizes #66

Closed sekruse closed 7 years ago

sekruse commented 7 years ago

Rheem's cost functions do not consider how much load ExecutionOperators can bear. For instance, SparkCollectionSource uses sc.parallelize(...) under the hood, which should be used only for small amounts of data. However, in contrast to writing data to HDFS and then reading it to Spark (JavaFileSink + SparkFileSource), it is relatively fast. Therefore, Rheem might prefer the former option even for larger datasets.

An easy way to disallow this is to restrict the number of data quanta a certain operator may process. This way, we can make Rheem choose more robust plans.