Rheem's cost functions do not consider how much load `ExecutionOperator`s can bear. For instance, `SparkCollectionSource` uses `sc.parallelize(...)` under the hood, which should be used only for small amounts of data. However, compared to writing data to HDFS and then reading it back into Spark (`JavaFileSink` + `SparkFileSource`), it is relatively fast. Therefore, Rheem might prefer the former option even for larger datasets.
A simple way to disallow this is to restrict the number of data quanta a certain operator may process. This way, we can make Rheem choose more robust plans.
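Such a restriction can be folded into the cost model itself: once the estimated input cardinality exceeds an operator's limit, the cost function returns a prohibitive value, so the optimizer never selects that operator for large inputs. The following is a minimal sketch of this idea in plain Java; the names (`bounded`, the per-quantum base cost, the limit of one million quanta) are illustrative assumptions, not Rheem's actual API.

```java
import java.util.function.LongUnaryOperator;

public class BoundedCostFunction {
    // Wrap a base cost function so that inputs beyond maxQuanta become
    // prohibitively expensive, effectively disallowing the operator there.
    static LongUnaryOperator bounded(LongUnaryOperator baseCost, long maxQuanta) {
        return inputQuanta -> inputQuanta > maxQuanta
                ? Long.MAX_VALUE                      // prohibitive: optimizer will avoid this operator
                : baseCost.applyAsLong(inputQuanta);  // normal estimate within the limit
    }

    public static void main(String[] args) {
        // Hypothetical SparkCollectionSource-like operator: cheap per quantum,
        // but only trustworthy up to 1M data quanta.
        LongUnaryOperator collectionSourceCost = bounded(n -> 10 * n, 1_000_000L);
        System.out.println(collectionSourceCost.applyAsLong(1_000));       // small input: ordinary cost
        System.out.println(collectionSourceCost.applyAsLong(10_000_000L)); // over the limit: Long.MAX_VALUE
    }
}
```

With such a cap in place, a plan enumerator that minimizes total cost will fall back to the more scalable alternative (e.g., `JavaFileSink` + `SparkFileSource`) whenever the cardinality estimate exceeds the operator's limit.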