rheem-ecosystem / rheem

Rheem - a cross-platform data processing system
https://rheem-ecosystem.github.io

SparkLocalCallbackSink overhead issue #80

Open atroudi opened 6 years ago

atroudi commented 6 years ago

SparkLocalCallbackSink collects its output through inputRdd.toLocalIterator(). That could be optimal if execution continued in the same executor, but the local callback sink eventually sends all collected data to the driver node anyway, so inputRdd.collect() is more appropriate. Running real workloads also shows that collect() performs better.
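
To illustrate where the overhead may come from, here is a minimal sketch in plain Java. It does not use Spark or Rheem: ToyRdd, forEachViaLocalIterator and jobsSubmitted are hypothetical names made up for this example. It only models the documented behaviour that toLocalIterator() fetches partitions one at a time, running one job per partition in sequence, while collect() fetches all partitions in a single job. The trade-off is memory: collect() materializes the whole result on the driver, whereas toLocalIterator() needs only as much driver memory as the largest partition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical stand-in for an RDD: a list of partitions plus a job counter. */
final class ToyRdd<T> {
    private final List<List<T>> partitions;
    int jobsSubmitted = 0; // how many "jobs" the driver has launched so far

    ToyRdd(List<List<T>> partitions) {
        this.partitions = partitions;
    }

    /** Models collect(): a single job covers every partition. */
    List<T> collect() {
        jobsSubmitted++;
        List<T> all = new ArrayList<>();
        for (List<T> partition : partitions) {
            all.addAll(partition);
        }
        return all;
    }

    /**
     * Models draining toLocalIterator(): one job per partition, one after
     * another. (The real iterator is lazy; a sink that consumes everything
     * ends up launching all of these jobs anyway.)
     */
    void forEachViaLocalIterator(Consumer<? super T> callback) {
        for (List<T> partition : partitions) {
            jobsSubmitted++;
            partition.forEach(callback);
        }
    }
}

final class CallbackSinkSketch {
    public static void main(String[] args) {
        List<List<Integer>> parts = Arrays.asList(
                Arrays.asList(1, 2), Arrays.asList(3), Arrays.asList(4, 5, 6));

        // Variant 1: feed the callback through the per-partition iterator.
        ToyRdd<Integer> viaIterator = new ToyRdd<>(parts);
        List<Integer> seenByIterator = new ArrayList<>();
        viaIterator.forEachViaLocalIterator(seenByIterator::add);

        // Variant 2: collect once, then feed the callback from the local list.
        ToyRdd<Integer> viaCollect = new ToyRdd<>(parts);
        List<Integer> seenByCollect = new ArrayList<>();
        viaCollect.collect().forEach(seenByCollect::add);

        // Both variants deliver the same elements in the same order.
        System.out.println(seenByIterator.equals(seenByCollect));
        System.out.println("iterator jobs: " + viaIterator.jobsSubmitted);
        System.out.println("collect jobs: " + viaCollect.jobsSubmitted);
    }
}
```

In this toy model the callback sees the same data either way, but the iterator variant launches three jobs for three partitions while the collect variant launches one. Whether that accounts for the difference observed on the real workloads is an assumption, not something measured here.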