SparkLocalCallbackSink overhead issue

SparkLocalCallbackSink currently collects its output through inputRdd.toLocalIterator(). That could be optimal if execution continued on the same executor, but the local callback sink eventually sends all collected data to the driver node anyway, so inputRdd.collect() is more appropriate. Runs with real workloads also show that collect() performs better.
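To illustrate where the overhead comes from, here is a toy model in plain Python. It is not Spark code and not Rheem's implementation. It assumes that toLocalIterator() runs one job per partition and fetches partitions one at a time, while collect() computes all partitions in a single job. `ToyRdd`, `to_local_iterator`, `collect` and `jobs_run` are hypothetical names used only for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyRdd:
    """Toy stand-in for an RDD: a list of lazily computed partitions."""

    def __init__(self, partitions):
        self.partitions = partitions  # zero-argument callables, each returning a list
        self.jobs_run = 0             # number of "jobs" submitted so far

    def _run_job(self, indices):
        # One job computes the requested partitions in parallel.
        self.jobs_run += 1
        with ThreadPoolExecutor(max_workers=max(1, len(indices))) as pool:
            # pool.map keeps results in the order of `indices`.
            return list(pool.map(lambda i: self.partitions[i](), indices))

    def to_local_iterator(self):
        # Assumed behaviour: one job per partition, run one after another.
        for i in range(len(self.partitions)):
            for item in self._run_job([i])[0]:
                yield item

    def collect(self):
        # Assumed behaviour: a single job over all partitions.
        results = self._run_job(list(range(len(self.partitions))))
        return [item for part in results for item in part]


def make_rdd():
    # Four partitions with three elements each.
    return ToyRdd([lambda p=p: [p * 10 + k for k in range(3)] for p in range(4)])


rdd_a = make_rdd()
via_iterator = list(rdd_a.to_local_iterator())

rdd_b = make_rdd()
via_collect = rdd_b.collect()

print(via_iterator == via_collect)   # True: same elements, same order
print(rdd_a.jobs_run, rdd_b.jobs_run)  # 4 1
```

Both paths deliver the same data to the caller. In this model the iterator path pays for one job per partition and never overlaps the partitions, whereas collect() pays for one job. The trade-off is memory: the iterator path only needs room for one partition at a time, while collect() holds the whole result at once, which the callback sink does anyway.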

rheem-ecosystem / rheem

SparkLocalCallbackSink overhead issue #80