luminousmen / luminousmen.com


https://luminousmen.com/post/spark-tips-dont-collect-data-on-driver #19

Closed · utterances-bot closed this issue 1 month ago

utterances-bot commented 4 years ago

Spark tips. Don't collect data on driver - Blog | luminousmen

Apache Spark is a major talking point in Big Data pipelines, boasting performance 10-100x faster than comparable tools. These speeds can be achieved by applying the tips described in the post.

https://luminousmen.com/post/spark-tips-dont-collect-data-on-driver

zizhaof commented 4 years ago

I don't understand the aggregateByKey and reduceByKey section. Do you mean that if we use aggregateByKey we can avoid the extra map step and therefore get rid of the temporary small files? And can you explain why the map operation would create a lot of small files? For reference, this is how I read the two patterns being compared; the per-key average task and all names below are my own assumptions, not code from the post:

```python
# Minimal sketch: per-key average with reduceByKey vs aggregateByKey (hypothetical data).
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
pairs = sc.parallelize([("a", 1), ("a", 3), ("b", 5)])

# With reduceByKey you first need a map to (sum, count) tuples:
sums_counts = (pairs.map(lambda kv: (kv[0], (kv[1], 1)))
                    .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1])))

# With aggregateByKey the zero value and seqOp replace that preliminary map:
sums_counts2 = pairs.aggregateByKey(
    (0, 0),                                   # zero value: (sum, count)
    lambda acc, v: (acc[0] + v, acc[1] + 1),  # seqOp: fold a value into the accumulator
    lambda a, b: (a[0] + b[0], a[1] + b[1]))  # combOp: merge accumulators across partitions

averages = sums_counts2.mapValues(lambda t: t[0] / t[1])
```
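Is that the comparison the section is making, or is the point about small files something else entirely?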

minnieshi commented 3 years ago

I think the 'reduceByKey' part is different from what Spark recommends in the RDD programming guide, https://spark.apache.org/docs/latest/rdd-programming-guide.html. There it says that groupByKey can perform worse than reduceByKey or aggregateByKey; the entry for groupByKey reads: "When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance. Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numPartitions argument to set a different number of tasks."
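To make the guide's point concrete, here is a minimal PySpark sketch of the contrast it describes (the word-count data is a made-up example, not from the post):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
words = sc.parallelize(["spark", "tips", "spark"]).map(lambda w: (w, 1))

# groupByKey shuffles every (key, 1) pair across the network,
# then the summing happens only after the shuffle:
counts_grouped = words.groupByKey().mapValues(sum)

# reduceByKey combines values within each partition before the shuffle,
# so far less data crosses the network:
counts_reduced = words.reduceByKey(lambda a, b: a + b)
```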