uber / RemoteShuffleService

Remote shuffle service for Apache Spark to store shuffle data on remote servers.

shuffle read error #54

Closed D3077 closed 3 years ago

D3077 commented 3 years ago

After enabling RSS for Spark, the shuffle read size shown on the stage page of the Spark UI does not match the shuffle write size. I am using the spark30 branch.

D3077 commented 3 years ago

Shuffle read: 427.2M, shuffle write: 3.6G.

D3077 commented 3 years ago

This issue occurs when --num-executors and --executor-cores are small.

mayurdb commented 3 years ago

It would be great if you could share a small sample application that reproduces this. I will also try to reproduce it on my end.

D3077 commented 3 years ago

Thank you for your reply.

spark-submit --master yarn --deploy-mode cluster \
  --driver-memory 5g --num-executors 5 --executor-cores 1 --executor-memory 10g \
  --conf spark.jars=hdfs:///tmp/shuffletest/remote-shuffle-service-0.0.9-client.jar \
  --conf spark.executor.extraClassPath=remote-shuffle-service-0.0.9-client.jar \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.RssShuffleManager \
  --conf spark.shuffle.rss.serviceRegistry.type=zookeeper \
  --conf spark.shuffle.rss.serviceRegistry.zookeeper.servers=$host:$port \
  --conf spark.shuffle.rss.dataCenter=dc1 \
  --conf spark.speculation=false \
  --conf spark.shuffle.rss.replicas=1 \
  --class com.example.operator.Test \
  /opt/test/operator-1.0.jar

When the Spark job is submitted in cluster mode and the stage information is viewed in the Spark UI, the shuffle read size is smaller than the shuffle write size.

Thank you.

D3077 commented 3 years ago

The read operation of RssShuffleReader should call mergeShuffleReadMetrics.
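To illustrate why a missing merge step could make the reported shuffle read size too small, here is a minimal, self-contained sketch in plain Java. It does not use Spark or RSS code: `TempReadMetrics`, `TaskReadMetrics`, `mergeTemps`, and `readAll` are hypothetical stand-ins that only model the pattern where a reader counts bytes into a temporary metrics object and the task-level totals change only once those temporary metrics are merged. How RssShuffleReader is actually structured is an assumption here.

```java
import java.util.ArrayList;
import java.util.List;

public class ShuffleReadMetricsSketch {

    // Hypothetical stand-in for the temporary metrics a single reader updates.
    static final class TempReadMetrics {
        long bytesRead;
        long recordsRead;

        void incBytesRead(long n) { bytesRead += n; }
        void incRecordsRead(long n) { recordsRead += n; }
    }

    // Hypothetical stand-in for the task-level metrics that a UI would display.
    static final class TaskReadMetrics {
        private final List<TempReadMetrics> temps = new ArrayList<>();
        private long bytesRead;
        private long recordsRead;

        TempReadMetrics createTemp() {
            TempReadMetrics temp = new TempReadMetrics();
            temps.add(temp);
            return temp;
        }

        // Recompute the totals from all temporary metrics, so calling this
        // more than once does not double count.
        void mergeTemps() {
            long bytes = 0;
            long records = 0;
            for (TempReadMetrics temp : temps) {
                bytes += temp.bytesRead;
                records += temp.recordsRead;
            }
            bytesRead = bytes;
            recordsRead = records;
        }

        long bytesRead() { return bytesRead; }
        long recordsRead() { return recordsRead; }
    }

    // Simulated read: each array element is the size of one fetched record.
    // The merge runs only when mergeOnCompletion is true.
    static long readAll(long[] recordSizes, TaskReadMetrics task, boolean mergeOnCompletion) {
        TempReadMetrics temp = task.createTemp();
        for (long size : recordSizes) {
            temp.incBytesRead(size);
            temp.incRecordsRead(1);
        }
        if (mergeOnCompletion) {
            task.mergeTemps();
        }
        return temp.bytesRead;
    }

    public static void main(String[] args) {
        long[] sizes = {100, 200, 300};

        TaskReadMetrics withoutMerge = new TaskReadMetrics();
        readAll(sizes, withoutMerge, false);
        System.out.println("without merge: " + withoutMerge.bytesRead());

        TaskReadMetrics withMerge = new TaskReadMetrics();
        readAll(sizes, withMerge, true);
        System.out.println("with merge: " + withMerge.bytesRead());
    }
}
```

In this model the reader that skips the merge leaves the task-level byte count at zero even though all the data was read, while the reader that merges on completion reports the full amount.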