uber / RemoteShuffleService

Remote shuffle service for Apache Spark to store shuffle data on remote servers.

Corrupted block detected during decompression #102

Open YutingWang98 opened 11 months ago

Hi, we have recently been seeing zstd corruption errors during shuffle reads.

org.apache.spark.SparkException: Job aborted due to stage failure: Task 300 in stage 7.0 failed 4 times, most recent failure: Lost task 300.3 in stage 7.0 (TID 5866) (100.65.134.162 executor 200): com.github.luben.zstd.ZstdException: Corrupted block detected
    at com.github.luben.zstd.ZstdDecompressCtx.decompressByteArray(ZstdDecompressCtx.java:216)
    at com.github.luben.zstd.Zstd.decompressByteArray(Zstd.java:409)
    at org.apache.spark.shuffle.rss.BlockDownloaderPartitionRecordIterator.fetchNextDeserializationIterator(BlockDownloaderPartitionRecordIterator.scala:178)

It does not seem to be related to the input files, since the Spark job succeeded when we retried it. Any idea what causes this, and whether it is related to the RSS client or server? Thanks.