uber / RemoteShuffleService

Remote shuffle service for Apache Spark to store shuffle data on remote servers.

write amplification #69

Open cpd85 opened 2 years ago

cpd85 commented 2 years ago

I'm noticing that some Spark apps which produce 11 TB of shuffle data on the external shuffle service produce closer to 18 TB of shuffle data on Remote Shuffle Service. Is some write amplification expected?

hiboyang commented 2 years ago

It may depend on how these metrics are calculated. Remote Shuffle Service does write some extra data for each shuffle record, such as the task attempt ID and partition ID, to track the record. But sometimes the metrics may also be off a little due to serialization/compression.
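As a rough sanity check on the figures reported in this thread, the sketch below computes the amplification factor and asks how small the average record would have to be for per-record metadata alone to account for it. The 16-byte header size is a made-up assumption for illustration, not the actual Remote Shuffle Service record layout, and the helper names are hypothetical.

```python
TB = 10**12  # decimal terabyte, in bytes


def amplification_factor(baseline_bytes: float, observed_bytes: float) -> float:
    """Ratio of bytes written on the remote service to bytes written on the baseline."""
    return observed_bytes / baseline_bytes


def avg_payload_explaining(factor: float, header_bytes: float) -> float:
    """Average record payload size (bytes) at which a fixed per-record header
    of `header_bytes` would by itself produce the given amplification factor.

    Derivation: factor = (payload + header) / payload
                => payload = header / (factor - 1)
    """
    if factor <= 1:
        raise ValueError("factor must be greater than 1")
    return header_bytes / (factor - 1)


# Figures from the thread: 11 TB on the external shuffle service, 18 TB on RSS.
factor = amplification_factor(11 * TB, 18 * TB)

# ASSUMPTION: 16 bytes of per-record metadata (hypothetical, for illustration only).
payload = avg_payload_explaining(factor, header_bytes=16)

print(f"amplification factor: {factor:.2f}")
print(f"average payload that a 16-byte header would explain: {payload:.1f} bytes")
```

Under that assumption, a factor of about 1.64 would need records averaging only about 25 bytes of payload, so for larger records the gap more likely comes from how the metrics are measured or from compression differences.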

cpd85 commented 2 years ago

Got it. It looks like compression isn't supported on the server side at the moment? My workloads tend to stress the SSDs rather than the CPU, so I think they could benefit from compression. I see this class, https://github.com/uber/RemoteShuffleService/blob/7220c23694e0175e01719621707680a2718173cf/src/main/java/com/uber/rss/common/Compression.java, but as far as I can tell it isn't actually used or configurable.
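To illustrate the trade-off being described (spending CPU to reduce bytes written to SSD), here is a minimal sketch that compresses a block of synthetic, repetitive shuffle-like data before it would be written. It uses Python's standard `zlib` module as a stand-in; it does not use or reflect the `Compression` class linked above, and the function names and sample data are hypothetical.

```python
import zlib


def compress_block(block: bytes, level: int = 1) -> bytes:
    """Compress one block of shuffle bytes. Level 1 favors speed over ratio."""
    return zlib.compress(block, level)


def decompress_block(data: bytes) -> bytes:
    """Inverse of compress_block, as a reader of the shuffle file would apply it."""
    return zlib.decompress(data)


# Synthetic, highly repetitive records (hypothetical stand-in for real shuffle data).
sample = b"".join(
    b"key-%06d,value-%06d;" % (i % 100, i % 7) for i in range(10_000)
)

compressed = compress_block(sample)

# Round trip must be lossless.
assert decompress_block(compressed) == sample

ratio = len(sample) / len(compressed)
print(f"raw: {len(sample)} bytes, compressed: {len(compressed)} bytes, ratio: {ratio:.1f}x")
```

The ratio on real shuffle data depends heavily on the serializer and on whether records were already compressed on the client side, so this only shows the shape of the idea.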