uber / RemoteShuffleService

Remote shuffle service for Apache Spark to store shuffle data on remote servers.

[WIP] Add tolerance in RSS cluster for server going away #97

Open mayurdb opened 1 year ago

mayurdb commented 1 year ago

Adds fault tolerance to the RSS cluster for the case where one or more servers go away. This is how the functionality works:

In the Spark patch, a new interface is added for the stage retry hook. I won't be able to add unit tests without these changes in the Spark binary. Maybe we can upload a fat jar to the repo for that.

Also, there is an open source patch for rolling back the shuffle map stage in Spark 3.0, which I haven't evaluated yet. Maybe we can use it to avoid the extensive changes here. I'll evaluate it and get back on that.

hiboyang commented 1 year ago

Thanks @mayurdb for the PR!