Adds fault tolerance in RSS for one or more servers going away. This is how the functionality works:
A node/server goes away.
Tasks reading/writing data from that server fail.
The code throws a fetch-failed exception in both the read and write flows.
If it's a shuffle map stage, the patch added in Spark rolls back the stage completely and retries the required stages: if the fetch failure is thrown while reading shuffle data, the current and parent stages are retried; if it is thrown while writing shuffle data, only the current stage is retried.
The patch also adds a stage-retry hook that is triggered before the stage is retried. The hook's implementation in RSS calls registerShuffle again for this particular shuffle (see the sketch after this list).
A new list of available servers is picked.
As in the normal flow, the set of RSS servers to be used is sent to the mappers and reducers as part of the shuffle handle.
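For illustration, here is a minimal sketch of what the retry-hook wiring could look like. Everything below is an assumption: the `ShuffleStageRetryHook` trait, `RssServerInfo`, and the `pickServers` function are hypothetical stand-ins for the actual interface in the Spark patch and the RSS server-discovery code, not the real APIs.

```scala
// Hypothetical interface the Spark patch would expose; the scheduler
// would invoke it just before a failed stage is resubmitted.
trait ShuffleStageRetryHook {
  def onStageRetry(shuffleId: Int): Unit
}

// Illustrative server metadata; the real RSS code resolves servers
// through its own registry/discovery mechanism.
case class RssServerInfo(host: String, port: Int)

// RSS-side hook implementation: on stage retry, register the shuffle
// again so a fresh list of healthy servers is picked. The new server
// set then flows to mappers and reducers via the shuffle handle, as in
// the normal flow.
class RssStageRetryHook(pickServers: Int => Seq[RssServerInfo])
    extends ShuffleStageRetryHook {

  // shuffleId -> servers currently assigned to that shuffle
  private val assignedServers =
    scala.collection.concurrent.TrieMap.empty[Int, Seq[RssServerInfo]]

  override def onStageRetry(shuffleId: Int): Unit = {
    // Re-registration: replace the stale assignment with fresh servers.
    val freshServers = pickServers(shuffleId)
    assignedServers.put(shuffleId, freshServers)
  }

  def serversFor(shuffleId: Int): Seq[RssServerInfo] =
    assignedServers.getOrElse(shuffleId, Seq.empty)
}
```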
In the Spark patch, a new interface is added for the stage-retry hook. I won't be able to add UTs without these changes in the Spark binary; maybe we can upload a fat jar to the repo for that.
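As a rough sketch of the Spark side, the patch could keep a registry of hooks and fire them before the failed stage is resubmitted. Again, `ShuffleStageRetryHooks` and `fireStageRetry` are hypothetical names (reusing the `ShuffleStageRetryHook` trait from the sketch above); this is not the actual patch code.

```scala
// Hypothetical registry the patched Spark binary could keep; the
// scheduler would call fireStageRetry(shuffleId) right before it
// resubmits a failed shuffle map stage.
object ShuffleStageRetryHooks {
  private val hooks =
    new java.util.concurrent.CopyOnWriteArrayList[ShuffleStageRetryHook]()

  // RSS's shuffle manager would register its hook here at startup.
  def register(hook: ShuffleStageRetryHook): Unit = { hooks.add(hook); () }

  // Fire every registered hook before the stage is retried.
  def fireStageRetry(shuffleId: Int): Unit =
    hooks.forEach(h => h.onStageRetry(shuffleId))
}
```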
Also, there is a patch in open source for rolling back a shuffle map stage in Spark 3.0 that I haven't evaluated yet. Maybe we can make use of it to avoid the long changes here; I'll evaluate and get back on that.