fault tolerance of restarting server

uber / RemoteShuffleService

Remote shuffle service for Apache Spark to store shuffle data on remote servers.

Other

323 stars 100 forks source link

fault tolerance of restarting server #70

Open cpd85 opened 2 years ago

cpd85 commented 2 years ago

I'm a little confused, shouldn't a server using local storage that restarts be able to accept/handle downloads if the service can restart in 15 seconds or so, saving its state? it looks like on restart, the server will update its 'running version' to the current time so any existing write clients will fail at this line of code

https://github.com/uber/RemoteShuffleService/blob/7220c23694e0175e01719621707680a2718173cf/src/main/java/com/uber/rss/clients/ServerIdAwareSyncWriteClient.java#L149

cpd85 commented 2 years ago

@hiboyang just wanted to check if you might have context on the above?

hiboyang commented 2 years ago

Hi @cpd85 , this is due to a special case inside our previous environment, where when each server restarts, the behavior is unpredictable. To be safe, we do not want RSS client to read data from restarted server, to prevent reading bad data. Thus we update the running version.

This is a lacking feature in current RSS to gracefully handle server restart. If you are interested, welcome to contribute this feature.

cpd85 commented 2 years ago

@hiboyang definitely planning on contributing this feature. Could you elaborate on 'unpredictable' issue you saw? I think if the localstatestore has good status on what has been committed succesfully, it should be okay to restart right?

Also curious how you deploy code/config changes if a server restart causes such issues

hiboyang commented 2 years ago

The unpredictable issue is mostly related to the internal environment at that time. Kind of hard to explain.

It is better to redesign the server restart/recover feature, look forward to your contribution here.

By the way, are you running in YARN or Kubernetes?

cpd85 commented 2 years ago

@hiboyang running on aws ec2 for now, we don't have the infra at the moment to support kubernetes unfortunately. will let you know when i have a diff ready for restart/recover!

mayurdb commented 2 years ago

Hi @cpd85, looking forward to your contribution on gracefully handling server restarts.

In our internal deployments we were facing an issue of the node hosting RSS server going down much more frequently than restarts. We have added a fault tolerance for such cases by triggering a stage retry and picking a new list of RSS servers to execute a shuffle. That patch will help in preventing failures because of server restart cases as well. I'll try to port the feature as soon as possible.

YutingWang98 commented 1 year ago

@mayurdb Hi mayurdb! We also have this server down/restart issue quite frequently. Do you mind sharing your progress on the stage retry and new server list picking, or how you implemented it?