uber / RemoteShuffleService

Remote shuffle service for Apache Spark to store shuffle data on remote servers.

What may cause RssInvalidServerVersionException? #94

Open Lobo2008 opened 1 year ago

Lobo2008 commented 1 year ago

Hi, I am wondering:

Q1. Does RssInvalidServerVersionException occur when an RSS server (RSS-i) crashes and is immediately restarted by a shell script while some applications are still using it? The clients still hold the previous version of RSS-i, but the newly registered RSS-i has a different version.

# Could the following exception have the same cause?
org.apache.spark.shuffle.FetchFailedException: Detected server restart, current server: Server{rss04.xxx:12203, 1675897753258, rss04xxx:/data/}, previous server: Server{rss04.xxxx:12203, 1675895945858, rss04xxx:/data/}
    at org.apache.spark.shuffle.RssShuffleManager$$anon$2.resolveConnection(RssShuffleManager.scala:220)
    at com.uber.rss.clients.ServerConnectionCacheUpdateRefresher.refreshConnection(ServerConnectionCacheUpdateRefresher.java:49)
    at com.uber.rss.clients.ServerIdAwareSyncWriteClient.connectImpl(ServerIdAwareSyncWriteClient.java:133)
    at

Q2. What may cause this exception:

org.apache.spark.shuffle.FetchFailedException: Cannot fetch shuffle 0 partition 362 due to RssAggregateException (RssShuffleStageNotStartedException (Shuffle not started: DataBlockSocketReadClient 274 [/10.2xxx44973 -> /10.20xxx:12212 (1xxxx28)])
com.uber.rss.exceptions.RssShuffleStageNotStartedException: Shuffle not started: DataBlockSocketReadClient 274 [/10.2xxxx:44973 -> /10.2xxx12212 (10.xxxx)]
    at com.uber.rss.clients.ClientBase.checkOKResponseStatus(ClientBase.java:291)
    at com.uber.rss.clients.ClientBase.readResponseStatus(ClientBase.java:275)
    at ...

mayurdb commented 1 year ago

Q1. You are right. This happened because the server restarted after the client had initially connected to the earlier server instance. Ideally this should not be an issue. Maybe we can remove this check, @hiboyang?
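
To illustrate the kind of check discussed in Q1, here is a minimal sketch of how a client could detect a restart: it remembers the identity it first saw for each server address and flags a later identity with a different version. The class and method names (`ServerRestartCheck`, `ServerIdentity`, `isRestarted`) and the host name are hypothetical and not taken from the RemoteShuffleService code; the sketch also assumes the version value changes on every server start, as the two differing numbers in the "Detected server restart" message suggest.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; none of these names come from the RemoteShuffleService code base.
public class ServerRestartCheck {

    /** Identity a client remembers for a server: its address plus a per-start version. */
    public static final class ServerIdentity {
        final String hostAndPort;
        final String version; // assumption: a new value on every server start

        public ServerIdentity(String hostAndPort, String version) {
            this.hostAndPort = hostAndPort;
            this.version = version;
        }
    }

    // First identity seen for each address.
    private final Map<String, ServerIdentity> known = new ConcurrentHashMap<>();

    /**
     * Records the first identity seen for an address. Returns true when a later
     * identity at the same address carries a different version.
     */
    public boolean isRestarted(ServerIdentity current) {
        ServerIdentity previous = known.putIfAbsent(current.hostAndPort, current);
        return previous != null && !previous.version.equals(current.version);
    }

    public static void main(String[] args) {
        ServerRestartCheck check = new ServerRestartCheck();
        // First contact: nothing to compare against.
        System.out.println(check.isRestarted(new ServerIdentity("rss-host:12203", "1675895945858")));
        // Same address, new version: treated as a restart.
        System.out.println(check.isRestarted(new ServerIdentity("rss-host:12203", "1675897753258")));
    }
}
```

Removing the check, as suggested above, would amount to no longer failing the task when `isRestarted` returns true; whether that is safe depends on how the server recovers its shuffle state after a restart.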

Q2. That basically means the server you are trying to connect to has not yet received the shuffle data for the corresponding partition (identified by appId, appAttemptId, and shuffleId). Did this also happen when the server restarted?
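
As a rough illustration of the Q2 explanation, the sketch below models a server-side registry of shuffle stages keyed by appId, appAttemptId, and shuffleId. A read for a key the server has never seen is reported as "not started". All names here (`ShuffleStageRegistry`, `StageKey`, `markStarted`, `isStarted`) and the sample IDs are hypothetical, and the sketch assumes the registry lives only in memory, which may not match the real server.

```java
import java.util.Objects;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; none of these names come from the RemoteShuffleService code base.
public class ShuffleStageRegistry {

    /** Key identifying one shuffle stage on the server. */
    public static final class StageKey {
        final String appId;
        final String appAttemptId;
        final int shuffleId;

        public StageKey(String appId, String appAttemptId, int shuffleId) {
            this.appId = appId;
            this.appAttemptId = appAttemptId;
            this.shuffleId = shuffleId;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof StageKey)) return false;
            StageKey other = (StageKey) o;
            return shuffleId == other.shuffleId
                    && appId.equals(other.appId)
                    && appAttemptId.equals(other.appAttemptId);
        }

        @Override
        public int hashCode() {
            return Objects.hash(appId, appAttemptId, shuffleId);
        }
    }

    // Assumption: kept in memory only, so a fresh process starts with no stages.
    private final Set<StageKey> started = ConcurrentHashMap.newKeySet();

    /** Called when a writer first uploads data for a stage. */
    public void markStarted(StageKey key) {
        started.add(key);
    }

    /** A reader asking for a stage that is not registered gets "not started". */
    public boolean isStarted(StageKey key) {
        return started.contains(key);
    }

    public static void main(String[] args) {
        StageKey key = new StageKey("app-1", "attempt-1", 0);

        ShuffleStageRegistry beforeRestart = new ShuffleStageRegistry();
        beforeRestart.markStarted(key);
        System.out.println(beforeRestart.isStarted(key));

        // A new registry stands in for a restarted server under the in-memory assumption.
        ShuffleStageRegistry afterRestart = new ShuffleStageRegistry();
        System.out.println(afterRestart.isStarted(key));
    }
}
```

Under that in-memory assumption, a server restart between the write and the read would produce exactly this kind of "not started" answer, which is why it matters whether the two errors occurred together.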

hiboyang commented 1 year ago

Previously RSS did not handle server restarts well, which is why those checks were added. I feel we could remove them.