Closed: Lobo2008 closed this issue 1 year ago
There is a max bytes limit in the shuffle server to protect the server, see https://github.com/uber/RemoteShuffleService/blob/master/src/main/java/com/uber/rss/execution/ShuffleExecutor.java#L81
You could change that value if your shuffle data exceeds that limit.
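For anyone reading along, here is a minimal sketch of how such a per-app byte cap can be enforced. This is not the actual ShuffleExecutor code; the class, method, and exception below are hypothetical, and only the 3TB default mirrors the value discussed in this thread.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch (not the real ShuffleExecutor) of a per-app
// shuffle write cap like DEFAULT_APP_MAX_WRITE_BYTES.
public class AppWriteLimiter {
    // Assumption: 3 TB default, matching the value referenced in this thread.
    private static final long APP_MAX_WRITE_BYTES = 3L * 1024 * 1024 * 1024 * 1024;

    private final ConcurrentHashMap<String, AtomicLong> bytesWrittenPerApp =
            new ConcurrentHashMap<>();

    // Called for every shuffle write; fails once the app exceeds the cap.
    public void recordWrite(String appId, long numBytes) {
        long total = bytesWrittenPerApp
                .computeIfAbsent(appId, k -> new AtomicLong())
                .addAndGet(numBytes);
        if (total > APP_MAX_WRITE_BYTES) {
            // The real server raises RssTooMuchDataException at this point.
            throw new IllegalStateException(
                    "App " + appId + " exceeded max shuffle write bytes: " + total);
        }
    }
}
```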
Thanks, I'll try it.
Hi @Lobo2008, let us know, as Bo mentioned, whether the max app shuffle data size per server is the issue or not. You should see an RssTooMuchDataException in the stack trace.
If that's not the issue, please check
> Hi @mayurdb
> No RssTooMuchDataException ever happened, just RssNetworkException.
> Wonder if the DEFAULT_APP_MAX_WRITE_BYTES=3TB is a one-stage shuffle size limitation or the accumulative size of all the shuffle write(?) stages for one application? Stage-6 has 3TB but still works fine.

I think that DEFAULT_APP_MAX_WRITE_BYTES is actually per server, so if you write 3TB of data but evenly distribute it to multiple servers you would not run into the issue.
I guess so.
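To make the per-server accounting concrete, a tiny standalone illustration (plain arithmetic, not RSS code) of how evenly spreading the same 3TB changes the load any single server sees:

```java
// Illustration only: per-server share when 3 TB of app shuffle data is
// spread evenly, compared against a 3 TB per-server cap.
public class PerServerShare {
    public static void main(String[] args) {
        final long GB = 1024L * 1024 * 1024;
        long totalShuffleBytes = 3L * 1024 * GB; // 3 TB written by the app
        for (int servers : new int[] {1, 4, 10}) {
            System.out.printf("%2d servers -> ~%d GB per server%n",
                    servers, totalShuffleBytes / servers / GB);
        }
        // 1 server -> 3072 GB (at the cap); 10 servers -> ~307 GB, well under it.
    }
}
```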
Hi @mayurdb
- It's the latest version. I cloned and compiled the master branch in April 2022.
- No RssTooMuchDataException ever happened, just RssNetworkException.
- I have re-run the app without changing the size as Bo mentioned (I'll try it later), and so far it runs well. I'll post the details when the application finishes or fails.
- Wonder if the DEFAULT_APP_MAX_WRITE_BYTES=3TB is a one-stage shuffle size limitation or the accumulative size of all the shuffle write(?) stages for one application? Stage-6 has 3TB but still works fine.
Finished successfully. But I found that the exception `hit exception writing heading bytes` is caused by one or some of the RSS servers running out of disk storage space.
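For anyone hitting the same symptom, a hedged sketch of a pre-write free-disk check that would surface this condition with a clearer message; the class name, directory argument, and 10GB threshold below are all hypothetical, not part of RSS:

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper (not part of RSS): fail fast with a clear message when
// the shuffle data directory is nearly out of space, instead of letting the
// write surface later as a network-level error.
public class DiskSpaceGuard {
    // Assumption: refuse writes once usable space drops below this threshold.
    private static final long MIN_FREE_BYTES = 10L * 1024 * 1024 * 1024; // 10 GB

    public static void checkBeforeWrite(String shuffleDataDir) throws IOException {
        File dir = new File(shuffleDataDir);
        long usable = dir.getUsableSpace(); // returns 0 if the path is invalid
        if (usable < MIN_FREE_BYTES) {
            throw new IOException("Low disk space on " + shuffleDataDir
                    + ": only " + usable + " bytes usable");
        }
    }
}
```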
Cool, glad you found the cause, and thanks for the update!
Running a 1TB~3TB Spark application, it always failed after running several hours. Below is the exception: