oap-project / raydp

RayDP provides simple APIs for running Spark on Ray and integrating Spark with AI libraries.
Apache License 2.0
293 stars 66 forks source link

Changes for executor actor recovery using ray's fault tolerance. #391

Closed KiranP-d11 closed 6 months ago

KiranP-d11 commented 7 months ago

Pull request for the bug described in the issue https://github.com/oap-project/raydp/issues/364.

This is the fix for ray executors not recovering from OOM and other failures. The issue is because of the race condition:

 - Executor E1 dies lets say because of OOM
 - We try to kill it by firing stop call on E1 actor
 - Since the actor is not available, the stop task fails for E1
 - In the mean while, ray brings up the lost executor E1
 - The failed task (stop task) gets retried as there are task retries configured.
 - The stop task gets fired on the new executor which got recovered
 - The Recovered executor exits with status as user intended exit.
kira-lin commented 6 months ago

Hi @KiranP-d11 This is great. Thanks for your work! I have already merged the pr which fixes raydp-submit, can you please merge the main branch and try CI again? The file changes LGTM to me

KiranP-d11 commented 6 months ago

@kira-lin Merged the master branch and CI checks are passing.