As we get closer to supporting multiple clusters, our use of ray up / ray down / ray-on-golem start / ray-on-golem stop intensifies, and we are encountering new problems. When head node creation fails during a fresh ray up, the intuitive reaction is to call ray up again. The problem is that the webserver still holds the existing "corrupted" state, so retrying ray up makes no progress. The user has to know that a manual call to ray-on-golem stop is required to proceed. Let's address that.
As Ray does not have a concept of a cluster the way we do, we can tie our notion of a cluster to the fate of the head node: in Ray, the head node plays the role of the central, single point of state.
If the head node setup fails, the webserver needs to clean up the whole cluster state so that it is ready for the next ray up call.
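
A rough sketch of the intended behavior, not the actual webserver code: `ClusterRegistry`, `create_cluster`, `cleanup` and `HeadNodeSetupError` are hypothetical names used only for illustration, and the cluster state is assumed to be a simple in-memory mapping. The point is that a failed head node setup drops the whole cluster entry, so a retried ray up starts from a clean state.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


class HeadNodeSetupError(Exception):
    """Raised when the head node could not be created or set up."""


class ClusterRegistry:
    """Hypothetical in-memory cluster state held by the webserver."""

    def __init__(self) -> None:
        # cluster name -> list of node ids (head node first)
        self.clusters: Dict[str, List[str]] = {}

    async def create_cluster(
        self, name: str, start_head_node: Callable[[], Awaitable[str]]
    ) -> str:
        # Register the cluster before the head node exists, as a request
        # handler typically would.
        self.clusters[name] = []
        try:
            head_id = await start_head_node()
        except Exception as exc:
            # The cluster shares the fate of its head node: on failure,
            # remove the whole cluster state instead of leaving it half-built.
            await self.cleanup(name)
            raise HeadNodeSetupError(f"head node setup failed for {name!r}") from exc
        self.clusters[name].append(head_id)
        return head_id

    async def cleanup(self, name: str) -> None:
        # Assumption: real cleanup would also stop activities and release
        # resources; here it only forgets the cluster.
        self.clusters.pop(name, None)
```

With this shape, a second ray up after a failed one hits an empty registry and can proceed without a manual ray-on-golem stop.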