ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0
1.16k stars 373 forks source link

[RayCluster][Status][1/n] Remove ClusterState Unhealthy #2068

Closed kevin85421 closed 5 months ago

kevin85421 commented 6 months ago

Why are these changes needed?

We can think of RayCluster as a collection of K8s ReplicaSets. In this case, RayCluster status should be similar to ReplicaSetStatus. The ReplicaSetStatus doesn't check the Pod commands which belong to the data plane logic. The Ray command validation should be done by Ray instead of KubeRay in my opinion.

Related issue number

Checks

kevin85421 commented 6 months ago

cc @Yicheng-Lu-llll would you mind reviewing this PR? Thanks!

Yicheng-Lu-llll commented 5 months ago

lgtm!

I also manually test the case ray start command is incorrect:

ubuntu@ip-172-31-38-87:~/others/kuberay$ kubectl get pod
# NAME                                          READY   STATUS             RESTARTS      AGE
# kuberay-operator-8b57f8d84-klv75              1/1     Running            0             2m36s
# raycluster-kuberay-head-56cfw                 0/1     CrashLoopBackOff   3 (32s ago)   115s
# raycluster-kuberay-worker-workergroup-8p9np   0/1     Init:0/1           0             115s

ubuntu@ip-172-31-38-87:~/others/kuberay$ kubectl logs raycluster-kuberay-head-56cfw
# Usage: ray start [OPTIONS]
# Try 'ray start --help' for help.

# Error: No such option: --appple

# The ClusterState will not be showed.
ubuntu@ip-172-31-38-87:~/others/kuberay$ kubectl describe raycluster
# Status:
#   Desired CPU:              2
#   Desired GPU:              0
#   Desired Memory:           3G
#   Desired TPU:              0
#   Desired Worker Replicas:  1
#   Endpoints:
#     Client:     10001
#     Dashboard:  8265
#     Metrics:    8080
#     Redis:      6379
#     Serve:      8000
#   Head:
#     Pod IP:             10.244.0.6
#     Service IP:         10.96.249.54
#   Last Update Time:     2024-04-05T23:36:52Z
#   Max Worker Replicas:  3
#   Min Worker Replicas:  1
#   Observed Generation:  1
kevin85421 commented 5 months ago

@andrewsykim Does https://github.com/ray-project/kuberay/pull/2068#discussion_r1556366008 make sense to you? Thanks!

andrewsykim commented 5 months ago

Yes, we should add a field in HeadInfo to indicate whether the head Pod is ready or not as I mentioned in https://github.com/ray-project/kuberay/pull/1930#pullrequestreview-1978124073.

Removing states in ClusterState or even the entire field as suggested in https://github.com/ray-project/kuberay/pull/1930#pullrequestreview-1978124073 is a breaking change I would consider fine for alpha or beta APIs, but not a v1 API.

Adding new fields in HeadInfo is fine, but removing possible values of ClusterState or the ClusterState field is breaking compatibiltiy

andrewsykim commented 5 months ago

My personal suggest is to first add the new field in HeadInfo and then re-evaluate whether removing unhealthy cluster state is necessary. Can we instead just have ClusterState = unhealthy reflect the value of the state in HeadInfo without removing it?

kevin85421 commented 5 months ago

My personal suggest is to first add the new field in HeadInfo and then re-evaluate whether removing unhealthy cluster state is necessary. Can we instead just have ClusterState = unhealthy reflect the value of the state in HeadInfo without removing it?

Thank you for raising awareness about compatibility. I agree that we should be careful regarding the compatibility of ready. However, unhealthy is a different case. You can take a look at the definition of unhealthy; it pertains only to a very niche case involving rayStartParams. rayStartParams is typically empty for most users. In the 1.5 years that I have been working on KubeRay, I haven’t encountered the codepath related to unhealthy or seen any user running into it.

andrewsykim commented 5 months ago

However, unhealthy is a different case. You can take a look at the definition of unhealthy; it pertains only to a very niche case involving rayStartParams

Thanks, I missed this part. I agree it's fine to remove if it only pertains to validating rayStarParams. But https://github.com/ray-project/kuberay/pull/1930 proposes expanding the definition of unhealthy as well, which I think is valid. The definitions of ready and unhealthy are connected, because "not ready" is inherently "unhealthy". See my latest comment (https://github.com/ray-project/kuberay/pull/1930#issuecomment-2043811551) for why I think it's fine to change the definition of ready, which I think also applies to unhealthy.

kevin85421 commented 5 months ago

I think it's fine to change the definition of ready

I replied in https://github.com/ray-project/kuberay/pull/1930#issuecomment-2044163617 to explain why it is a breaking change and how it may create "unknown unknowns" for users, as I mentioned in https://github.com/ray-project/kuberay/pull/1930#issuecomment-2043298363. If we can't reach a consensus, we can have a chat next week.

andrewsykim commented 5 months ago

Thanks @kevin85421 -- thinking about this more I think this PR is good to merge based on the current definition of unhealthy. We can always re-add it after discussion in https://github.com/ray-project/kuberay/pull/1930 and before v1.2