Closed: kevin85421 closed this PR 5 months ago
cc @Yicheng-Lu-llll would you mind reviewing this PR? Thanks!
lgtm!
I also manually tested the case where the ray start command is incorrect:
ubuntu@ip-172-31-38-87:~/others/kuberay$ kubectl get pod
# NAME READY STATUS RESTARTS AGE
# kuberay-operator-8b57f8d84-klv75 1/1 Running 0 2m36s
# raycluster-kuberay-head-56cfw 0/1 CrashLoopBackOff 3 (32s ago) 115s
# raycluster-kuberay-worker-workergroup-8p9np 0/1 Init:0/1 0 115s
ubuntu@ip-172-31-38-87:~/others/kuberay$ kubectl logs raycluster-kuberay-head-56cfw
# Usage: ray start [OPTIONS]
# Try 'ray start --help' for help.
# Error: No such option: --appple
# The ClusterState will not be shown.
ubuntu@ip-172-31-38-87:~/others/kuberay$ kubectl describe raycluster
# Status:
# Desired CPU: 2
# Desired GPU: 0
# Desired Memory: 3G
# Desired TPU: 0
# Desired Worker Replicas: 1
# Endpoints:
# Client: 10001
# Dashboard: 8265
# Metrics: 8080
# Redis: 6379
# Serve: 8000
# Head:
# Pod IP: 10.244.0.6
# Service IP: 10.96.249.54
# Last Update Time: 2024-04-05T23:36:52Z
# Max Worker Replicas: 3
# Min Worker Replicas: 1
# Observed Generation: 1
@andrewsykim Does https://github.com/ray-project/kuberay/pull/2068#discussion_r1556366008 make sense to you? Thanks!
Yes, we should add a field in HeadInfo to indicate whether the head Pod is ready or not, as I mentioned in https://github.com/ray-project/kuberay/pull/1930#pullrequestreview-1978124073.
Removing states from ClusterState, or even the entire field as suggested in https://github.com/ray-project/kuberay/pull/1930#pullrequestreview-1978124073, is a breaking change I would consider fine for alpha or beta APIs, but not for a v1 API.
Adding new fields to HeadInfo is fine, but removing possible values of ClusterState, or the ClusterState field itself, breaks compatibility.
My personal suggestion is to first add the new field in HeadInfo and then re-evaluate whether removing the unhealthy cluster state is necessary. Can we instead just have ClusterState = unhealthy reflect the value of the state in HeadInfo without removing it?
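The proposal above (add a readiness field to HeadInfo while keeping ClusterState and its existing values) can be sketched as follows. This is only an illustration of the idea from the discussion; the type and field names are assumptions, not the actual KubeRay v1 API:

```go
package main

import "fmt"

// HeadInfo sketch: PodReady is the proposed new field indicating
// whether the head Pod has passed its readiness checks.
// (Field names here are hypothetical, for illustration only.)
type HeadInfo struct {
	PodIP     string
	ServiceIP string
	PodReady  bool
}

// RayClusterStatus sketch: State keeps its existing possible values
// (including "unhealthy"), so v1 API compatibility is preserved.
type RayClusterStatus struct {
	State string
	Head  HeadInfo
}

func main() {
	status := RayClusterStatus{
		State: "unhealthy",
		Head:  HeadInfo{PodIP: "10.244.0.6", ServiceIP: "10.96.249.54", PodReady: false},
	}
	// ClusterState can mirror head readiness without removing the field.
	fmt.Println(status.State, status.Head.PodReady)
}
```

Under this shape, consumers that read ClusterState keep working, while new consumers can switch to the more precise HeadInfo readiness field.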
> My personal suggestion is to first add the new field in HeadInfo and then re-evaluate whether removing the unhealthy cluster state is necessary. Can we instead just have ClusterState = unhealthy reflect the value of the state in HeadInfo without removing it?
Thank you for raising awareness about compatibility. I agree that we should be careful regarding the compatibility of ready. However, unhealthy is a different case. You can take a look at the definition of unhealthy; it pertains only to a very niche case involving rayStartParams, and rayStartParams is typically empty for most users. In the 1.5 years that I have been working on KubeRay, I haven't encountered the codepath related to unhealthy or seen any user running into it.
> However, unhealthy is a different case. You can take a look at the definition of unhealthy; it pertains only to a very niche case involving rayStartParams
Thanks, I missed this part. I agree it's fine to remove if it only pertains to validating rayStartParams. But https://github.com/ray-project/kuberay/pull/1930 proposes expanding the definition of unhealthy as well, which I think is valid. The definitions of ready and unhealthy are connected, because "not ready" is inherently "unhealthy". See my latest comment (https://github.com/ray-project/kuberay/pull/1930#issuecomment-2043811551) for why I think it's fine to change the definition of ready, which I think also applies to unhealthy.
> I think it's fine to change the definition of ready
I replied in https://github.com/ray-project/kuberay/pull/1930#issuecomment-2044163617 to explain why it is a breaking change and how it may create "unknown unknowns" for users, as I mentioned in https://github.com/ray-project/kuberay/pull/1930#issuecomment-2043298363. If we can't reach a consensus, we can have a chat next week.
Thanks @kevin85421 -- thinking about this more I think this PR is good to merge based on the current definition of unhealthy. We can always re-add it after discussion in https://github.com/ray-project/kuberay/pull/1930 and before v1.2
Why are these changes needed?
We can think of a RayCluster as a collection of K8s ReplicaSets. In that case, the RayCluster status should be similar to ReplicaSetStatus. ReplicaSetStatus doesn't check the Pod commands, which belong to the data-plane logic. In my opinion, the Ray command validation should be done by Ray instead of KubeRay.
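The ReplicaSet analogy can be sketched like this. The struct below loosely mirrors the field names of k8s.io/api/apps/v1.ReplicaSetStatus, but it is a standalone illustration, not the real type; the point is that the status reports observed counts and has no field validating the container command:

```go
package main

import "fmt"

// ReplicaSetStatusLike is a simplified stand-in for ReplicaSetStatus.
// It only reports observed replica counts; a bad container command is
// never validated here -- it simply surfaces as Pods that never
// become Ready (e.g. CrashLoopBackOff).
type ReplicaSetStatusLike struct {
	Replicas      int32
	ReadyReplicas int32
}

func main() {
	// A head Pod crash-looping on an invalid `ray start` flag shows up
	// the same way a bad ReplicaSet command would: zero ready replicas.
	s := ReplicaSetStatusLike{Replicas: 1, ReadyReplicas: 0}
	fmt.Printf("replicas=%d ready=%d\n", s.Replicas, s.ReadyReplicas)
}
```

By the same logic, KubeRay would report head-Pod readiness in its status rather than validating the ray start command itself.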
Related issue number
Checks