ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0
982 stars 330 forks source link

[Feat][RayCluster] Introduce the RayClusterStatus.Conditions field #2214

Open rueian opened 3 days ago

rueian commented 3 days ago

Why are these changes needed?

This is the first part of the process to refine the Ready/Failed status in RayClusterStatus https://docs.google.com/document/d/1bRL0cZa87eCX6SI7gqthN68CgmHaB6l3-vJuIse-BrY

In this first PR, we introduce the new RayClusterStatus.Conditions field, borrowed from the idea of k8s Deployment https://github.com/kubernetes/api/blob/857a946a225f212b64d42c68a7da0dc44837636f/apps/v1/types.go#L532-L542

Related issue number

Checks

andrewsykim commented 2 days ago

@rueian @kevin85421 can we have the API change and implementation in the same PR if possible? Otherwise, if we release v1.2 before the implementation we'd have to revert this PR.

kevin85421 commented 2 days ago

can we have the API change and implementation in the same PR if possible? Otherwise, if we release v1.2 before the implementation we'd have to revert this PR.

Do you mean adding HeadReady and RayClusterReplicaFailure only when we implement them? cc @andrewsykim

andrewsykim commented 2 days ago

Do you mean adding HeadReady and RayClusterReplicaFailure only when we implement them?

Yes, and also the addition of the new conditions field in status. We should avoid changing the CRD / API until there is an working / tested minimum implementation. What do you think?

andrewsykim commented 2 days ago

@kevin85421 on second thought, I think incremental PRs is fine as long as it's feature gated. I added initial feature gate implementation in https://github.com/ray-project/kuberay/pull/2219, what do you think?

rueian commented 1 day ago

I am ok with the feature gate or waiting for the implementation of HeadReady and RayClusterReplicaFailure.

We should avoid changing the CRD / API until there is an working / tested minimum implementation.

Hi @andrewsykim, do we still need to keep the old CRD / API if the gate is disabled? I thought that could only be achieved by introducing a new CRD version.