ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[RayCluster][Fix] Add expectations of RayCluster #2150

Open Eikykun opened 1 month ago

Eikykun commented 1 month ago

Why are these changes needed?

This PR attempts to address issues #715 and #1936 by adding expectation capabilities, so that Pods are observed in the expected state during the next Reconcile following Pod deletion or creation.
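For readers unfamiliar with the pattern, here is a minimal sketch of the expectation idea this PR is built around, using hypothetical names (`ExpectationStore`, `Expect`, `Observe`, `Satisfied`) rather than the PR's actual types: the controller records an expectation when it creates or deletes a Pod, the Pod event handlers mark it observed, and Reconcile holds off on further scaling until the informer cache has caught up.

```go
// Hypothetical sketch of the expectation pattern; not the PR's actual code.
package expectations

import (
	"sync"

	"k8s.io/apimachinery/pkg/types"
)

// action is what we expect to observe for a given Pod.
type action int

const (
	expectCreate action = iota
	expectDelete
)

// ExpectationStore tracks, per RayCluster, the Pods whose create/delete
// events have not yet been reflected in the informer cache.
type ExpectationStore struct {
	mu      sync.Mutex
	pending map[types.NamespacedName]map[types.UID]action // cluster -> pod UID -> expected action
}

func NewExpectationStore() *ExpectationStore {
	return &ExpectationStore{pending: map[types.NamespacedName]map[types.UID]action{}}
}

// Expect records that the controller just created or deleted a Pod and the
// cache is not yet guaranteed to reflect it.
func (s *ExpectationStore) Expect(cluster types.NamespacedName, pod types.UID, a action) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.pending[cluster] == nil {
		s.pending[cluster] = map[types.UID]action{}
	}
	s.pending[cluster][pod] = a
}

// Observe is called from the Pod event handlers; once the expected event
// shows up in the cache, the item is removed.
func (s *ExpectationStore) Observe(cluster types.NamespacedName, pod types.UID, a action) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if expected, ok := s.pending[cluster][pod]; ok && expected == a {
		delete(s.pending[cluster], pod)
		if len(s.pending[cluster]) == 0 {
			delete(s.pending, cluster)
		}
	}
}

// Satisfied reports whether Reconcile can trust the cached Pod list for the
// cluster; if false, the reconciler should requeue instead of scaling again.
func (s *ExpectationStore) Satisfied(cluster types.NamespacedName) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.pending[cluster]) == 0
}
```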

Similar solutions can be found in:

Related issue number

Checks

kevin85421 commented 1 month ago

Hi @Eikykun, thank you for the PR! I will review it next week. Are you on Ray Slack? We can iterate more quickly there since this is a large PR. My Slack handle is "Kai-Hsun Chen (ray team)". Thanks!

kevin85421 commented 1 month ago

I will review this PR tomorrow.

kevin85421 commented 1 month ago

cc @rueian Would you mind giving this PR a review? I think I don't have enough time to review it today. Thanks!

rueian commented 1 month ago

Just wondering: if client-go's workqueue ensures that no more than one consumer processes an equivalent reconcile.Request at any given time, why don't we clear the related informer cache when needed?

Eikykun commented 4 weeks ago

> Just wondering: if client-go's workqueue ensures that no more than one consumer processes an equivalent reconcile.Request at any given time, why don't we clear the related informer cache when needed?

Apologies, I'm not quite clear about what "related informer cache" refers to.

rueian commented 3 weeks ago

> Just wondering: if client-go's workqueue ensures that no more than one consumer processes an equivalent reconcile.Request at any given time, why don't we clear the related informer cache when needed?
>
> Apologies, I'm not quite clear about what "related informer cache" refers to.

According to https://github.com/ray-project/kuberay/issues/715, the root cause is the stale informer cache, so I am wondering if the issue can be solved by fixing the cache, for example doing a manual Resync somehow.

kevin85421 commented 3 weeks ago

I am reviewing this PR now. I will try to do a review iteration every 1 or 2 days.

kevin85421 commented 2 weeks ago

Btw, @Eikykun would you mind rebasing with the master branch and resolving the conflict? Thanks!

Eikykun commented 2 weeks ago

> According to #715, the root cause is the stale informer cache, so I am wondering if the issue can be solved by fixing the cache, for example doing a manual Resync somehow.

Got it. From a problem-solving standpoint, if we don't rely on an informer in the controller and instead query the ApiServer directly for pods, the cache consistency issue with etcd wouldn't occur. However, this approach would increase network traffic and hurt reconciliation efficiency. As far as I understand, the Resync() method in DeltaFIFO is not intended to ensure cache consistency with etcd, but rather to prevent event loss by means of periodic reconciliation.
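For comparison, here is a minimal sketch of the direct-read alternative mentioned above, assuming a controller-runtime manager: `GetAPIReader()` returns an uncached `client.Reader`, so the List goes straight to the API server. The `ray.io/cluster` label key is used for illustration.

```go
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// listPodsUncached bypasses the informer cache and reads Pods straight from
// the API server, trading extra API traffic for read-your-writes consistency.
func listPodsUncached(ctx context.Context, mgr ctrl.Manager, namespace, clusterName string) (*corev1.PodList, error) {
	pods := &corev1.PodList{}
	err := mgr.GetAPIReader().List(ctx, pods,
		client.InNamespace(namespace),
		client.MatchingLabels{"ray.io/cluster": clusterName}, // illustrative label selector
	)
	return pods, err
}
```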

Eikykun commented 2 weeks ago

> Btw, @Eikykun would you mind rebasing with the master branch and resolving the conflict? Thanks!

Thanks for your review! I will go through the review comments and resolve the conflicts later.

kevin85421 commented 2 weeks ago

@Eikykun would you mind installing pre-commit https://github.com/ray-project/kuberay/blob/master/ray-operator/DEVELOPMENT.md and fixing the linter issues? Thanks!

kevin85421 commented 2 weeks ago

At a quick glance, it seems that we create an ActiveExpectationItem for each Pod's creation, deletion, or update. I have some concerns about a scalability bottleneck caused by the memory usage. In the ReplicaSet controller's source code, it seems to only track the number of Pods expected to be created or deleted per ReplicaSet.

Follow up for ^
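For reference, a compressed sketch of the count-based bookkeeping the ReplicaSet controller uses (names simplified from the upstream controller_utils.go; this is not KubeRay code): it keeps only two counters per owner key and drains them as Pod events are observed.

```go
package countexp

import "sync"

// expectation mirrors the add/del counters that the upstream
// ControllerExpectations keeps per controller key.
type expectation struct {
	add int
	del int
}

// CountExpectations tracks, per owner key ("namespace/name"), how many Pod
// creations and deletions are still unobserved in the cache.
type CountExpectations struct {
	mu    sync.Mutex
	items map[string]*expectation
}

func NewCountExpectations() *CountExpectations {
	return &CountExpectations{items: map[string]*expectation{}}
}

func (c *CountExpectations) ExpectCreations(key string, n int) { c.adjust(key, n, 0) }
func (c *CountExpectations) ExpectDeletions(key string, n int) { c.adjust(key, 0, n) }
func (c *CountExpectations) CreationObserved(key string)       { c.adjust(key, -1, 0) }
func (c *CountExpectations) DeletionObserved(key string)       { c.adjust(key, 0, -1) }

func (c *CountExpectations) adjust(key string, add, del int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.items[key]
	if !ok {
		e = &expectation{}
		c.items[key] = e
	}
	e.add += add
	e.del += del
	if e.add < 0 {
		e.add = 0
	}
	if e.del < 0 {
		e.del = 0
	}
}

// SatisfiedExpectations is what the ReplicaSet controller checks at the top
// of its sync loop: it only computes a new scale once the counters have
// drained back to zero.
func (c *CountExpectations) SatisfiedExpectations(key string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.items[key]
	return !ok || (e.add == 0 && e.del == 0)
}
```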

Eikykun commented 2 weeks ago

> At a quick glance, it seems that we create an ActiveExpectationItem for each Pod's creation, deletion, or update. I have some concerns about a scalability bottleneck caused by the memory usage. In the ReplicaSet controller's source code, it seems to only track the number of Pods expected to be created or deleted per ReplicaSet.

Sorry, I didn't have time to reply a few days ago.

ActiveExpectationItem is removed after fulfilling its expectation. Therefore, the memory usage depends on how many pods being created or deleted have not yet been synchronized to the cache. It might not actually consume much memory? Also, ControllerExpectations caches each pod's UID: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/controller_utils.go#L364 Therefore, I'm not quite sure which one is lighter, ActiveExpectationItem or ControllerExpectations.

I actually started with ControllerExpectations in RayCluster, but I'm a bit unsure now why I switched to ActiveExpectationItem; perhaps ControllerExpectations was more complicated to use: it requires a PodEventHandler to handle the Observed logic, so RayCluster would need to implement that PodEventHandler logic separately.
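To illustrate the last point, a hedged sketch of the separate Pod event handler that a ControllerExpectations-style approach would need: the informer callbacks must call the *Observed methods so the counters drain. `CountExpectations` is the hypothetical type sketched earlier, and the `ray.io/cluster` label lookup is illustrative; registration via AddEventHandler is omitted because its signature varies across client-go versions.

```go
package countexp

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// podExpectationHandler returns the Pod informer callbacks that drain the
// counters of the hypothetical CountExpectations sketched above.
func podExpectationHandler(exp *CountExpectations) cache.ResourceEventHandlerFuncs {
	// ownerKey maps a Pod event back to its owning cluster's "namespace/name" key.
	ownerKey := func(obj interface{}) (string, bool) {
		if tomb, ok := obj.(cache.DeletedFinalStateUnknown); ok {
			obj = tomb.Obj // unwrap delete tombstones
		}
		pod, ok := obj.(*corev1.Pod)
		if !ok {
			return "", false
		}
		owner := pod.Labels["ray.io/cluster"] // illustrative label
		if owner == "" {
			return "", false
		}
		return pod.Namespace + "/" + owner, true
	}
	return cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if key, ok := ownerKey(obj); ok {
				exp.CreationObserved(key)
			}
		},
		DeleteFunc: func(obj interface{}) {
			if key, ok := ownerKey(obj); ok {
				exp.DeletionObserved(key)
			}
		},
	}
}
```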