kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0

adding timeout for cache sync #618

Closed · emsixteeen closed this 7 months ago

emsixteeen commented 8 months ago

Currently, if informer cache syncing fails (typically because of a lack of permissions), there is no indication of the error: a message is logged indefinitely, the Pod remains in the Running state, and the controller "seems" to be healthy.

This PR adds a timeout to cache syncing, which fails the controller if the caches are not synced within the timeout period.
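
For illustration only, a minimal Go sketch of the idea, assuming a hypothetical `waitForCachesOrDie` helper rather than the exact diff in this PR:

```go
import (
	"context"
	"time"

	"k8s.io/client-go/tools/cache"
	"k8s.io/klog/v2"
)

// waitForCachesOrDie waits for the given informer caches to sync, but gives up
// after the timeout instead of blocking (and logging) forever. The timeout
// value and the exit behavior are illustrative choices, not the PR's exact code.
func waitForCachesOrDie(timeout time.Duration, cacheSyncs ...cache.InformerSynced) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	if ok := cache.WaitForCacheSync(ctx.Done(), cacheSyncs...); !ok {
		klog.Fatal("informer caches did not sync within the timeout; check the controller's RBAC permissions")
	}
}
```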

google-cla[bot] commented 8 months ago

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

google-oss-prow[bot] commented 8 months ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

Once this PR has been reviewed and has the lgtm label, please assign terrytangyuan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

- **[OWNERS](https://github.com/kubeflow/mpi-operator/blob/master/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.

tenzen-y commented 8 months ago

> WDYT @tenzen-y? Maybe a better approach would be to have the readiness probe be aware of whether the reconcilers are running?

@alculquicondor It makes sense. As far as I have explored the timeout settings in the kube-controller-manager and the job-controller, the kube-controller-manager doesn't seem to have such a setting for syncing caches and doesn't create a context with a timeout.

alculquicondor commented 8 months ago

> It makes sense.

Sorry, can you clarify what exactly makes sense for you?

emsixteeen commented 8 months ago

The use case at hand is to fail the controller early if access is denied to resources. Doing this via a timeout might be a bit of a kludge, as it doesn't really address the underlying issue.

After some more digging, it looks like an error handler can be set on informers: emsixteeen/mpi-operator@400b9fff.

Perhaps this is a better approach?
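
Roughly, the shape of that approach (a sketch under assumptions, not necessarily identical to the linked commit; the `failOnPermissionErrors` name is made up here):

```go
import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/tools/cache"
	"k8s.io/klog/v2"
)

// failOnPermissionErrors registers a watch error handler on one informer so
// that authorization failures terminate the controller immediately, while
// transient errors keep client-go's default backoff-and-retry behavior.
func failOnPermissionErrors(informer cache.SharedIndexInformer) error {
	return informer.SetWatchErrorHandler(func(r *cache.Reflector, err error) {
		if apierrors.IsForbidden(err) || apierrors.IsUnauthorized(err) {
			klog.Fatalf("permanent error syncing informer cache: %v", err)
		}
		// Anything else falls back to client-go's default handling.
		cache.DefaultWatchErrorHandler(r, err)
	})
}
```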

alculquicondor commented 8 months ago

I like that approach better, but is there a way to do it just by iterating over all the informers that are instantiated for the SharedInformer?

tenzen-y commented 8 months ago

> > It makes sense.
>
> Sorry, can you clarify what exactly makes sense for you?

I meant that the kube-controller-manager doesn't have such a timeout mechanism in the informer, so I wanted to take the readiness probe approach.

However, setting the error handler to informers would be better than the readiness probe.

emsixteeen commented 8 months ago

> I like that approach better, but is there a way to do it just by iterating over all the informers that are instantiated for the SharedInformer?

Not sure if it's possible to iterate over all the informers.

It seems that every informer returns its own interface, which usually has a method called Informer() that in turn returns the SharedInformer interface (via the SharedIndexInformer interface). It's the SharedInformer interface that provides the SetWatchErrorHandler method.

I.e. I'm not aware of some sort of collection of SharedInformer interfaces that can easily be iterated over... 😕
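
To make the call chain concrete, a sketch using only the standard client-go factory (the project's generated MPIJob informer would be reached the same way through its own Informer() accessor; the function name is hypothetical):

```go
import (
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/klog/v2"
)

// wireWatchErrorHandlers illustrates why there is no single loop available:
// each typed informer is reached individually via Informer(), which returns
// the SharedIndexInformer that exposes SetWatchErrorHandler.
func wireWatchErrorHandlers(client kubernetes.Interface, handler cache.WatchErrorHandler) {
	factory := informers.NewSharedInformerFactory(client, 0)

	for _, inf := range []cache.SharedIndexInformer{
		factory.Core().V1().Pods().Informer(),
		factory.Core().V1().ConfigMaps().Informer(),
		factory.Core().V1().Secrets().Informer(),
	} {
		if err := inf.SetWatchErrorHandler(handler); err != nil {
			klog.Fatalf("failed to set watch error handler: %v", err)
		}
	}
}
```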

alculquicondor commented 8 months ago

Yeah, I don't think they are exposed in the informer factory.

alculquicondor commented 8 months ago

Please update the PR with the error handler

emsixteeen commented 8 months ago

In lieu of adding the error handler to this PR, I created a separate PR for it (#619).

My thought process is that these PRs address two separate "features": this PR adds a timeout / limit for how long the operator will wait for caches to sync, and the other PR will cause the operator to fail if there are permission issues.

To @tenzen-y's point, all of this could technically be handled in a readiness (or startup) probe, and Kubernetes could just fail the Pod if it's unready after some time. However, the current readiness probe limits itself to leader election and is defined in server.go, whereas these two "features" are handled in mpi_job_controller.go.

Because InstallPathHandler can only be called once, some deeper refactoring would be needed to move these two health checks near each other, so as to ensure that InstallPathHandler is only called once.
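
As a rough sketch of what that refactor might look like (names, the `/healthz` path, and the `cachesSynced` hook are assumptions, not the current server.go code):

```go
import (
	"fmt"
	"net/http"

	"k8s.io/apiserver/pkg/server/healthz"
	"k8s.io/client-go/tools/leaderelection"
)

// installHealthChecks registers both the leader-election check and a
// cache-sync check in the single allowed InstallPathHandler call. The
// cachesSynced callback is a hypothetical hook the controller would provide.
func installHealthChecks(mux *http.ServeMux, electionChecker *leaderelection.HealthzAdaptor, cachesSynced func() bool) {
	cacheCheck := healthz.NamedCheck("informer-caches-synced", func(_ *http.Request) error {
		if !cachesSynced() {
			return fmt.Errorf("informer caches have not synced yet")
		}
		return nil
	})
	healthz.InstallPathHandler(mux, "/healthz", electionChecker, cacheCheck)
}
```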

These two "features" are probably valid on their own, regardless of the readiness probe... (?)

alculquicondor commented 7 months ago

Assuming there are no authorization errors, I don't see a reason to have a timeout for cache sync.

It should eventually finish.

/close

google-oss-prow[bot] commented 7 months ago

@alculquicondor: Closed this PR.

In response to [this](https://github.com/kubeflow/mpi-operator/pull/618#issuecomment-1927224913):

> Assuming there are no authorization errors, I don't see a reason to have a timeout for cache sync.
>
> It should eventually finish.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

emsixteeen commented 7 months ago

Fair enough! This was an initial attempt to address an issue with the wrong implementation. :blush: