kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0
3.62k stars, 1.63k forks

fix(backend): handle client side HTTP timeouts to fix crashes of metadata-writer. Fixes #8200 #11361

Open. OutSorcerer opened 2 weeks ago

OutSorcerer commented 2 weeks ago

Description of your changes:

Checklist:

google-oss-prow[bot] commented 2 weeks ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

Once this PR has been reviewed and has the lgtm label, please assign ark-kun for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

- **[backend/metadata_writer/OWNERS](https://github.com/kubeflow/pipelines/blob/master/backend/metadata_writer/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment.

Approvers can cancel approval by writing `/approve cancel` in a comment.

google-oss-prow[bot] commented 2 weeks ago

Hi @OutSorcerer. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
hbelmiro commented 2 weeks ago

/ok-to-test

ishaan-mehta commented 2 days ago

Is there an estimate on when this could be approved and merged, and/or is there anything I can do to help? Just curious as my deployment is running into the same issue.

This is more severe than just a pod restarting repeatedly, as when the pod is down, Kubeflow is seemingly unable to properly authorize users for namespaces.

thesuperzapper commented 1 day ago

@kubeflow/pipelines maintainers, can we get some eyes on this important PR? It needs some work, but it fixes a critical issue that prevents the KFP metadata-writer from working on some Kubernetes distros.

For more context, please see my comment here:

But the issue is simply that some TCP sockets time out while we are watching Kubernetes resources from Python.
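
For readers following along, here is a minimal sketch of the general pattern (not the code in this PR): treat a client-side read timeout as a signal to reopen the watch rather than letting the exception kill the process. The names `watch_with_reconnect`, `open_stream`, and `handle_event` are hypothetical and used only for illustration. With the Kubernetes Python client, `open_stream` would presumably wrap a `watch.Watch().stream(...)` call that passes `_request_timeout`, and `timeout_errors` would likely include `urllib3.exceptions.ReadTimeoutError`; both are assumptions about how this would be wired up, so the sketch sticks to the standard library.

```python
import time
from typing import Callable, Iterable, Tuple, Type


def watch_with_reconnect(
    open_stream: Callable[[], Iterable[dict]],
    handle_event: Callable[[dict], None],
    timeout_errors: Tuple[Type[BaseException], ...] = (TimeoutError,),
    max_reconnects: int = 3,
    backoff_seconds: float = 0.0,
) -> int:
    """Consume watch events, reopening the stream after a client-side timeout.

    Returns the number of reconnects performed once the stream ends cleanly.
    Re-raises the timeout if it happens more than `max_reconnects` times.
    """
    reconnects = 0
    while True:
        try:
            for event in open_stream():
                handle_event(event)
            # The stream was closed without an error, so we are done.
            return reconnects
        except timeout_errors:
            if reconnects >= max_reconnects:
                # Give up and surface the error instead of looping forever.
                raise
            reconnects += 1
            time.sleep(backoff_seconds)


if __name__ == "__main__":
    attempts = {"n": 0}

    def fake_stream():
        # Hypothetical stand-in for a watch: the first attempt times out.
        attempts["n"] += 1
        if attempts["n"] == 1:
            yield {"type": "ADDED", "name": "pod-a"}
            raise TimeoutError("read timed out")
        yield {"type": "MODIFIED", "name": "pod-a"}

    seen = []
    count = watch_with_reconnect(fake_stream, seen.append)
    print(count, [e["type"] for e in seen])
```

A reopened watch may replay events that were already delivered, so `handle_event` should be safe to call more than once for the same object.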