Open OutSorcerer opened 2 weeks ago
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign ark-kun for approval. For more information see the Kubernetes Code Review Process.
Hi @OutSorcerer. Thanks for your PR.
I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
/ok-to-test
Is there an estimate on when this could be approved and merged, and/or is there anything I can do to help? Just curious as my deployment is running into the same issue.
This is more severe than just a pod restarting repeatedly, as when the pod is down, Kubeflow is seemingly unable to properly authorize users for namespaces.
@kubeflow/pipelines maintainers, can we get some eyes on this PR? It needs some work, but it is important because it fixes a critical issue that prevents the KFP metadata-writer from working on some Kubernetes distros.
For more context, please see my comment here:
But the issue is simply that some TCP sockets time out while we are watching Kubernetes resources from Python.
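To make the failure mode concrete, here is a minimal sketch (not the code from this PR) of wrapping a watch loop so that a read timeout reopens the stream instead of crashing the process. The helper name resilient_events and its parameters are hypothetical; the commented wiring at the bottom assumes the kubernetes and urllib3 packages and an existing CoreV1Api client.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple, Type


def resilient_events(
    open_stream: Callable[[], Iterable],
    retry_on: Tuple[Type[BaseException], ...],
    max_restarts: int = 5,
    pause_seconds: float = 1.0,
) -> Iterator:
    """Yield events from open_stream(), reopening it after a retryable error.

    Hypothetical helper for illustration only. Restarts are counted over the
    whole lifetime of the generator; once max_restarts is exceeded, the last
    error is re-raised.
    """
    restarts = 0
    while True:
        try:
            for event in open_stream():
                yield event
            return  # the stream ended normally, so stop watching
        except retry_on:
            restarts += 1
            if restarts > max_restarts:
                raise
            time.sleep(pause_seconds)  # brief pause before reconnecting


# Hypothetical wiring (assumes kubernetes and urllib3 are installed and that
# core_v1 is a kubernetes.client.CoreV1Api instance):
#
#   import kubernetes
#   import urllib3
#
#   k8s_watch = kubernetes.watch.Watch()
#   events = resilient_events(
#       lambda: k8s_watch.stream(core_v1.list_namespaced_pod, namespace="kubeflow"),
#       retry_on=(urllib3.exceptions.ReadTimeoutError,),
#   )
#   for event in events:
#       ...
```

Reopening the stream may replay events that were already seen, so whatever consumes them should tolerate duplicates.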
Description of your changes:
Handle exceptions (urllib3.exceptions.ReadTimeoutError) of k8s_watch.stream to prevent crashes of the metadata-writer pod. Without this, the metadata-writer pod fails when a connection error causes a client timeout. This should fix #8200.
Checklist: