Open demarna1 opened 2 weeks ago
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please assign rimolive for approval. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Hi @demarna1. Thanks for your PR.
I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test
on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test
label.
I understand the commands that are listed here.
/ok-to-test
Also @demarna1, can you please link the PR to the issue?
@hbelmiro linked the PR to the issue.
The LastProbeTime and LastTransitionTime fields in the ScheduledWorkflow Status are unused by Kubeflow so it is safe to set these fields to 0 (...) A long-term plan for these fields should be determined (it may be best to remove them from the CRD entirely).
Any reason for not removing them right now?
My first priority is addressing the ETCD performance issue and I didn't want a CRD change to delay it. But I see no reason we can't remove them and I'd be happy to do that in a follow-on PR!
@hbelmiro do you happen to know why the ok-to-test
label is no longer triggering the workflows / CI? It used to be sufficient as recently as a few weeks ago.
@hbelmiro do you happen to know why the
ok-to-test
label is no longer triggering the workflows / CI? It used to be sufficient as recently as a few weeks ago.
@droctothorpe I don't know :( It seems like something has changed in the repo's permissions. The following used to work for first-time contributors.
/rerun-all /ok-to-test
Thanks, @hbelmiro! @HumairAK @zijianjoy do you happen to know if this change was intentional? It's out of sync with this documentation about membership privileges.
Bump.
Goal
Fix high ETCD usage of Kubeflow ScheduledWorkflows. Closes https://github.com/kubeflow/pipelines/issues/8757
Context
Every time the ScheduledWorkflow controller syncs a SWF resource, it updates the Last Heartbeat Time and Last Transition Time to the current time in the status block.
These heartbeat updates result in an infinite reconciliation loop:
Description of the fix
The LastProbeTime and LastTransitionTime fields in the ScheduledWorkflow Status are unused by Kubeflow so it is safe to set these fields to 0 for now in order to fix the ETCD performance issues (which for us has resulted in ETCD outages). By keeping these fields constant, the object can be reconciled and the writes to ETCD stop. The schedules continue to function as before. Verbose logging is significantly reduced in several pods. A long-term plan for these fields should be determined (it may be best to remove them from the CRD entirely).
ETCD performance before & after
I measured ETCD bytes written for all resources on our cluster over a 10 minute time span. Once this fix was instituted, we saw a dramatic decrease in ETCD usage (see chart below).
The chart roughly agrees with the back-of-the-napkin math:
Checklist: