kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0
3.62k stars 1.63k forks source link

fix(backend): stop heartbeat status updates for ScheduledWorkflows. Fixes #8757 #11363

Open demarna1 opened 2 weeks ago

demarna1 commented 2 weeks ago

Goal

Fix high ETCD usage of Kubeflow ScheduledWorkflows. Closes https://github.com/kubeflow/pipelines/issues/8757

Context

Every time the ScheduledWorkflow controller syncs a SWF resource, it updates the Last Heartbeat Time and Last Transition Time to the current time in the status block.

Status:
  Conditions:
    Last Heartbeat Time:   2024-11-07T11:16:33Z
    Last Transition Time:  2024-11-07T11:16:33Z
    Message:               The schedule is disabled.
    Reason:                Disabled
    Status:                True
    Type:                  Disabled

These heartbeat updates result in an infinite reconciliation loop:

Description of the fix

The LastProbeTime and LastTransitionTime fields in the ScheduledWorkflow Status are unused by Kubeflow so it is safe to set these fields to 0 for now in order to fix the ETCD performance issues (which for us has resulted in ETCD outages). By keeping these fields constant, the object can be reconciled and the writes to ETCD stop. The schedules continue to function as before. Verbose logging is significantly reduced in several pods. A long-term plan for these fields should be determined (it may be best to remove them from the CRD entirely).

ETCD performance before & after

I measured ETCD bytes written for all resources on our cluster over a 10 minute time span. Once this fix was instituted, we saw a dramatic decrease in ETCD usage (see chart below).

The chart roughly agrees with the back-of-the-napkin math:

etcd

Checklist:

google-oss-prow[bot] commented 2 weeks ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please assign rimolive for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files: - **[backend/OWNERS](https://github.com/kubeflow/pipelines/blob/master/backend/OWNERS)** Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
google-oss-prow[bot] commented 2 weeks ago

Hi @demarna1. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
droctothorpe commented 2 weeks ago

/ok-to-test

hbelmiro commented 2 weeks ago

Also @demarna1, can you please link the PR to the issue?

demarna1 commented 2 weeks ago

@hbelmiro linked the PR to the issue.

The LastProbeTime and LastTransitionTime fields in the ScheduledWorkflow Status are unused by Kubeflow so it is safe to set these fields to 0 (...) A long-term plan for these fields should be determined (it may be best to remove them from the CRD entirely).

Any reason for not removing them right now?

My first priority is addressing the ETCD performance issue and I didn't want a CRD change to delay it. But I see no reason we can't remove them and I'd be happy to do that in a follow-on PR!

droctothorpe commented 1 week ago

@hbelmiro do you happen to know why the ok-to-test label is no longer triggering the workflows / CI? It used to be sufficient as recently as a few weeks ago.

hbelmiro commented 1 week ago

@hbelmiro do you happen to know why the ok-to-test label is no longer triggering the workflows / CI? It used to be sufficient as recently as a few weeks ago.

@droctothorpe I don't know :( It seems like something has changed in the repo's permissions. The following used to work for first-time contributors.

/rerun-all /ok-to-test

droctothorpe commented 1 week ago

Thanks, @hbelmiro! @HumairAK @zijianjoy do you happen to know if this change was intentional? It's out of sync with this documentation about membership privileges.

droctothorpe commented 3 days ago

Bump.