kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

[backend] Performance issue: ScheduledWorkflow is taking significant amount of etcd storage #8757

Open deepk2u opened 1 year ago

deepk2u commented 1 year ago

Environment

Steps to reproduce

We have around 125 recurring runs within a single namespace. After a few months of historical runs, we have started seeing performance issues in the k8s cluster.

After digging deeper, we found that calls to etcd were timing out. When we inspected the objects in the etcd database, we found that one particular namespace, which has 125 recurring runs, is taking 996 MB of etcd space.

Some data to look at:

Entries by 'KEY GROUP' (total 1.6 GB):
+--------------------------------------------------------+-------------------+--------+
| KEY GROUP                                              | KIND              | SIZE   |
+--------------------------------------------------------+-------------------+--------+
| /registry/kubeflow.org/scheduledworkflows/<namespace1> | ScheduledWorkflow | 996 MB |
| /registry/kubeflow.org/scheduledworkflows/<namespace2> | ScheduledWorkflow | 211 MB |
| /registry/kubeflow.org/scheduledworkflows/<namespace3> | ScheduledWorkflow | 118 MB |
| .....                                                  |                   |        |
+--------------------------------------------------------+-------------------+--------+

- namespace1 has 123 recurring runs
- namespace2 has 40 recurring runs
- namespace3 has 63 recurring runs
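
To put the reported figures in perspective, the sizes can be divided by the recurring-run counts. This is only back-of-the-envelope arithmetic on the numbers quoted in this report; it assumes each recurring run corresponds to one ScheduledWorkflow object and that the key-group size is spread evenly across them, neither of which the report confirms. The function name is illustrative.

```python
# Sizes (in MB) and recurring-run counts exactly as quoted in this report.
reported = {
    "namespace1": (996, 123),
    "namespace2": (211, 40),
    "namespace3": (118, 63),
}


def mb_per_recurring_run(size_mb: float, runs: int) -> float:
    """Average etcd footprint of one recurring run, in MB (assumes an even spread)."""
    return size_mb / runs


for namespace, (size_mb, runs) in reported.items():
    average = mb_per_recurring_run(size_mb, runs)
    print(f"{namespace}: {average:.2f} MB per recurring run")
```

Under those assumptions the averages come out at roughly 8 MB, 5 MB, and 2 MB per recurring run, so the footprint varies noticeably between namespaces and is not explained by the number of recurring runs alone.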

Expected result

It looks like we are storing a lot of unnecessary information in the ScheduledWorkflow object, which ends up taking space in the etcd database and causing these performance issues.

Materials and Reference


Impacted by this bug? Give it a 👍.

connor-mccarthy commented 1 year ago

/assign @gkcalat

gkcalat commented 1 year ago

Hi @deepk2u! It may be due to insufficient resource provisioning or a lack of etcd maintenance (see here). How long did it take for you to reach these numbers?

deepk2u commented 1 year ago

It's an EKS cluster. We contacted AWS support; they maintain the cluster, run defragmentation, and handle all other etcd maintenance.

The oldest object I have on the list is from 14th July 2022.

gkcalat commented 1 year ago

Can you check how large the pipeline manifests used in these recurring runs are?
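
One way to answer this is to measure the serialized size of each ScheduledWorkflow object and see which top-level field dominates. The following is a minimal sketch, not an official KFP tool: it assumes the objects were exported with `kubectl get scheduledworkflows -n <namespace> -o json > swf.json`, and it uses compact JSON length as an approximation of what etcd stores. The function names `object_sizes` and `report` are made up for this example.

```python
import json


def _json_bytes(value) -> int:
    """Size in bytes of a value when serialized as compact UTF-8 JSON."""
    return len(json.dumps(value, separators=(",", ":")).encode("utf-8"))


def object_sizes(listing: dict) -> list:
    """Return (name, total_bytes, bytes_per_top_level_field) rows, largest first.

    `listing` is the parsed output of `kubectl get ... -o json`, i.e. a dict
    with an "items" list.
    """
    rows = []
    for item in listing.get("items", []):
        name = item.get("metadata", {}).get("name", "<unnamed>")
        per_field = {field: _json_bytes(value) for field, value in item.items()}
        rows.append((name, _json_bytes(item), per_field))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


def report(listing: dict, top: int = 10) -> None:
    """Print the `top` largest objects with a per-field size breakdown."""
    for name, total, per_field in object_sizes(listing)[:top]:
        ordered = sorted(per_field.items(), key=lambda kv: kv[1], reverse=True)
        breakdown = ", ".join(f"{field}={size}" for field, size in ordered)
        print(f"{name}: {total} bytes ({breakdown})")


# Example usage, assuming swf.json was exported as described above:
#     with open("swf.json") as handle:
#         report(json.load(handle))
```

If `spec` dominates, the embedded pipeline manifest is the likely culprit; if `status` dominates, the growth is more likely coming from accumulated run history.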

github-actions[bot] commented 12 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 9 months ago

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

kuldeepjain commented 7 months ago

/reopen

google-oss-prow[bot] commented 7 months ago

@kuldeepjain: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to [this](https://github.com/kubeflow/pipelines/issues/8757#issuecomment-1878046951):

> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

deepk2u commented 7 months ago

/reopen

google-oss-prow[bot] commented 7 months ago

@deepk2u: Reopened this issue.

In response to [this](https://github.com/kubeflow/pipelines/issues/8757#issuecomment-1878055335):

> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

rimolive commented 4 months ago

Closing this issue. No activity for more than a year.

/close

google-oss-prow[bot] commented 4 months ago

@rimolive: Closing this issue.

In response to [this](https://github.com/kubeflow/pipelines/issues/8757#issuecomment-2035009457):

> Closing this issue. No activity for more than a year.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

rimolive commented 2 months ago

/reopen

We found this issue in KFP 2.0.5. We'll work on a pruning mechanism for pipeline run k8s objects.
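
As an illustration of what a retention-based pruning policy could look like, here is a minimal sketch. It is not the mechanism the maintainers have in mind, which this thread does not describe; it only assumes that each object carries the standard Kubernetes `metadata.name` and `metadata.creationTimestamp` fields, and the function names and the `keep_latest` parameter are hypothetical.

```python
from datetime import datetime


def parse_timestamp(value: str) -> datetime:
    """Parse an RFC 3339 timestamp such as '2022-07-14T00:00:00Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def select_for_pruning(items: list, keep_latest: int) -> list:
    """Return names of objects to delete, keeping the `keep_latest` newest.

    `items` is a list of Kubernetes-style dicts with metadata.name and
    metadata.creationTimestamp. Nothing is deleted here; the caller decides
    what to do with the returned names.
    """
    if keep_latest < 0:
        raise ValueError("keep_latest must be non-negative")
    newest_first = sorted(
        items,
        key=lambda item: parse_timestamp(item["metadata"]["creationTimestamp"]),
        reverse=True,
    )
    return [item["metadata"]["name"] for item in newest_first[keep_latest:]]
```

Separating selection from deletion keeps the policy easy to test and makes a dry run trivial: print the returned names before deleting anything.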

google-oss-prow[bot] commented 2 months ago

@rimolive: Reopened this issue.

In response to [this](https://github.com/kubeflow/pipelines/issues/8757#issuecomment-2174473608):

> /reopen
>
> We found this issue in KFP 2.0.5. We'll work on a pruning mechanism for pipeline run k8s objects.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

github-actions[bot] commented 6 days ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.