[backend] ml-pipeline-visualizationserver and ml-pipeline-ui-artifact per user namespace resource allocation

kubeflow / pipelines

[backend] ml-pipeline-visualizationserver and ml-pipeline-ui-artifact per user namespace resource allocation #9555

Open andre-lx opened 1 year ago

andre-lx commented 1 year ago

Environment

How did you deploy Kubeflow Pipelines (KFP)? Kubeflow deployment
KFP version: 1.8.2

Steps to reproduce

I could not find an existing report of this issue.

Currently, two pods (ml-pipeline-visualizationserver and ml-pipeline-ui-artifact) are created in each user namespace. The pipelines run smoothly, but these two pods per namespace make our Kubernetes cluster considerably more expensive.

Imagine the following scenario:

500 users, each one with their own namespace, 2 pods per namespace = 1000 pods
each node runs up to 100 pods

That requires up to 10 nodes just to host these 2 pods per user, even for users who never use pipelines.
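
The arithmetic in this scenario can be written out as a small sketch. The function name `nodes_needed` and its parameters are illustrative only, and the sketch assumes the per-node pod limit is the only constraint (CPU and memory requests are ignored):

```python
def nodes_needed(namespaces: int, pods_per_namespace: int, pods_per_node: int) -> int:
    """Return how many nodes are needed just to host the per-namespace pods."""
    total_pods = namespaces * pods_per_namespace
    # Integer ceiling division: a partially filled node still counts as a node.
    return -(-total_pods // pods_per_node)


# Scenario from the issue: 500 namespaces, 2 pods each, up to 100 pods per node.
print(nodes_needed(500, 2, 100))  # 1000 pods / 100 pods per node -> 10
```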

A practical example:

> kubectl get pods --no-headers -A -o wide | grep ip-xx-xx-xx-xx.xx-west-2.compute.internal | wc -l
110
> kubectl get pods --no-headers -A -o wide | grep ip-xx-xx-xx-xx.xx-west-2.compute.internal | grep ml-pipeline | wc -l
100
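
The same count can be taken for every node at once. The following is a rough sketch, assuming the pod list was exported with `kubectl get pods -A -o json`; the helper name `count_pods_per_node` and the sample pod and node names are made up for illustration. It matches on the pod name only, whereas the `grep` above matches anywhere in the output line.

```python
from collections import Counter
from typing import Any, Dict, Tuple


def count_pods_per_node(
    pod_list: Dict[str, Any], name_fragment: str = "ml-pipeline"
) -> Tuple[Counter, Counter]:
    """Count all pods and pods whose name contains name_fragment, per node."""
    total: Counter = Counter()
    matching: Counter = Counter()
    for pod in pod_list.get("items", []):
        node = pod.get("spec", {}).get("nodeName")
        if node is None:
            # Pods that are not scheduled yet have no node assigned.
            continue
        total[node] += 1
        if name_fragment in pod.get("metadata", {}).get("name", ""):
            matching[node] += 1
    return total, matching


# Hypothetical sample in the shape of `kubectl get pods -A -o json` output.
sample = {
    "items": [
        {"metadata": {"name": "ml-pipeline-ui-artifact-abc"}, "spec": {"nodeName": "node-a"}},
        {"metadata": {"name": "ml-pipeline-visualizationserver-def"}, "spec": {"nodeName": "node-a"}},
        {"metadata": {"name": "notebook-0"}, "spec": {"nodeName": "node-a"}},
        {"metadata": {"name": "pending-pod"}, "spec": {}},
    ]
}
total, matching = count_pods_per_node(sample)
for node in total:
    print(node, total[node], matching[node])
```

With real data, the `sample` dictionary would be replaced by the result of `json.load` on the exported file.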

My question is: what can we do to reduce these costs? For example, is there a way to avoid creating the pods, or to create them only when they are needed?

Expected result

Since these pods consume a lot of resources unnecessarily, there should be a way to improve this.

Thanks

Impacted by this bug? Give it a 👍.

connor-mccarthy commented 1 year ago

/assign @zijianjoy

zijianjoy commented 1 year ago

Thank you @andre-lx, the concern makes sense when there is a large number of namespaces. I am reading the past design decision in https://docs.google.com/document/d/1YNxKUbJLnBRL7DbPn76fsShkQx5Q5jTc-iXfLmLt1FU/edit. The concern is over-granting permissions to a single service account. If you would like, I think the current workaround is to avoid creating the visualization server and artifact fetcher by modifying the profile controller. This removes the ability to download artifacts and use TensorBoard, but I think it can mitigate the issue in the short term.
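
As a rough illustration of that workaround, the sketch below filters the two per-namespace components out of a list of child manifests. It assumes the profile controller assembles its desired resources as a list of manifest dictionaries; the function name `drop_per_namespace_servers` and the sample manifests are hypothetical and not taken from the KFP source. Only the two resource names come from this issue.

```python
from typing import Any, Dict, List

# Names of the two per-namespace components discussed in this issue.
PER_NAMESPACE_SERVERS = {
    "ml-pipeline-visualizationserver",
    "ml-pipeline-ui-artifact",
}


def drop_per_namespace_servers(children: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return the manifests whose metadata.name is not one of the two servers."""
    return [
        manifest
        for manifest in children
        if manifest.get("metadata", {}).get("name") not in PER_NAMESPACE_SERVERS
    ]


# Hypothetical list of manifests for one user namespace.
desired = [
    {"kind": "Deployment", "metadata": {"name": "ml-pipeline-visualizationserver"}},
    {"kind": "Service", "metadata": {"name": "ml-pipeline-ui-artifact"}},
    {"kind": "ConfigMap", "metadata": {"name": "example-config"}},
]
print([m["metadata"]["name"] for m in drop_per_namespace_servers(desired)])
```

Filtering by name removes every kind of resource that carries one of the two names, so Deployments and Services are dropped together. As noted above, artifact download and TensorBoard stop working once these resources are gone.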

andre-lx commented 1 year ago

Hi @zijianjoy

Thanks for the quick turnaround.

This is indeed one possible short-term solution, and we will be using it.

Unfortunately, as you mentioned, this removes the ability to download artifacts, and the artifacts page does not load, so I hope this gets solved in a future release.

Thanks again, André

juliusvonkohout commented 1 year ago

@zijianjoy Luckily that is not true. You can easily disable the deprecated visualization server and switch the ml-pipeline UI so that it no longer uses the resource-hogging artifact proxy. It can use MinIO directly by changing one environment variable. So both components are unnecessary.

To make this secure, only the namespace parameter has to be enforced in the UI, as explained here: https://github.com/kubeflow/pipelines/issues/8406#issuecomment-1640918121

andre-lx commented 1 year ago

Hi @juliusvonkohout. This makes sense.

For now, on version 1.8.5, is there any workaround for this issue, that is, a way to use the artifacts without the two pods per namespace?

Thanks

juliusvonkohout commented 1 year ago

@andre-lx I can help with the open source implementation, but solving this for a single user is more of a paid consulting question ;-). If you want that, reach out on Slack. As a hint: it is doable in Kubeflow 1.7, but it is still as insecure as the current situation. You can put this on the agenda for the next KFP meeting or order consulting.

github-actions[bot] commented 11 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

juliusvonkohout commented 11 months ago

This issue is only becoming more relevant and is definitely not stale.

github-actions[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

juliusvonkohout commented 8 months ago

not stale

github-actions[bot] commented 5 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

juliusvonkohout commented 5 months ago

Not stale.

juliusvonkohout commented 3 months ago

@zijianjoy @rimolive can you freeze the lifecycle of the Issue? It is still relevant.

rimolive commented 3 months ago

Sure, @juliusvonkohout

/lifecycle frozen