kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

[frontend] UI is failing to render large pipeline DAGs with "no graph to show" error #10011

Closed: sachdevayash1910 closed this issue 5 months ago

sachdevayash1910 commented 11 months ago

What we are trying to do

We are trying to execute pipelines with 400-500 components. On average each component has 10-15 inputs/outputs, but some have close to a hundred. This results in a Workflow size of about 700 KB, which only grows as ParallelFor loops fan out, depending on which use case a particular pipeline is running (a sketch of the fan-out pattern follows below).
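For context, a minimal sketch of the fan-out pattern, using KFP v2 SDK syntax (the component name and loop items are illustrative, not from our actual pipelines): each ParallelFor iteration adds nodes and edges to the compiled Workflow, so the manifest grows with the size of the loop's input list.

```python
from typing import List

from kfp import dsl


@dsl.component
def process_shard(shard: int) -> int:
    # Stand-in for a real component with many inputs/outputs.
    return shard * 2


@dsl.pipeline(name="fan-out-example")
def fan_out_pipeline(shards: List[int] = [1, 2, 3]):
    # Each iteration becomes additional nodes/edges in the Argo Workflow,
    # so the serialized manifest grows with len(shards).
    with dsl.ParallelFor(items=shards) as shard:
        process_shard(shard=shard)
```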

Expected result

The pipelines should run to completion, and we should be able to see the graph in the UI.

Issue we are seeing

(screenshot: the run details page shows "No graph to show" instead of the pipeline DAG)

We had seen the same issue earlier, when our pipelines were smaller. At the time we built the ml-pipeline-ui image from the master branch, which included the fix from https://github.com/kubeflow/pipelines/pull/8343. We are currently running commit 1bed63a31e7ac5e7ba122a6695f9aa40449a22aa of the master branch of the pipelines repo and had not run into any issues since February. However, our pipelines are now much larger and we have hit this issue again. To be on the safe side, I tried upgrading the UI image to version 2.0.0-alpha7, but as I understand it, that version contains the same fix I had already deployed, which is why the upgrade did not resolve the issue. Would appreciate any input on how to resolve this; it is blocking us from running pipelines beyond a certain size.

Additional context:

Our pipelines already exceed Argo's workflow size limit, and we started receiving the error `workflow is longer than maximum allowed size. compressed size 1055604 > maxSize 1048576` (the 1048576-byte cap is 1 MiB, the limit for a single Kubernetes resource stored in etcd). This is the same as this issue: https://github.com/awslabs/kubeflow-manifests/issues/767

To work around this, I was able to turn on the workflow node status offloading feature provided by Argo: https://argoproj.github.io/argo-workflows/offloading-large-workflows/
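For anyone else hitting this, here is a rough sketch of how the setting can be flipped programmatically. It assumes a standard Kubeflow install where Argo's workflow-controller-configmap lives in the kubeflow namespace and keeps its persistence settings under a top-level persistence key (Argo also supports a single config key; adjust for your install). Offloading additionally requires that persistence (a database) is already configured.

```python
# Sketch: enable Argo node-status offloading by editing the controller configmap.
import yaml
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

cm = core.read_namespaced_config_map(
    name="workflow-controller-configmap", namespace="kubeflow")
persistence = yaml.safe_load(cm.data.get("persistence") or "{}") or {}
persistence["nodeStatusOffLoad"] = True
cm.data["persistence"] = yaml.safe_dump(persistence)

core.replace_namespaced_config_map(
    name="workflow-controller-configmap", namespace="kubeflow", body=cm)
# Restart the workflow-controller Deployment for the change to take effect.
```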

This actually allowed the pipelines to run successfully (they were previously stuck or failing due to the size error), but we then started to see these UI issues. @zijianjoy @chensun would appreciate your input.

Impacted by this bug? Give it a 👍.

zijianjoy commented 11 months ago

@droctothorpe has a PR for this fix: https://github.com/kubeflow/pipelines/pull/9351.

sachdevayash1910 commented 11 months ago

Thanks @zijianjoy, I will check it out. But doesn't that PR just surface the error message? Our pipelines aren't exceeding 1000 nodes, and yet we don't see a graph.

zijianjoy commented 11 months ago

@sachdevayash1910 You can also open the Developer console in your browser to see if there is any error shown on the console.

Alternatively, you can also share the pipeline template here so we can reproduce.

droctothorpe commented 11 months ago

The error also surfaces if you have 1000 connections/edges in your graph, even if you have fewer than 1000 nodes.

droctothorpe commented 11 months ago

One option is to break your pipeline up into sub-pipelines and have the last component of the first pipeline trigger the second, and so on (see the sketch below).

Another option is to watch pipelines in the Argo Workflows UI, the Argo CLI, or K9s. Not as pretty as the KFP frontend, though.
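A hypothetical sketch of the chaining approach with the KFP v2 SDK (the in-cluster endpoint, the component bodies, and the package path are all illustrative assumptions):

```python
from kfp import dsl


@dsl.component
def final_step() -> str:
    # Stand-in for the real last component of the first sub-pipeline.
    return "done"


@dsl.component(packages_to_install=["kfp"])
def trigger_next(host: str, package_path: str):
    # Kicks off the next sub-pipeline once this one reaches its last step.
    import kfp

    client = kfp.Client(host=host)
    client.create_run_from_pipeline_package(
        pipeline_file=package_path, arguments={})


@dsl.pipeline(name="part-one")
def part_one():
    last = final_step()
    trigger = trigger_next(
        host="http://ml-pipeline.kubeflow.svc.cluster.local:8888",  # assumed
        package_path="/tmp/part_two.yaml",  # assumed to exist in the image
    )
    trigger.after(last)
```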

TristanGreathouse commented 11 months ago

@droctothorpe @zijianjoy I work with @sachdevayash1910 and am one of the primary developers of our pipelines. We definitely have more than 1000 edges in some of our larger DAGs, and could potentially exceed 1000 nodes depending on the use case and how the pipelines are configured.

Why is there specifically a 1000 edge and/or node limit for the KF UI? Is there any way this can be increased or are there any plans to fix this in the future?

We can't share pipelines built with our images, but if it would help debugging, we could put together a dummy pipeline with the same inputs and outputs for every component, running generic images and minimal code, to reproduce the UI issue. Would this be helpful?

zijianjoy commented 11 months ago

@TristanGreathouse If the number of edges and nodes is too large, there is a chance that the web page will freeze because it fails to render such a large graph. One thing to consider is packaging a pipeline as a component, i.e. a SubDAG. This reduces the number of nodes and edges in each rendering: https://www.kubeflow.org/docs/components/pipelines/v2/pipelines/pipeline-basics/#pipelines-as-components
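A minimal sketch of the pipelines-as-components pattern from the linked docs (names are illustrative, KFP v2 SDK assumed):

```python
from kfp import dsl


@dsl.component
def add_one(x: int) -> int:
    return x + 1


# An inner pipeline: used inside another pipeline, it renders as a single
# collapsed SubDAG node until the user expands it.
@dsl.pipeline(name="inner")
def inner_pipeline(x: int) -> int:
    return add_one(x=x).output


@dsl.pipeline(name="outer")
def outer_pipeline(x: int = 0):
    first = inner_pipeline(x=x)
    inner_pipeline(x=first.output)
```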

If the UI can handle more than 1000 nodes and edges, feel free to increase the limit by creating a PR.

TristanGreathouse commented 11 months ago

@zijianjoy Sub-DAGs in KF are something we've wanted for quite a while, so I'm very glad to see they've been released. This is definitely our preferred way to fix the problem; however, we're running into some snags testing out the examples.

I tried to upload the toy pipeline from the docs. To compile it I upgraded to kfp==2.3.0 (up from 1.8.21). However, when I went to upload the pipeline in the KF UI, I got the error screenshotted below. I also attempted to upload pipelines and start runs from a template using the KFP client, but we get the following warning before our client fails to connect to the cluster. Our current KF Pipelines backend is 2.0.0-alpha.5, which ships with KF 1.6.1.

/home/inferno/miniconda/lib/python3.8/site-packages/kfp/client/client.py:158: FutureWarning: This client only works with Kubeflow Pipeline v2.0.0-beta.2 and later versions.

Do we need to install a V2 backend in order to upload and run pipelines compiled with the V2 SDK? If so, which backend version should we use? We tried to reference the docs for installation, but they just say "This page will be available soon". Any guidance on V2-compatible versioning and installation documentation would be greatly appreciated, as we're eager to test out V2 sub-DAG functionality.

(screenshot: pipeline upload error in the KFP UI, taken 2023-09-25)

CC: @sachdevayash1910

noodleai commented 11 months ago

Hi all, after applying the Argo fix suggested by @sachdevayash1910 (setting nodeStatusOffLoad: true), our larger pipelines now execute. However, these offloaded pipelines fail to render in the UI.

After some Chrome dev-tools debugging (not a UI developer by any means), I found the following:

  1. The variable graph is present for a non-offloading pipeline (screenshot).

  2. However, it is undefined for a larger, offloading one (screenshot).

  3. Digging a bit deeper into the RunDetails.tsx file, I found that the following must be true: workflow && workflow.status && workflow.status.nodes. For the offloaded workflow, workflow.status has no nodes entry, but it does have an offloadNodeStatusVersion entry (screenshot).

  4. For a non-offloading pipeline, workflow.status does have a nodes entry, but no offloadNodeStatusVersion (screenshot).

@zijianjoy any pointers on how we can get the nodes into workflow when using nodeStatusOffLoad: true?

zijianjoy commented 11 months ago

nodeStatusOffLoad is an Argo feature that we haven't supported yet. If you would like to contribute, you would need to identify where Argo stores the uncompressed graph.
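One possible lead (my assumption, not a confirmed design): the Argo Server API hydrates offloaded workflows, re-attaching status.nodes from the persistence database before returning them, so the frontend could fetch the hydrated object instead of reading the Workflow CR directly. A rough Python sketch; the argo-server address is a placeholder:

```python
# Sketch: fetch a workflow through the Argo Server, which hydrates offloaded
# node status from the persistence DB before returning it.
import requests

ARGO_SERVER = "http://argo-server.kubeflow.svc.cluster.local:2746"  # assumed

def get_workflow_nodes(namespace: str, name: str) -> dict:
    resp = requests.get(f"{ARGO_SERVER}/api/v1/workflows/{namespace}/{name}")
    resp.raise_for_status()
    status = resp.json().get("status", {})
    # A raw offloaded Workflow CR only carries offloadNodeStatusVersion;
    # the hydrated response carries the full nodes map the UI needs.
    return status.get("nodes", {})
```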

github-actions[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 5 months ago

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.