kubeflow / kfp-tekton

Kubeflow Pipelines on Tekton
https://developer.ibm.com/blogs/kubeflow-pipelines-with-tekton-and-watson/
Apache License 2.0

Request: Ability to pass large pandas dataframes between pipeline components (without creating artifacts) #725

Open joeswashington opened 3 years ago

joeswashington commented 3 years ago

We would like the ability to pass the results of a pandas dataframe operation from one pipeline component to another without having to create an input/output artifact.

As it stands, we have to write the dataframe to a CSV file in one component and read it back in the other component, which is slow.
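For context, the pattern being described is the standard KFP approach: serialize the dataframe to a file artifact in the producer and deserialize it in the consumer. A minimal sketch using the KFP v1 lightweight-component API (the component names and base image here are illustrative):

```python
import kfp
from kfp import dsl
from kfp.components import create_component_from_func, InputPath, OutputPath

def produce_frame(frame_csv: OutputPath('CSV')):
    # Serialize the dataframe to the artifact path that KFP provides.
    import pandas as pd
    df = pd.DataFrame({'x': range(1000)})
    df.to_csv(frame_csv, index=False)

def consume_frame(frame_csv: InputPath('CSV')):
    # Deserialize the dataframe from the upstream artifact.
    import pandas as pd
    df = pd.read_csv(frame_csv)
    print(df.describe())

produce_op = create_component_from_func(
    produce_frame, base_image='python:3.9', packages_to_install=['pandas'])
consume_op = create_component_from_func(
    consume_frame, base_image='python:3.9', packages_to_install=['pandas'])

@dsl.pipeline(name='frame-passing')
def frame_pipeline():
    producer = produce_op()
    consume_op(producer.outputs['frame_csv'])
```

The serialize/upload/download/deserialize round trip on every edge is the overhead this request wants to avoid.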

pugangxa commented 3 years ago

What do you mean by passing the results of a pandas dataframe? If it is just internal to Python, I think you should include both steps in the same component. Tekton supports passing data with results or workspaces, and KFP supports passing artifacts; these are the standard ways of sharing data between components, so maybe consider how to split your logic accordingly.
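To illustrate the same-component suggestion: if two steps only ever exchange an in-memory dataframe, folding them into one lightweight component avoids serialization entirely. A sketch (the transform and summary logic are placeholders):

```python
from kfp.components import create_component_from_func, OutputPath

def transform_and_summarize(summary_csv: OutputPath('CSV')):
    # Both the producing and consuming logic run in one container,
    # so the dataframe never leaves process memory.
    import pandas as pd
    df = pd.DataFrame({'x': range(1000)})
    df['y'] = df['x'] ** 2        # formerly the first component
    summary = df.describe()       # formerly the second component
    summary.to_csv(summary_csv)   # only the small final result becomes an artifact

transform_op = create_component_from_func(
    transform_and_summarize, base_image='python:3.9', packages_to_install=['pandas'])
```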

Tomcli commented 3 years ago

For @joeswashington's use case, we would probably need to invent a new custom task controller that does something similar to Spark, where the output of a pipeline task can be kept in the Spark driver's memory. This kind of use case is usually addressed in the Spark community rather than in Tekton, so I would recommend running all the dataframe processing on a Spark cluster and using a KFP-Tekton component as the Spark client.
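A sketch of that client pattern: the component only submits work to an external Spark cluster, so the large intermediate dataframes stay in Spark memory and only a small result comes back as an artifact. The master URL and the base image (which would need Java, pyspark, and pandas preinstalled) are assumptions:

```python
from kfp.components import create_component_from_func, OutputPath

def spark_aggregate(master_url: str, result_csv: OutputPath('CSV')):
    # The heavy dataframe work runs on the Spark cluster, not in this pod.
    from pyspark.sql import SparkSession
    spark = (SparkSession.builder
             .master(master_url)            # e.g. 'spark://spark-master:7077' (assumed)
             .appName('kfp-spark-client')
             .getOrCreate())
    df = spark.range(1_000_000).toDF('n')
    summary = df.groupBy((df.n % 10).alias('bucket')).count()
    summary.toPandas().to_csv(result_csv, index=False)  # only the small summary leaves Spark
    spark.stop()

# Hypothetical image with Java + pyspark + pandas baked in.
spark_op = create_component_from_func(
    spark_aggregate, base_image='my-registry/pyspark-client:latest')
```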

Ark-kun commented 2 years ago

@joeswashington Are you sure your request is feasible?

The producer and consumer tasks probably run on different machines, so the producer needs to send the data out over the network and the consumer container needs to receive it from the network. The producer and consumer also run at different times (the consumer task only starts after the producer task finishes), so the data needs to be stored somewhere. Intermediate data storage is also important for cache reuse: you do not want to run the same data processing or training multiple times.

So it looks like it is inevitable that the produced data gets uploaded somewhere and downloaded again when it needs to be consumed. You cannot really have a distributed system without passing data over the network.

P.S. KFP has a way to seamlessly switch all data-passing to a Kubernetes volume, but we do not really see people using that feature. Kubernetes volumes are also accessed over the network...
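The feature being referred to is the pipeline-level data-passing method in the KFP v1 SDK. A minimal sketch, assuming a pre-created PVC named `data-volume` and reusing the `produce_op`/`consume_op` components from the earlier sketch:

```python
import kfp
from kfp import dsl
from kfp.dsl import data_passing_methods
from kubernetes.client import V1Volume, V1PersistentVolumeClaimVolumeSource

@dsl.pipeline(name='volume-data-passing')
def volume_pipeline():
    # produce_op / consume_op as defined in the earlier sketch.
    producer = produce_op()
    consume_op(producer.outputs['frame_csv'])

# Route every artifact through a mounted volume instead of object storage.
conf = dsl.PipelineConf()
conf.data_passing_method = data_passing_methods.KubernetesVolume(
    volume=V1Volume(
        name='data',
        persistent_volume_claim=V1PersistentVolumeClaimVolumeSource(
            claim_name='data-volume'),
    ),
    path_prefix='artifact_data/',
)

kfp.compiler.Compiler().compile(
    volume_pipeline, 'pipeline.yaml', pipeline_conf=conf)
```

With kfp-tekton you would compile with `kfp_tekton.compiler.TektonCompiler` instead; whether it honors `data_passing_method` the same way would need verifying. Either way, the volume is still network-backed storage, which is the closing point above.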

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.