kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0
3.53k stars · 1.59k forks

[feature] Question about passing complex objects using pickle #6619

Closed · drubinstein closed this 4 months ago

drubinstein commented 2 years ago

Feature Area

/area backend /area sdk /area samples

What feature would you like to see?

This may already be handled, but I noticed that in `_data_passing.py` there is a `Base64Pickle` converter. Would it be possible to add some example code to the documentation showing how to, say, return an object that gets pickled and then unpickled when it is used in a following step of a pipeline? Alternatively (and preferably), could complex objects be detected and pickled/unpickled automatically? (You could even route everything passed between Python components through pickle.)
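For reference, the round trip such a converter implies can be sketched with the standard library alone. This is an illustration of the general base64-pickle technique, not the actual KFP internals; the function names are hypothetical:

```python
import base64
import datetime
import pickle


def to_base64_pickle(obj) -> str:
    """Serialize any picklable object to a command-line-safe string."""
    return base64.b64encode(pickle.dumps(obj)).decode("ascii")


def from_base64_pickle(data: str):
    """Restore the original object from the base64 string."""
    return pickle.loads(base64.b64decode(data.encode("ascii")))


# A complex object survives the string round trip intact.
day = datetime.datetime(2021, 9, 30)
encoded = to_base64_pickle(day)  # plain str, could be passed between steps
assert from_base64_pickle(encoded) == day
```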

What is the use case or pain point?

From what I've read and tried, KFP currently does not support more complex objects such as a numpy array or a datetime as component arguments, parameters, return values, etc. The current best approach is to convert them to a string and then parse that string at the beginning of the next component. If objects were pickled between components and unpickled before the Python function component was called, I could use more accurate types and reduce boilerplate code.

Is there a workaround currently?

Currently, I pass strings around and parse them at the beginning of my components. For example, instead of having

def foo(day: datetime.datetime) -> str:
  # something is done
  return "bar"

I'll do something closer to:

def foo(day: str) -> str:
  day = datetime.datetime.strptime(day, "%Y-%m-%d")
  # something is done
  return "bar"

at the beginning of all my functions.
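The same string workaround extends to array-like data. A minimal sketch using JSON as the intermediate string (illustrative plain functions, not KFP component definitions):

```python
import json


def produce() -> str:
  # Serialize the array-like result to a string for the next component.
  return json.dumps([1.0, 2.5, 3.0])


def consume(values: str) -> float:
  # Parse the string back at the start of the consuming component.
  data = json.loads(values)
  return sum(data) / len(data)


assert consume(produce()) == 6.5 / 3
```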

Love this idea? Give it a 👍. We prioritize fulfilling features with the most 👍.

chensun commented 2 years ago

Hi @drubinstein, sorry for the late reply.

For complex objects such as numpy arrays, you can pass them by file. Here are some docs about passing by file: https://www.kubeflow.org/docs/components/pipelines/sdk/python-function-components/#passing-parameters-by-file or, if you're using Vertex Pipelines, the v2 way: https://www.kubeflow.org/docs/components/pipelines/sdk-v2/v2-component-io/
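The file-passing contract boils down to: the producing component writes the object to a path the system provides, and the consuming component reads it back from the path it is given. A stdlib-only simulation of that contract (in the real SDK the paths are injected via the annotations described in the linked docs; here they are wired by hand):

```python
import os
import pickle
import tempfile


def producer(output_path: str) -> None:
    # Write the large/complex object to the provided output path.
    big_object = {"weights": list(range(1000))}
    with open(output_path, "wb") as f:
        pickle.dump(big_object, f)


def consumer(input_path: str) -> int:
    # Read the object back from the path the pipeline wires in.
    with open(input_path, "rb") as f:
        obj = pickle.load(f)
    return len(obj["weights"])


# The orchestrator would normally create and connect this path.
path = os.path.join(tempfile.mkdtemp(), "artifact.pkl")
producer(path)
assert consumer(path) == 1000
```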

Generally speaking, pickling isn't a great idea. Technically it works for small objects like a dataframe, but you would be passing an unreadable string between components. And if you have a large numpy array, the pickled data could be too large to pass as a string value: components are ultimately containerized apps, and string values are passed into a container on the command line, so they are subject to the command-line length limit.
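To make the size concern concrete: base64 encodes 3 bytes as 4 characters, so a pickled-and-encoded payload is roughly a third larger than the raw data. A quick stdlib-only illustration (the 8 MB figure is just a stand-in for a million-element float64 array):

```python
import base64
import pickle

# Simulate roughly a million float64 values' worth of raw bytes (~8 MB).
raw = bytes(8 * 1_000_000)
encoded = base64.b64encode(pickle.dumps(raw)).decode("ascii")

# The resulting string is over ten million characters, far beyond
# typical command-line argument limits.
assert len(encoded) > 10_000_000
```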

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

rimolive commented 4 months ago

Closing this issue. No activity for more than a year.

/close

google-oss-prow[bot] commented 4 months ago

@rimolive: Closing this issue.

In response to [this](https://github.com/kubeflow/pipelines/issues/6619#issuecomment-2016835273):

> Closing this issue. No activity for more than a year.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.