kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

[feature] Allow apache-beam version greater than 2.50.0 on dataflow component (Vertex AI) #11017

Open caetano-colin opened 1 month ago

caetano-colin commented 1 month ago

Feature Area

/area sdk /area components

What feature would you like to see?

Support for more recent apache-beam versions in the Google Cloud Dataflow component (https://cloud.google.com/vertex-ai/docs/pipelines/dataflow-component).

What is the use case or pain point?

Currently, the Google Cloud Pipeline Components image uses Apache Beam 2.50.0. Google Cloud Dataflow will deprecate that version on August 30, 2024, and it has known issues (https://cloud.google.com/dataflow/docs/support/sdk-version-support-status).

The Dockerfile for the image gcr.io/ml-pipeline/google-cloud-pipeline-components:2.15.0 appears to be: https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/Dockerfile#L38

Is there a workaround currently?

DataflowPythonJobOp does not seem to have a field for specifying a custom image.

There is a field for passing a requirements.txt file, which would probably work if the container running it has network access. However, in secure or isolated environments, where Docker images must be built in advance, the container cannot reach the PyPI repository and therefore cannot download the packages listed in that file. In that case, the user has no choice but to use version 2.50.0.
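
For environments that do have network access, the requirements.txt route could look roughly like the sketch below. It is untested against a real Dataflow job: the Beam version, bucket name, module path, and file layout are all placeholder assumptions, and the keyword names are taken from the documented DataflowPythonJobOp parameters (python_module_path, temp_location, requirements_file_path, args, project, location). The component call itself is left as a comment so the snippet runs without the SDK installed.

```python
import tempfile
from pathlib import Path

# Illustrative pin only (assumption); choose a version Dataflow currently supports.
BEAM_VERSION = "2.56.0"


def write_requirements(directory: Path, beam_version: str = BEAM_VERSION) -> Path:
    """Write a requirements.txt that pins apache-beam[gcp] to beam_version."""
    path = directory / "requirements.txt"
    path.write_text(f"apache-beam[gcp]=={beam_version}\n")
    return path


def dataflow_job_kwargs(bucket: str, project: str, location: str) -> dict:
    """Assemble keyword arguments for DataflowPythonJobOp.

    The bucket layout (src/, tmp/, out/) and wordcount.py are hypothetical.
    """
    return {
        "project": project,
        "location": location,
        "python_module_path": f"gs://{bucket}/src/wordcount.py",
        "temp_location": f"gs://{bucket}/tmp",
        "requirements_file_path": f"gs://{bucket}/src/requirements.txt",
        "args": ["--output", f"gs://{bucket}/out/wc"],
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        req = write_requirements(Path(tmp))
        # Upload this file to the GCS path used in requirements_file_path.
        print(req.read_text(), end="")

    kwargs = dataflow_job_kwargs("my-bucket", "my-project", "us-central1")
    # Inside a KFP pipeline function, with google-cloud-pipeline-components installed:
    #   from google_cloud_pipeline_components.v1.dataflow import DataflowPythonJobOp
    #   DataflowPythonJobOp(**kwargs)
    print(kwargs["requirements_file_path"])
```

This only helps when the launcher container can reach PyPI at run time; it does not address the isolated-environment case described above, which is why a newer Beam version in the base image (or a custom image field) is being requested.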


Love this idea? Give it a 👍.

rimolive commented 1 month ago

/cc @zijianjoy @chensun @connor-mccarthy @james-jwu