kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

[feature] google-cloud component for loading existing VertexDataset #7861

Closed: defoishugo closed this issue 7 months ago

defoishugo commented 2 years ago

Feature Area

/area sdk /area samples /area components

What feature would you like to see?

A new component to load existing VertexDataset. Related to #7792

What is the use case or pain point?

As a user, I have an existing dataset in Vertex AI. I am doing several experiments with different models. Each of my experiments is represented by a pipeline.

When developing a Kubeflow pipeline for Vertex AI, I would like to be able to load an existing VertexDataset instead of using the dataset creation component. But today, a dataset-reading component does not exist, so I am not able to do it.

Is there a workaround currently?

Today, I am not able to do the task. I tried the following:

# Failing attempt: the VertexDataset annotation is exactly what raises the error,
# since there is no importable VertexDataset type available here.
# TEST_ID is a placeholder for the existing Vertex dataset ID.
from kfp.v2.dsl import component, Output

@component(base_image="python:3.9", packages_to_install=["google-cloud-aiplatform"])
def get_data(
    project: str,
    region: str,
    bucket: str,
    dataset: Output[VertexDataset]  # NameError: 'VertexDataset' is not defined
):
    from google.cloud import aiplatform
    # _Dataset is a private class of the aiplatform SDK; this was only an attempt.
    dataset = aiplatform.datasets._Dataset(TEST_ID, project=project, location=region)

This one raises the following error: NameError: name 'VertexDataset' is not defined.


Love this idea? Give it a 👍. We prioritize fulfilling features with the most 👍.

connor-mccarthy commented 2 years ago

Thanks, @defoishugo. This development is currently in progress and should be released with an upcoming v2 alpha release!

defoishugo commented 2 years ago

Thank you @connor-mccarthy.

Just to let you know, I found this workaround for my task:

from kfp.v2.dsl import component, Output, Artifact

@component(base_image="python:3.9", packages_to_install=["google-cloud-aiplatform"])
def get_data(project: str, region: str, dataset_id: str, dataset: Output[Artifact]):
    from google.cloud import aiplatform as aip
    # Load the existing Vertex dataset and expose only its resource name
    # through the metadata of a generic Artifact output.
    vertex_dataset = aip.TabularDataset(dataset_id, project=project, location=region)
    dataset.metadata["resourceName"] = vertex_dataset.resource_name
    dataset.uri = ""

The output is not a VertexDataset object, but it will work if you pass it to a custom job such as CustomContainerTrainingJobRunOp, since the job only uses the resourceName metadata to get the dataset.
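For illustration, here is a minimal sketch of a downstream component that consumes the artifact the same way: it only reads the resourceName metadata and rebuilds the Vertex dataset handle from it. The component name consume_data and its print statement are placeholders, not part of the actual workaround:

```python
from kfp.v2.dsl import component, Input, Artifact

@component(base_image="python:3.9", packages_to_install=["google-cloud-aiplatform"])
def consume_data(project: str, region: str, dataset: Input[Artifact]):
    from google.cloud import aiplatform as aip
    # Rebuild the Vertex dataset handle from the resource name stored by get_data.
    resource_name = dataset.metadata["resourceName"]
    vertex_dataset = aip.TabularDataset(resource_name, project=project, location=region)
    print(vertex_dataset.display_name, vertex_dataset.resource_name)
```

In a pipeline, `get_data(...).outputs["dataset"]` can be wired into a consumer like this (or into a GCPC training op) the same way a real VertexDataset output would be.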

Still, this is not a clean solution. Thank you for developing this feature; I am looking forward to it!

connor-mccarthy commented 2 years ago

Glad you found a workaround, @defoishugo! Programmatic import of the type annotation is indeed the crux of this problem for Vertex artifacts. We have a nice solution for it that I think we should be able to ship soon!

adhaene-noimos commented 2 years ago

@connor-mccarthy Do you have any update regarding this issue?

**Current work-arounds are insufficient.** The work-around I have been using is the importer_node, as mentioned in this article. While this should intuitively work for loading Artifacts, and functionally does the job, it duplicates entries within the ML Metadata store in the Vertex AI project.

**Loading existing Artifacts is a key MLOps functionality.** As a user, it seems clear that something is missing that would allow one to link an Artifact to multiple pipelines and multiple pipeline runs without duplicating ML Metadata entries within Vertex AI. Use cases include running multiple training runs with different models on the same input Dataset, using the same trained model on multiple datasets, re-using the trained model artifact for model evaluation and deployment in separate pipelines, etc.

connor-mccarthy commented 2 years ago

> it duplicates entries within the ML Metadata store in the Vertex AI project

Using the importer argument reimport=False in kfp==2.x.x should avoid duplicating entries in ML Metadata. I think this should resolve the issues you are describing. If not, can you let me know what gaps you're still experiencing?
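For reference, a minimal sketch of such an importer step in KFP v2 (the pipeline name, URI, and metadata values are placeholders, and dsl.Dataset stands in for whatever artifact type you need):

```python
from kfp import dsl

@dsl.pipeline(name="import-existing-dataset")
def my_pipeline():
    # reimport=False reuses a matching artifact already registered in ML Metadata
    # instead of creating a new entry on every pipeline run.
    importer_task = dsl.importer(
        artifact_uri="gs://my-bucket/path/to/dataset",  # placeholder URI
        artifact_class=dsl.Dataset,
        reimport=False,
        metadata={"resourceName": "projects/.../locations/.../datasets/..."},
    )
    # importer_task.output can then be passed to downstream components.
```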

adhaene-noimos commented 2 years ago

Thank you for replying @connor-mccarthy!

**Some Artifact types are not duplicated, even before kfp==2.x.x.** Datasets and Endpoints seem to show the desired behavior. However, dsl.Model is duplicated when importing.

**Pre-release is not yet supported.** I hope this will be fixed in version 2.x.x; however, there is currently no version of google-cloud-pipeline-components that supports the pre-release, so I cannot experiment with this yet. Looking forward to their next version update.

connor-mccarthy commented 2 years ago

Thank you for explaining. I have made a note of this bug.

sumanassarmathg commented 1 year ago

> Glad you found a workaround, @defoishugo! Programmatic import of the type annotation is indeed the crux of this problem for Vertex artifacts. We have a nice solution for it that I think we should be able to ship soon!

Any update on this? It has been several months. I am facing the same issue when trying to pass the output of a BigqueryQueryJobOp, which is of type google.BQTable, to another component.

connor-mccarthy commented 1 year ago

@sumanassarmathg, support for google-namespaced artifact types was released with 2.0.0b5.
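As a sketch of what that support enables (assuming a recent kfp 2.x plus google-cloud-pipeline-components; the component name describe_dataset is illustrative, and the artifact_types import path may vary between GCPC versions), a component can annotate its inputs with the google-namespaced type directly:

```python
from kfp import dsl
from kfp.dsl import Input
from google_cloud_pipeline_components.types.artifact_types import VertexDataset

@dsl.component(
    base_image="python:3.9",
    # The package must also be installed in the component container
    # so the VertexDataset annotation can be resolved at runtime.
    packages_to_install=["google-cloud-pipeline-components"],
)
def describe_dataset(dataset: Input[VertexDataset]):
    # The artifact carries the Vertex resource name in its metadata.
    print(dataset.metadata["resourceName"])
```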

defoishugo commented 1 year ago

@connor-mccarthy

> **Current work-arounds are insufficient.** The work-around I have been using is the importer_node, as mentioned in this article. While this should intuitively work for loading Artifacts, and functionally does the job, it duplicates entries within the ML Metadata store in the Vertex AI project.

By the way, the current work-arounds do not help with the following use case: taking a dataset ID as a parameter of the pipeline (which is the case for training pipelines).

The three solutions to import the dataset would be the following:

Currently, there is no solution for importing a VertexDataset using only the dataset ID. The same issue occurs with models, which you cannot import based on a VertexModel ID.

Do we have any solution regarding this issue? @adhaene-noimos, maybe you have found a work-around?

And a more important question: supposing we could take a parameterized VertexDataset as input, what is the good practice? What is the vision of Kubeflow and Vertex AI on this very important topic?

In my mind, lineage is really the core of MLOps, and from what I see, I am not the only one training a lot of models on Vertex AI who wants one pipeline for all of them... which means we should be able to have a dataset as an input of a pipeline (as a dataset ID or in some other way).
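To make the use case concrete, here is a minimal sketch (reusing the get_data workaround component from above; the pipeline name is illustrative) of a pipeline that takes the dataset ID as a plain parameter:

```python
from kfp import dsl

@dsl.pipeline(name="train-on-existing-dataset")
def training_pipeline(project: str, region: str, dataset_id: str):
    # dataset_id is an ordinary pipeline parameter; the workaround component
    # turns it into an artifact carrying the Vertex resource name.
    data_task = get_data(project=project, region=region, dataset_id=dataset_id)
    # data_task.outputs["dataset"] can then be fed to a training component
    # or a google-cloud-pipeline-components training op.
```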

rimolive commented 7 months ago

Closing this issue. No activity for more than a year.

/close

google-oss-prow[bot] commented 7 months ago

@rimolive: Closing this issue.

In response to [this](https://github.com/kubeflow/pipelines/issues/7861#issuecomment-2035032836):

> Closing this issue. No activity for more than a year.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.