Thanks, @defoishugo. This development is currently in progress and should be released with an upcoming v2 alpha release!
Thank you @connor-mccarthy.
Just to let you know, I found this workaround for my task:
```python
# Wrap an existing Vertex dataset in a generic Artifact via its resource name.
from kfp.dsl import Artifact, Output, component  # on kfp 1.8.x: from kfp.v2.dsl import ...

@component(base_image="python:3.9", packages_to_install=["google-cloud-aiplatform"])
def get_data(project: str, region: str, dataset_id: str, dataset: Output[Artifact]):
    from google.cloud import aiplatform as aip
    vertex_dataset = aip.TabularDataset(dataset_id, project=project, location=region)
    dataset.metadata["resourceName"] = vertex_dataset.resource_name
    dataset.uri = ""
```
The output is not a VertexDataset object, but it will work if you pass it to a custom job like CustomContainerTrainingJobRunOp, since the job only uses the resource name in the artifact metadata to get the dataset.
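For reference, a rough sketch of wiring this into a pipeline; the CustomContainerTrainingJobRunOp import path and parameter names below are my assumptions based on the GCPC 1.x reference, so adjust them to your installed version:

```python
# Rough sketch (assumed GCPC import path and parameter names): pass the workaround
# artifact to the training op, which only reads metadata["resourceName"].
from kfp.v2 import dsl
from google_cloud_pipeline_components.aiplatform import CustomContainerTrainingJobRunOp

@dsl.pipeline(name="train-on-existing-dataset")
def train_pipeline(project: str, region: str, dataset_id: str):
    data = get_data(project=project, region=region, dataset_id=dataset_id)
    CustomContainerTrainingJobRunOp(
        project=project,
        location=region,
        display_name="my-training-job",
        container_uri="gcr.io/my-project/my-trainer:latest",  # hypothetical image
        staging_bucket="gs://my-staging-bucket",               # hypothetical bucket
        dataset=data.outputs["dataset"],
    )
```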
Still, this is not a clean solution. Thank you for developing this feature; I'm looking forward to it!
Glad you found a workaround, @defoishugo! Programmatic import of the type annotation is indeed the crux of this problem for Vertex artifacts. We have a nice solution for it that I think we should be able to ship soon!
@connor-mccarthy Do you have any update regarding this issue?
Current work-arounds are insufficient
The work-around I have been using is the importer_node, as mentioned in this article. While this should intuitively work for loading Artifacts and functionally does the job, it duplicates entries within the ML Metadata store in the Vertex AI project.
Loading existing Artifacts is a key MLOps functionality
As a user, it seems like there is clearly something missing that would allow one to link an Artifact to multiple pipelines and multiple pipeline runs without duplicating ML Metadata entries within Vertex AI. Use cases include running multiple training runs with different models on the same input Dataset, using the same trained model on multiple datasets, re-using the trained model artifact for model evaluation and deployment in separate pipelines, etc.
> it duplicates entries within the ML Metadata store in the VertexAI project

Using the importer argument reimport=False in kfp==2.x.x should avoid duplicating entries in ML Metadata. I think this should resolve the issues you are describing. If not, can you let me know what gaps you're still experiencing?
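For reference, a minimal sketch of that importer usage; the VertexDataset class path and the artifact URI format are assumptions on my side (a plain dsl.Artifact also works):

```python
# Minimal sketch: re-attach an existing Vertex dataset with the KFP v2 importer.
# reimport=False reuses the existing MLMD entry instead of creating a new one.
from kfp import dsl
from google_cloud_pipeline_components.types.artifact_types import VertexDataset  # assumed path

DATASET_RESOURCE_NAME = "projects/123/locations/us-central1/datasets/456"  # hypothetical

@dsl.pipeline(name="import-existing-dataset")
def import_pipeline():
    importer = dsl.importer(
        artifact_uri=f"https://us-central1-aiplatform.googleapis.com/v1/{DATASET_RESOURCE_NAME}",
        artifact_class=VertexDataset,
        reimport=False,
        metadata={"resourceName": DATASET_RESOURCE_NAME},
    )
    # importer.output is the imported artifact and can be passed to downstream components
```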
Thank you for replying @connor-mccarthy!
Some Artifact types are not duplicated, even before kfp==2.x.x
Datasets and Endpoints seem to show the desired behavior. However, dsl.Model is duplicated when importing.
Pre-release is not yet supported
Hoping this will be fixed in version 2.x.x; however, there is currently no version of google-cloud-pipeline-components that supports the pre-release, so I cannot experiment with this yet. Looking forward to their next version update.
Thank you for explaining. I have made a note of this bug.
Any update on this? It's been several months. I'm facing the same issue when trying to pass the output of a BigqueryQueryJobOp, which is of type google.BQTable, to another component.
@sumanassarmathg, support for google-namespaced artifact types was released with 2.0.0b5.
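For anyone hitting the same thing, a rough sketch of consuming such an output in a downstream component; the artifact_types module path and version constraints are assumptions, so check the GCPC release notes:

```python
# Rough sketch: consume the google.BQTable produced by BigqueryQueryJobOp
# (assumes kfp>=2.0.0b5 and a matching google-cloud-pipeline-components release).
from kfp import dsl
from google_cloud_pipeline_components.types.artifact_types import BQTable

@dsl.component(
    base_image="python:3.9",
    packages_to_install=["google-cloud-pipeline-components"],
)
def consume_table(table: dsl.Input[BQTable]):
    # BQTable artifacts carry projectId / datasetId / tableId in their metadata.
    print(table.metadata)
```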
@connor-mccarthy
> Current work-arounds are insufficient
> The work-around I have been using is the importer_node, as mentioned in this article. While this should intuitively work for loading Artifacts and functionally does the job, it duplicates entries within the ML Metadata store in the Vertex AI project.
By the way, the current work-arounds do not help for the following use case: taking a dataset ID as a parameter of the pipeline (which is the case for training pipelines).
The three solutions for importing the dataset would be the following:
Currently, there is no solution for importing a VertexDataset using only the dataset ID. The same issue occurs with models: you cannot import one based on a VertexModel ID alone.
Do we have any solution regarding this issue? @adhaene-noimos maybe you found work-arounds?
And a more important question: how should a parameterized VertexDataset be taken as a pipeline input? What is the good practice? What is the vision of Kubeflow and Vertex AI on this very important topic?
In my mind, lineage is really the core of MLOps, and from what I see, I am not the only one who is training a lot of models on Vertex AI and wants to have one pipeline for all of them... which means that we should be able to have a dataset as an input of a pipeline (it could be a dataset ID or another way).
Closing this issue. No activity for more than a year.
/close
@rimolive: Closing this issue.
Feature Area
/area sdk /area samples /area components
What feature would you like to see?
A new component to load an existing VertexDataset. Related to #7792
What is the use case or pain point?
As a user, I have an existing dataset in Vertex AI. I am doing several experiments with different models, and each of my experiments is represented by a pipeline.
When developing a Kubeflow pipeline for Vertex AI, I would like to be able to load an existing VertexDataset instead of using the dataset creation component. But today no dataset-reading component exists, so I am not able to do it.
Is there a workaround currently?
Today, I am not able to do the task. I tried the following:
This one raises the following error:
NameError: name 'VertexDataset' is not defined
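A hypothetical component along these lines (an illustration, not the exact attempt) reproduces that error, because VertexDataset is referenced in the annotation without being importable:

```python
# Hypothetical illustration: referencing VertexDataset in an output annotation
# without an import raises NameError when the component is defined.
from kfp.v2.dsl import Output, component

@component(base_image="python:3.9")
def get_data(dataset_id: str, dataset: Output[VertexDataset]):  # NameError here
    ...
```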
Love this idea? Give it a 👍. We prioritize fulfilling features with the most 👍.