Looks like the same case as in this sample: https://github.com/kubeflow/pipelines/blob/master/samples/v2/pipeline_with_importer.py
According to the importer docs (https://www.kubeflow.org/docs/components/pipelines/v2/components/importer-component/), importer is used to import existing artifacts. But it also says it can use as an artifact an external file that was not generated by a pipeline at all. Does that mean it can import any file or folder from a GCS bucket?
I am testing this sample in a different environment right now, but I'd like to highlight this part of the doc:
"If you wish to use an existing artifact that was not generated by a task in the current pipeline or wish to use as an artifact an external file that was not generated by a pipeline at all..."
So according to this, yes, it is possible to inject external files into your pipeline as artifacts.
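For reference, a minimal sketch of that pattern with dsl.importer (the URI and pipeline name are illustrative):

```python
from kfp import dsl


@dsl.pipeline(name="import-gcs-dataset")
def my_pipeline():
    # Register an existing GCS object as a Dataset artifact without
    # copying it through a download component first.
    importer_task = dsl.importer(
        artifact_uri="gs://my-bucket/datasets/train",
        artifact_class=dsl.Dataset,
        reimport=False,
    )
    # importer_task.output can be passed to downstream components.
```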
Did you have the chance to look at it?
Closing this issue. No activity.
/close
@rimolive: Closing this issue.
Feature Area
/area components
What feature would you like to see?
Maybe this feature already exists but I couldn't find it.
The first step of our pipelines is often a component that just downloads a dataset (folders and files) from a GCS bucket into a dsl.Dataset artifact.
In our case the component looks roughly like this (a minimal sketch using the KFP v2 SDK and google-cloud-storage; the names are illustrative):
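```python
from kfp import dsl


@dsl.component(packages_to_install=["google-cloud-storage"])
def download_dataset(bucket: str, prefix: str, dataset: dsl.Output[dsl.Dataset]):
    """Copy every object under `prefix` from a GCS bucket into the Dataset artifact."""
    import os
    from google.cloud import storage

    client = storage.Client()
    for blob in client.list_blobs(bucket, prefix=prefix):
        if blob.name.endswith("/"):  # skip "directory" placeholder objects
            continue
        local_path = os.path.join(dataset.path, os.path.relpath(blob.name, prefix))
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        blob.download_to_filename(local_path)
```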
Once the component completes, the artifact is uploaded to MinIO. So the dataset is transferred twice: once from GCS to the pod, and once from the pod to MinIO. Why not do it in a single transfer?
What is the use case or pain point?
Doing this in a single transfer would make dataset caching faster. The functionality could be encapsulated in a dedicated artifact type or in a Kubeflow component, so we wouldn't have to rewrite this component.
Is there a workaround currently?
Not using the Dataset artifact and instead downloading the dataset directly inside the components that need it, using artifacts only for the intermediate (preprocessed) datasets. A sketch of that pattern (the names and paths are illustrative):
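```python
from kfp import dsl


@dsl.component(packages_to_install=["google-cloud-storage"])
def train(bucket: str, blob_name: str, model: dsl.Output[dsl.Model]):
    """Download the raw data directly inside the consuming component."""
    from google.cloud import storage

    storage.Client().bucket(bucket).blob(blob_name).download_to_filename("/tmp/data")
    # ...train on /tmp/data; only intermediate/preprocessed outputs are
    # written to artifacts (e.g. under model.path).
```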
Love this idea? Give it a 👍.