Closed: parthmishra closed this issue 3 months ago
Taking HTML visualization as an example, the usage in v2 will look like:

```python
@component
def write_html(html_artifact: Output[HTML]):
    html_content = '<!DOCTYPE html><html><body><h1>Hello world</h1></body></html>'
    with open(html_artifact.path, 'w') as f:
        f.write(html_content)
```

Here you specify `Output[]` with artifact type `HTML`. Similarly, you can do the same for Model, Metrics, etc.
For example: https://github.com/kubeflow/pipelines/blob/master/samples/test/metrics_visualization_v2.py#L28
Does it work for your use case?
cc @chensun
I'm aware that you can do this when creating a component with the v2 decorator; I'm asking if it's possible to do so in a component that was not generated with the v2 decorator, i.e. a completely separate component that is loaded from its YAML definition (perhaps even written in a different language).
Hi @parthmishra,
At this moment, it would be quite challenging for a user to replicate the support for `Input[Model]`, `Output[Metrics]`, etc. in their custom container.
Here's a code sample of what the container interface looks like using the v2 `@component` decorator: https://github.com/kubeflow/pipelines/blob/d69b6ae82a8c4afcf4bd3e7d444089302ba23e28/sdk/python/kfp/v2/compiler_cli_tests/test_data/lightweight_python_functions_v2_pipeline.json#L143-L149 (the Python code is inlined in the container commands, but it could be moved inside the container).
If you inspect that code sample, you will find that, in addition to the user code, it also contains the entire contents of the following files:
- https://github.com/kubeflow/pipelines/blob/d69b6ae82a8c4afcf4bd3e7d444089302ba23e28/sdk/python/kfp/components/executor.py
- https://github.com/kubeflow/pipelines/blob/d69b6ae82a8c4afcf4bd3e7d444089302ba23e28/sdk/python/kfp/components/executor_main.py
- https://github.com/kubeflow/pipelines/blob/d69b6ae82a8c4afcf4bd3e7d444089302ba23e28/sdk/python/kfp/dsl/io_types.py
So technically, you could implement your own version following this code, but we don't expect users to do so.
It is on our roadmap to help users include such code in their own components by packaging and installing it into their custom container -- assuming the container is Python based. For non-Python based containers, we would document the expected interface so that users can follow it to implement their own.
@chensun following from this, is it possible to pass something like a pandas DataFrame (as a csv file probably) throughout custom components? For example, let's assume there are four components that all do different types of preprocessing for our data; how can we pass the data through without specifying an outside filepath (i.e. string that is a GCS path)?
@chensun
Thanks for the explanation. I think the v2 SDK docs for "regular" component building should state that these Artifact types are not usable, and that users wishing to implement these inputs/outputs should instead write them using Python-function based components. The current docs are misleading in this regard and make it seem like there is full feature parity between the two methods of implementing components.
@nicokuzak
Yes, you can pass a DataFrame as a file between custom components, and you don't need to provide a GCS path yourself; the system generates one.
If you write a component.yaml, you use the `{outputPath: output_name}` placeholder; if you write a Python-function based component, you type-annotate the output like `output_name: OutputPath('CSV')`. At runtime, your code receives a local file path, and you dump the DataFrame object into that file. The downstream component takes such an output as an input using `{inputPath: input_name}` or `input_name: InputPath('CSV')`, then reads the file and loads it back into a DataFrame object.
On Vertex Pipelines, the local path is backed by GCS Fuse, meaning it maps to a GCS location like gs://my-bucket/some-blob; whatever content you write to the local file is "synced" to that GCS location.
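Outside of KFP, the runtime contract described above is simple: the upstream component dumps its tabular data to the local path the system hands it, and the downstream component reads from the path it is handed. A minimal sketch of that handoff using only the standard library (the component functions and file names here are illustrative, not part of KFP):

```python
import csv
import os
import tempfile

def preprocess_a(output_csv_path):
    # Upstream "component": dump tabular data to the system-provided path.
    with open(output_csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["feature", "value"])
        writer.writerow(["x", "1"])

def preprocess_b(input_csv_path):
    # Downstream "component": load the file back into rows
    # (with pandas, this would be pd.read_csv(input_csv_path)).
    with open(input_csv_path, newline="") as f:
        return list(csv.reader(f))

# KFP/Vertex would generate this path for you; here we fake one locally.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "data.csv")
preprocess_a(path)
rows = preprocess_b(path)
print(rows[0])  # ['feature', 'value']
```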
@parthmishra Thank you for your feedback. Agree that our current doc isn't in a good shape. We will continuously improve our documentation. Meanwhile, we're designing the next generation of component authoring experience that could make these Artifacts types available for custom container components as well.
I would like to add that it would be nice to be able to use the v2 Metrics capabilities with the manually written component.yaml approach. Since this is not possible, I have to create ops from Python functions directly if I want to use the Metrics capability (for display in Vertex AI). As mentioned here, in v1 this was possible by dumping to a specified JSON file, from which the Metrics were read. If there are any workarounds, please let us know. Thank you.
Preferred file structure, as noted:
- build.sh
- component.yaml
- Dockerfile
- src/train.py
This structure is better for more complex code and portability; however, without the Metrics capability, this approach can't be used when Metrics are needed.
@vemqar I could be wrong, but if you're using Python for your component, you can use the component decorator to output a component.yaml definition file (and specify a custom base image as the output of build.sh). The function you decorate can essentially just serialize inputs, pass them to the rest of your code (e.g. src/train.py), and then serialize the outputs. Not ideal, as you clutter up the component.yaml file with inlined code, but I don't see why it wouldn't work.
Perhaps you could also just directly use the KFP SDK to serialize Artifacts, which is essentially what the inlined code does for Python-function based components anyway.
Thanks @parthmishra for the advice. I did try to essentially import the `Metrics` class definition into my src/ code, but it doesn't work because it needs to be initialized. I realized that the kfp `Artifact` needs to be initialized with GCP paths specifying where the Metrics artifacts will be stored. So as I understand it, I would essentially have to initialize `Artifact` myself with a `uri` path and then call `Metrics`. If you have a working example though, I would appreciate it. I did inspect the component.yaml of a function with a Metrics output and didn't see an obvious way to integrate that into a custom-written component file.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I found a somewhat hacky solution to this problem. I'm using Kubeflow's `Executor` class (the one used by function-based components) to easily instantiate the Artifact objects. I could iterate through `executor_input` and create all the objects myself, but I think it's a lot more convenient to use `Executor`, even if I'm not using it for what it was designed.
You need to include `{executorInput: null}` in your component.yaml file, and your Python script would look something like this:
```python
from kfp.v2.components.executor import Executor
from kfp.v2.dsl import Metrics, Model
import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument("--executor-input", type=str, required=True)
args = parser.parse_args()

# load the executor's arguments
executor_input = json.loads(args.executor_input)

# set up the executor (the no-op lambda stands in for the user function)
executor = Executor(executor_input, lambda x: x)

# get the Kubeflow artifact objects
metrics: Metrics = executor._output_artifacts['metrics']
model: Model = executor._output_artifacts['model']

# log metrics
metrics.log_metric("accuracy", 0.9)

# save the model
with open(model.path, "w") as f:
    f.write("data")

# write the outputs
executor._write_executor_output()
```
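For completeness, a minimal component.yaml wiring up this script might look roughly like the following sketch; the component name, image, and output names are placeholders of my own, not taken from the attached files:

```yaml
# Hypothetical component.yaml sketch for the executor-input trick above.
name: Custom component with artifacts
outputs:
  - {name: metrics, type: Metrics}
  - {name: model, type: Model}
implementation:
  container:
    image: python:3.9  # placeholder; a real component would use a custom image
    command: [python, example_script.py]
    args:
      - --executor-input
      - {executorInput: null}
```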
I'm also attaching all the files necessary to run this example, as well as some screenshots to show that it works (at least on Vertex AI pipelines). Just so we don't have to build and publish a Docker image, I included the Python script in the component.yaml file.
Edit: after commenting I realized what I did was kind of what was suggested in https://github.com/kubeflow/pipelines/issues/6116#issuecomment-885506281 . So I just wanted to give them credits.
@jordyantunes Man, you're a genius!! Thank you so much!
With your permission, I made a few changes to your code to also accept input artifacts and input parameters (a little rusty on that part, but I did it in a rush haha) and put it in a repo so anyone can use it as an example (I'm planning to put it in a Medium article explaining how to implement CI/CD in Vertex Pipelines, and of course I will mention you :D)
In kfp 2.x.x, a few things have changed regarding the executor manipulation. Here is an updated version of the solution proposed in https://github.com/kubeflow/pipelines/issues/6116#issuecomment-1059174206
```python
from kfp import dsl
from kfp.dsl.executor import Executor
import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument("--executor_input", type=str, required=True)  # --executor_input instead of --executor-input in 2.x.x
args = parser.parse_args()

# Load executor input
executor_input = json.loads(args.executor_input)

# Create executor
executor = Executor(executor_input, lambda x: x)

# Get output artifacts
metrics: dsl.Metrics = executor.get_output_artifact('metrics')  # executor methods can be used directly in 2.x.x
model: dsl.Model = executor.get_output_artifact('model')

# Log some metric
metrics.log_metric("accuracy", 0.9)

# Save model
with open(model.path, "w") as f:
    f.write("data")

# Save outputs
executor.write_executor_output()  # This method is no longer private
```
To fill in the placeholder values in the `--executor_input` parameter, you should use `{{$}}` instead of `{executorInput: null}`. kfp has conveniently defined a class for that; here is an example of a call to the script above:
```python
from kfp import dsl
# from kfp.dsl.placeholders import ExecutorInputPlaceholder  # This import is no longer needed starting from kfp 2.5.0

@dsl.container_component
def custom_component_artifact_manipulation(
    model: dsl.Output[dsl.Model],
    metrics: dsl.Output[dsl.Metrics],
    # Other inputs/outputs if needed
):
    return dsl.ContainerSpec(
        image="<your_base_image>",
        command=[
            "python",
            "example_script.py",  # example_script.py contains code similar to the block above
        ],
        args=[
            "--executor_input",
            dsl.PIPELINE_TASK_EXECUTOR_INPUT_PLACEHOLDER,  # before kfp 2.5.0, use ExecutorInputPlaceholder()._to_string()
            # Your other arguments go here
        ],
    )
```
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
Using the v2 SDK and the Vertex Pipelines environment, is it possible to create a reusable component (i.e. manually write a component.yaml file) that consumes and/or generates the new Artifact types such as Model, Metrics, Dataset, etc.?
My understanding of these Artifact types is that they are a value/path/reference along with associated metadata. When passing or consuming them in a non-Python-based component, it seems I can only reference or generate an Artifact's path and nothing else.
For example, in the v1 SDK, it was possible to generate metrics that could be visualized by just by dumping a JSON object to the given output path. This allowed the possibility of using non-Python-based components to generate metrics and other metadata.
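For reference, the v1 mechanism alluded to here was the special metrics output (commonly written as mlpipeline-metrics): the component dumped a JSON object to the given output path and the UI rendered it. A sketch of that shape, with an example value of my own:

```json
{
  "metrics": [
    {
      "name": "accuracy-score",
      "numberValue": 0.9,
      "format": "PERCENTAGE"
    }
  ]
}
```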
Is such a thing possible in v2/Vertex Pipelines? If not, is it on the roadmap or is the recommendation to port all components to lightweight Python components?