kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

How to output Artifacts (Model, Metrics, Dataset, etc.) without using Python-based component? #6116

Closed parthmishra closed 3 months ago

parthmishra commented 3 years ago

Using the v2 SDK and the Vertex Pipelines environment, is it possible to create a reusable component (i.e. manually write a component.yaml file) that consumes and/or generates the new Artifact types such as Model, Metrics, Dataset, etc.?

My understanding of these Artifact types is that they are a value/path/reference along with associated metadata. When passing or consuming these in a non-Python-based component, it seems I can only reference or generate an Artifact's path and nothing else.

For example, in the v1 SDK, it was possible to generate metrics that could be visualized just by dumping a JSON object to the given output path. This made it possible to use non-Python-based components to generate metrics and other metadata.
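
For reference, the v1-era pattern looked roughly like this (a minimal sketch; in practice the output path is injected through an {outputPath: ...} placeholder, and the metric name is illustrative):

import json

# v1-style metrics: any component, in any language, could emit this JSON shape
# to its declared output path and the UI would pick it up for visualization.
metrics = {
    "metrics": [
        {"name": "accuracy-score", "numberValue": 0.9, "format": "PERCENTAGE"},
    ]
}

output_path = "/tmp/outputs/mlpipeline_metrics/data"  # illustrative; supplied by the system in practice
with open(output_path, "w") as f:
    json.dump(metrics, f)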

Is such a thing possible in v2/Vertex Pipelines? If not, is it on the roadmap or is the recommendation to port all components to lightweight Python components?

zijianjoy commented 3 years ago

Taking HTML visualization as an example, the usage on V2 will look like:

from kfp.v2.dsl import component, Output, HTML

@component
def write_html(html_artifact: Output[HTML]):
    html_content = '<!DOCTYPE html><html><body><h1>Hello world</h1></body></html>'
    with open(html_artifact.path, 'w') as f:
        f.write(html_content)

Here you specify Output[] with artifact type HTML. Similarly, you can do it for Model, Metrics, etc. For example: https://github.com/kubeflow/pipelines/blob/master/samples/test/metrics_visualization_v2.py#L28
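
For instance, a Metrics output follows the same pattern (a minimal sketch; the metric name and value are illustrative):

@component
def eval_model(metrics: Output[Metrics]):
    # log_metric records a scalar that the UI can display for the run
    metrics.log_metric("accuracy", 0.9)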

Does this work for your use case?

cc @chensun

parthmishra commented 3 years ago

Taking HTML visualization as an example, the usage on V2 will look like:

from kfp.v2.dsl import component, Output, HTML

@component
def write_html(html_artifact: Output[HTML]):
    html_content = '<!DOCTYPE html><html><body><h1>Hello world</h1></body></html>'
    with open(html_artifact.path, 'w') as f:
        f.write(html_content)

Here you specify Output[] with artifact type HTML. Similarly, you can do it for Model, Metrics, etc.

For example: https://github.com/kubeflow/pipelines/blob/master/samples/test/metrics_visualization_v2.py#L28

Does this work for your use case?

cc @chensun

I'm aware that you can do this when creating a component with the v2 decorator; I'm asking whether it's possible to do so in a component that was not generated with the v2 decorator, like a completely separate component that is loaded in from its YAML definition (perhaps even written in a different language).

chensun commented 3 years ago

Hi @parthmishra,

At this moment, it would be quite challenging for a user to replicate the support for Input[Model], Output[Metrics], etc. in their custom container.

Here's sample code showing what the container interface would look like using the v2 @component decorator: https://github.com/kubeflow/pipelines/blob/d69b6ae82a8c4afcf4bd3e7d444089302ba23e28/sdk/python/kfp/v2/compiler_cli_tests/test_data/lightweight_python_functions_v2_pipeline.json#L143-L149 (the Python code is inlined in the container commands, but it could be moved inside the container).

If you were able to inspect that code sample, you would find that, other than the user code, it also contains the entire contents of the following files:
https://github.com/kubeflow/pipelines/blob/d69b6ae82a8c4afcf4bd3e7d444089302ba23e28/sdk/python/kfp/components/executor.py
https://github.com/kubeflow/pipelines/blob/d69b6ae82a8c4afcf4bd3e7d444089302ba23e28/sdk/python/kfp/components/executor_main.py
https://github.com/kubeflow/pipelines/blob/d69b6ae82a8c4afcf4bd3e7d444089302ba23e28/sdk/python/kfp/dsl/io_types.py

So technically, you could implement your own version following this code. But we are not expecting users to do so.

It is on our roadmap to help users include such code in their own components by packaging and installing it into their custom container -- assuming the container is Python-based. For non-Python-based containers, we would document the expected interface so that users can follow it to implement their own.

nicokuzak commented 3 years ago

@chensun following on from this, is it possible to pass something like a pandas DataFrame (probably as a CSV file) between custom components? For example, let's assume there are four components that each do a different type of preprocessing on our data; how can we pass the data through without specifying an outside filepath (i.e. a string that is a GCS path)?

parthmishra commented 3 years ago

@chensun

Thanks for the explanation. I think the v2 SDK docs for "regular" component building should state that these Artifact types are not usable and that users wishing to implement these inputs/outputs should instead write them as Python-function-based components. The current docs are misleading in this regard and make it seem like there is full feature parity between the two methods of implementing components.

chensun commented 3 years ago

@chensun following on from this, is it possible to pass something like a pandas DataFrame (probably as a CSV file) between custom components? For example, let's assume there are four components that each do a different type of preprocessing on our data; how can we pass the data through without specifying an outside filepath (i.e. a string that is a GCS path)?

@nicokuzak Yes, you can pass a DataFrame as a file between custom components, and you don't need to provide a GCS path yourself; the system generates such a path. If you write a component.yaml, you use the {outputPath: output_name} placeholder; if you write a Python-function-based component, you annotate the output like so: output_name: OutputPath('CSV'). At runtime, your code should expect a local file path, and you dump the DataFrame object into that file. The downstream component can take such an output as an input using {inputPath: input_name} or input_name: InputPath('CSV'), read from the file, and load it back into a DataFrame object.

On Vertex Pipelines, the local path is backed by GCS Fuse, meaning it maps to a GCS location like gs://my-bucket/some-blob; whatever content you write to the local file will be "synced" to the GCS location.
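
As a rough sketch of the Python-function-based variant described above (the component names, pandas usage, and CSV contents are illustrative; the same contract applies to a hand-written component.yaml through the {outputPath: ...} and {inputPath: ...} placeholders):

from kfp.v2.dsl import component, InputPath, OutputPath

@component(packages_to_install=["pandas"])
def make_dataset(dataset_path: OutputPath('CSV')):
    import pandas as pd
    # the system generates the local path; on Vertex it is backed by GCS Fuse
    df = pd.DataFrame({"a": [1, 2, 3]})
    df.to_csv(dataset_path, index=False)

@component(packages_to_install=["pandas"])
def preprocess(dataset_path: InputPath('CSV'), processed_path: OutputPath('CSV')):
    import pandas as pd
    df = pd.read_csv(dataset_path)
    df["a"] = df["a"] * 2
    df.to_csv(processed_path, index=False)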

chensun commented 3 years ago

@chensun

Thanks for the explanation. I think the v2 SDK docs for "regular" component building should state that these Artifact types are not usable and that users wishing to implement these inputs/outputs should instead write them as Python-function-based components. The current docs are misleading in this regard and make it seem like there is full feature parity between the two methods of implementing components.

@parthmishra Thank you for your feedback. Agreed that our current docs aren't in good shape; we will continuously improve our documentation. Meanwhile, we're designing the next generation of the component authoring experience, which could make these Artifact types available for custom container components as well.

vemqar commented 3 years ago

I would also like to add that it would be nice to be able to use the v2 Metrics capabilities with the manually written component.yaml approach. Since this is not possible, I have to create ops from Python functions directly if I want to use the Metrics capability (for display in Vertex AI). As mentioned here, in v1 this was possible by dumping to a specified JSON file, and the metrics were read from there. If there are any workarounds, please let us know. Thank you.

Preferred file structure example, as noted:
- build.sh
- component.yaml
- Dockerfile
- src/train.py

This structure is optimal for more complex code and portability; however, without the Metrics capability, this approach cannot be used when metrics are needed.

parthmishra commented 3 years ago

I would also like to add that it would be nice to be able to use the v2 Metrics capabilities with the manually written component.yaml approach. Since this is not possible, I have to create ops from Python functions directly if I want to use the Metrics capability (for display in Vertex AI). As mentioned here, in v1 this was possible by dumping to a specified JSON file, and the metrics were read from there. If there are any workarounds, please let us know. Thank you.

Preferred file structure example, as noted:
- build.sh
- component.yaml
- Dockerfile
- src/train.py

This structure is optimal for more complex code and portability; however, without the Metrics capability, this approach cannot be used when metrics are needed.

@vemqar I could be wrong, but if you're using Python for your component, you can use the component decorator to output a component.yaml definition file (and specify the custom base image produced by build.sh). The function you decorate can essentially just be used for serializing inputs, passing them to the rest of your code (e.g. src/train.py), and then serializing the outputs. Not ideal, as you clutter up the component.yaml file with inlined code, but I don't see why it wouldn't work.

Perhaps you could also just use the KFP SDK directly to serialize Artifacts, which is essentially what the inlined code does for Python-function-based components anyway.
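
A rough sketch of that wrapper idea (the base image, the src.train module, and its return value are assumptions made for illustration):

from kfp.v2.dsl import component, Output, Metrics

@component(base_image="gcr.io/my-project/trainer:latest")  # image built by build.sh (illustrative)
def train_wrapper(learning_rate: float, metrics: Output[Metrics]):
    # delegate the real work to code baked into the image
    from src.train import train  # assumes train() returns a dict of metric name -> value
    results = train(learning_rate=learning_rate)
    for name, value in results.items():
        metrics.log_metric(name, value)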

vemqar commented 3 years ago

Thanks @parthmishra for the advice. I did try to essentially import the Metrics class definition into my src/ code, but it doesn't work because it needs to be initialized. I realized that the kfp Artifact needs to be initialized with GCS paths which specify where the Metrics artifacts will be stored. From what I understand, I would essentially have to initialize the Artifact with a URI path myself and then call Metrics. If you have a working example, though, I would appreciate it. I did inspect the component.yaml of a function with a Metrics output and didn't see an obvious way to integrate that into a custom-written component file.
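
For what it's worth, the manual instantiation being described would look roughly like this (a minimal sketch; the URI is illustrative, and the metadata still has to end up in the executor output for the backend to record it, which is exactly the awkward part):

from kfp.v2.dsl import Metrics

metrics = Metrics(name="metrics", uri="gs://my-bucket/pipeline-run/metrics", metadata={})
metrics.log_metric("accuracy", 0.9)  # only updates the in-memory metadata dict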

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

jordyantunes commented 2 years ago

I found a somewhat hacky solution to this problem. I'm using Kubeflow's Executor class (the one used by function-based components) to easily instantiate the Artifact objects. I could iterate through executor_input and create all the objects myself, but I think it's a lot more convenient to use Executor, even if I'm not using it for what it was designed for.

You need to include {executorInput: null} in your component.yaml file, and your Python script would look something like this:

from kfp.v2.components.executor import Executor
from kfp.v2.dsl import Metrics, Model
import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument("--executor-input", type=str, required=True)

args = parser.parse_args()

# load the executor input passed in by the pipeline backend
executor_input = json.loads(args.executor_input)

# instantiate the Executor just to have it build the artifact objects
executor = Executor(executor_input, lambda x: x)

# get the Kubeflow artifact objects
metrics: Metrics = executor._output_artifacts['metrics']
model: Model = executor._output_artifacts['model']

# log metrics
metrics.log_metric("accuracy", 0.9)

# save the model
with open(model.path, "w") as f:
    f.write("data")

# write the executor output so the backend records the artifacts
executor._write_executor_output()

I'm also attaching all the files necessary to run this example, as well as some screenshots to show that it works (at least on Vertex AI Pipelines). Just so we don't have to build and publish a Docker image, I included the Python script in the component.yaml file.

Code:

code.zip

Screenshots:

(two screenshots of the Vertex AI Pipelines UI were attached to the original comment)

Edit: after commenting I realized that what I did is roughly what was suggested in https://github.com/kubeflow/pipelines/issues/6116#issuecomment-885506281, so I just wanted to give them credit.

juansebashr commented 2 years ago

@jordyantunes Man, you're a genius!! Thank you so much!

With your permission, I made a few changes to your code to also accept input artifacts and input parameters (that part is a little rusty, but I did it in a rush haha), and put it in a repo so anyone can use it as an example. (I'm planning to write a Medium article explaining how to implement CI/CD in Vertex Pipelines, and of course I will mention you :D)

https://github.com/juansebashr/VertexPipelinesCICD

clemgaut commented 9 months ago

In kfp 2.x.x, a few things have changed regarding the executor manipulation. Here is an updated version of the solution proposed in https://github.com/kubeflow/pipelines/issues/6116#issuecomment-1059174206

from kfp import dsl
from kfp.dsl.executor import Executor
import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument("--executor_input", type=str, required=True) # --executor_input instead of --executor-input in 2.x.x

args = parser.parse_args()

# Load executor input
executor_input = json.loads(args.executor_input)

# Create executor
executor = Executor(executor_input, lambda x: x)

# Get output artifacts
metrics: dsl.Metrics = executor.get_output_artifact('metrics') # functions of executor can be used directly in 2.x.x
model: dsl.Model = executor.get_output_artifact('model')

# Log some metric
metrics.log_metric("accuracy", 0.9)

# Save model
with open(model.path, "w") as f:
    f.write("data")

# Save outputs
executor.write_executor_output() # This method is no longer private

To fill in the placeholder value for the --executor_input parameter, you should use {{$}} instead of {executorInput: null}. kfp has conveniently defined a placeholder for that; here is an example of a call to the script above:

from kfp import dsl
# from kfp.dsl.placeholders import ExecutorInputPlaceholder # This import is no longer needed starting from kfp 2.5.0

@dsl.container_component
def custom_component_artifact_manipulation(
    model: dsl.Output[dsl.Model],
    metrics: dsl.Output[dsl.Metrics],
    # Other inputs/outputs if needed
):
    return dsl.ContainerSpec(
        image="<your_base_image>",
        command=[
            "python",
            "example_script.py", # example_script.py contains a code similar to the one in the above block
        ],
        args=[
            "--executor_input",
            dsl.PIPELINE_TASK_EXECUTOR_INPUT_PLACEHOLDER,  # before kfp 2.5.0, ExecutorInputPlaceholder()._to_string() should be used
            # Your other arguments go here
        ],
    )
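
For completeness, a minimal sketch of wiring such a container component into a pipeline and compiling it with kfp 2.x (the pipeline name and output file are illustrative):

from kfp import compiler, dsl

@dsl.pipeline(name="custom-artifact-demo")
def demo_pipeline():
    # the model and metrics outputs are created by the backend and handed to the
    # container through the executor input placeholder shown above
    custom_component_artifact_manipulation()

compiler.Compiler().compile(demo_pipeline, package_path="pipeline.yaml")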

github-actions[bot] commented 4 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 3 months ago

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.