sm-hawkfish opened this issue 4 years ago
Thanks @sm-hawkfish
Interesting topic. We were discussing internally whether YAML or decorated Python functions are a better way to provide a shared library of components.
We are running into a couple of challenges with YAML files. On the other hand, as others have said, consuming the YAML files from a code repository is much easier than consuming them from e.g. a Python package on PyPI.
A feature that generates the YAML specifications from Python functions would potentially offer the flexibility of Python functions and the ease of consumption of text files. It would also keep CI/CD very flexible. So definitely interested.
Thanks for summarizing all of these in an issue!
/cc @Ark-kun @numerology who have the best knowledge here.
Commenting on some of the statements for now.
By contrast, the YAML Component specification requires engineers to learn a new syntax and is generally more verbose than the Python equivalent
I think that the component.yaml is the minimal description of the component. Your example shows that it has fewer lines than the Python version. It also supports more features than the legacy @component decorator.
(especially when it comes to writing custom type definitions using the OpenAPI Schema)
Do you have any use for those OpenAPI schemas? Those schemas, as well as kfp.dsl.types, are mostly deprecated. Why would you use input1: kfp.dsl.types.String() instead of the pythonic input1: str? The latter results in a more compact and better-supported component.yaml.
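For illustration (a sketch only; both forms below describe the same hypothetical component):

```python
from kfp import dsl
import kfp.dsl.types


# Verbose, mostly deprecated: type objects from kfp.dsl.types
@dsl.component
def dummy_op_legacy(input1: kfp.dsl.types.String()):
    return dsl.ContainerOp(
        name='Dummy op',
        image='dummy-image',
        command=['python', 'runner.py', '--input1', input1],
    )


# Equivalent and more compact: a plain Python annotation
@dsl.component
def dummy_op(input1: str):
    return dsl.ContainerOp(
        name='Dummy op',
        image='dummy-image',
        command=['python', 'runner.py', '--input1', input1],
    )
```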
Components vs. ContainerOp:
A component consists of an interface (input and output specifications) and an implementation (currently container and graph are supported).
ContainerOp objects are not really components. They are semi-resolved Task objects and do not contain the whole information. Given a ContainerOp produced by giving arguments to a component, you cannot restore the component; the information is lost. For example, ContainerOp does not really have a concept of inputs. When you pass arguments to dummy_op, they're injected directly into the command, from which they cannot be recovered. The @component decorator was added as a hack to try to preserve some of that information, but it only preserves the input types; other information is still lost. This is why we really discourage users from creating ContainerOp objects directly: it takes the same amount of effort to write a component.yaml file, which gives you a real reusable component that can be shared between pipelines and users.
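To illustrate (a hypothetical dummy_op factory, sketched only for this discussion):

```python
from kfp import dsl


def dummy_op(input1, input2):
    # The arguments are formatted straight into the command line. The resulting
    # ContainerOp has no record of 'input1'/'input2' as named, typed inputs, so
    # the original component interface cannot be reconstructed from it.
    return dsl.ContainerOp(
        name='Dummy op',
        image='dummy-image',
        command=['python', 'runner.py', '--input1', input1, '--input2', input2],
        file_outputs={'output1': '/tmp/outputs/output1'},
    )
```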
I think we already support the feature that you want. The structures in kfp.components.structures allow you to build your component specification using Python. The code is about the same size as the ContainerOp instantiation, while creating a real component. The specification can then be saved to component.yaml.
```python
from kfp.components.structures import *

component_spec = ComponentSpec(
    name='Dummy op',
    description='Dummy component for illustrative purposes',
    inputs=[
        InputSpec(name='input1', type='String'),
        InputSpec(name='input2', type='GCSPath'),
    ],
    outputs=[
        OutputSpec(name='output1', type='GCSPath'),
    ],
    implementation=ContainerImplementation(container=ContainerSpec(
        image="dummy-image",
        command=[
            "python", "runner.py",
            "--input1", InputValuePlaceholder('input1'),
            "--input2", InputPathPlaceholder('input2'),
            "--output1", OutputPathPlaceholder('output1'),
        ],
    ))
)

component_spec.save('dummy.component.yaml')
```
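The saved file can then be consumed like any other reusable component, e.g. (illustrative sketch):

```python
import kfp.components as comp

# Load the reusable component back from the generated file...
dummy_op = comp.load_component_from_file('dummy.component.yaml')


# ...and use it in a pipeline like any other component.
def my_pipeline():
    dummy_task = dummy_op(input1='hello', input2='gs://my-bucket/data.txt')
```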
What do you think?
Thank you for the quick and detailed reply @Ark-kun ! The code snippet you provided answers my initial question, although I now have a follow-up :)
To take a step back, a large part of my motivation for wanting to write the Components in Python is so that I could extend the types in kfp.dsl.types with my own types. By using Python, we could use a variety of existing tools like mypy and pydantic to validate Pipeline parameters before submitting. The goal of this would be to cut down on the iterative loop of development.
Additionally, these tools would allow us to easily define complex types, so that we could pass an object of related Pipeline params to the pipeline function, rather than denormalizing them into lots of Pipeline params with primitive types. For example, suppose I want to create a component to launch a Katib Hyperparameter tuning job: it seems convenient to have a single pipeline parameter katib-objective, which is an object containing all of the fields outlined in the ObjectiveSpec detailed here and here. And/or each Component could have its own complex type, to make the Pipeline function input signature cleaner.
I was actually just preparing some code snippets to open a separate GitHub Issue demonstrating how the kfp.dsl.types system could be revamped using something like pydantic, but I did not know that that file is considered largely deprecated. I'll share the basic idea here:
```python
from typing import List
from typing import NamedTuple

import kfp
from kfp import dsl
from pydantic import BaseModel
from pydantic import Field


class SimpleParam(BaseModel):
    field1: str
    field2: float


class ComplexParam(BaseModel):
    field1: str = Field(..., regex="^gs://.*$")
    field2: float = Field(..., ge=0, le=1)
    field3: List[int]


Component1Outputs = NamedTuple("Component1Outputs", [("output", SimpleParam)])


@dsl.component
def component1(arg: ComplexParam) -> Component1Outputs:
    output = "/tmp/python_dummy_op/output1_path"
    return dsl.ContainerOp(
        name="Dummy op",
        image="dummy-image",
        command=["python3", "runner.py", "--input", arg, "--output", output],
        file_outputs={"output": output},
    )


@dsl.component
def component2(arg1: ComplexParam, arg2: SimpleParam):
    return dsl.ContainerOp(
        name="Dummy op",
        image="dummy-image",
        command=["python3", "runner.py", "--input1", arg1, "--input2", arg2],
    )


@dsl.pipeline()
def my_pipeline(arg: ComplexParam):
    component1_task = component1(arg=arg)
    component2(arg1=arg, arg2=component1_task.outputs["output"])


if __name__ == "__main__":
    arg = ComplexParam(
        field1="gs://my_bucket/hello_world.txt", field2=0.5, field3=["1", "2"]
    )

    client = kfp.Client()
    run = client.create_run_from_pipeline_func(
        my_pipeline, arguments={"arg": arg}, run_name="Test custom types"
    )
```
I removed some hacky code from the above example that makes the pydantic Models backwards compatible with kfp.dsl.types.BaseType, which I'd be happy to share if you are interested. Fully implemented, the code would allow for the following workflow:

- mypy statically validates that the arg being passed to the pipeline function is actually a ComplexParam
- pydantic validates that the arg conforms to the ComplexParam Model schema (and performs common-sense type casting automatically)
- component1 outputs a SimpleParam and component2 expects a SimpleParam, so the connection between them can be type-checked

There are still some open questions:

- Where the serialization of these complex types should occur
- Relationship between this work and any work to validate / submit Pipelines from the Web UI (note that Pydantic models do define an OpenAPI Schema that could be shipped elsewhere)

The reason I said your response prompted a follow-up question is that it does not look like InputSpec could accept the complex types that I created in the above code snippet. I'd love to hear your input on this (and would be happy to open a separate Issue if you would prefer to talk about types/validation elsewhere).
Hi @Ark-kun, I know that you are very busy with other Issues and Pull Requests, but I want to keep this on your radar. Are there ideas or concepts in my previous post that you would be interested in discussing further?
I want to note some core KFP aspects:

- At a low conceptual level, KFP orchestrates containerized command-line programs. The "command-line programs" part is important: it helps users understand the limitations and the solutions. KFP does not orchestrate Python or Java classes. KFP does not pass in-memory objects between running programs. KFP passes data, serialized as bytes or strings.
- KFP needs to be portable, language-agnostic and platform-agnostic. Users can still use Python-specific serialization formats like Pickle, but they should understand that this has negative portability implications: a Java-based downstream component won't be able to read the pickled data.
- KFP components are described by the ComponentSpec class and the corresponding component.yaml serialization. This is the source of truth regarding components. All other component creation methods build on that, and any new high-level component feature should be built on top of that structure. The structure is pretty flexible, so this is usually not a problem. For example, Python-based components are still built on top of ComponentSpec and ContainerSpec.
Let's start with the untyped world. Components exchange pieces of data (blobs). Why would the user want to specify the types for their component inputs and outputs? I see several reasons:
1) Compile-time reference argument compatibility checking. This feature prevents passing outputs of one type to an input with another type.
2) Compile-time constant argument value checking. This feature prevents passing objects of one type to an input with another type.
3) Visualization. The UX might visualize data of certain types based on the type information.
KFP components support type specifications. The type specification is essentially a dict (and the values can also be strings, dicts or lists). The system is very flexible and allows specifying arbitrary user types. (You should not confuse types and objects.)
Regarding (1): KFP already supports this. If both the input and the argument have types, then the system immediately checks that the types are the same when you pass an output to an input. There is no need for any additional tools. Currently the type compatibility check simply compares the two type specifications (dicts).
Regarding (2): KFP has some support for this. There is a limited set of types (str/String, int/Integer, float/Float, bool/Boolean, list/JsonArray, dict/JsonObject) for which constant argument values are checked against the input type and serialized automatically. Values of other types must be manually serialized before passing them to the component.
To take a step back, a large part of my motivation for wanting to write the Components in Python is so that I could extend the types in kfp.dsl.types with my own types.
I think that you do not need kfp.dsl.types to declare your own custom types. You can use an arbitrary type name or type structure (dict). You can even use an object that has a to_dict method, although we do not support this (that is actually how the types in kfp.dsl.types are implemented: they just return a dict, that's it).
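For example (an illustrative sketch; the type name katib.ObjectiveSpec and its structure here are arbitrary, not something the SDK knows about):

```python
from kfp.components.structures import (
    ComponentSpec, ContainerImplementation, ContainerSpec,
    InputSpec, InputValuePlaceholder,
)

component_spec = ComponentSpec(
    name='Katib experiment launcher',
    inputs=[
        # The type is just an arbitrary name or dict-shaped structure; the SDK
        # treats it as an opaque identifier when checking compatibility.
        InputSpec(
            name='objective_spec',
            type={'katib.ObjectiveSpec': {'data_format': 'JSON'}},
        ),
    ],
    implementation=ContainerImplementation(container=ContainerSpec(
        image='dummy-image',
        command=['python', 'launcher.py',
                 '--objective-spec', InputValuePlaceholder('objective_spec')],
    )),
)
```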
By using Python, we could use a variety of existing tools like mypy and pydantic to validate Pipeline parameters before submitting.
KFP already validates the types even before submission. If both the output and the input are typed, you'll get an error when you try to pass an output to an incompatible input. There is no need to integrate any external tools.
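A small sketch of that built-in check (the two component texts below are made up, and the exact exception class may vary between SDK versions):

```python
import kfp
import kfp.components as comp

producer = comp.load_component_from_text("""
name: Produce model
outputs:
- {name: model, type: MyModel}
implementation:
  container:
    image: dummy-image
    command: [sh, -c, 'echo model > "$0"', {outputPath: model}]
""")

consumer = comp.load_component_from_text("""
name: Consume table
inputs:
- {name: table, type: MyTable}
implementation:
  container:
    image: dummy-image
    command: [echo, {inputValue: table}]
""")


def my_pipeline():
    # Passing a 'MyModel' output into a 'MyTable' input fails the SDK's
    # type-compatibility check during compilation; no external tools needed.
    consumer(table=producer().outputs['model'])


kfp.compiler.Compiler().compile(my_pipeline, 'pipeline.yaml')  # raises a type error
```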
Additionally, these tools would allow us to easily define complex types, so that we could pass an object of related Pipeline params to the pipeline function, rather than denormalizing them into lots of Pipeline params with primitive types.
I think this is a misconception. Just because you can use some Python class as a KFP type, it does not mean you can pass an object of that class to some component. KFP orchestrates containerized command-line programs. You cannot pass in-memory objects. At some point they must be serialized, sent over the network as bytes, and then maybe deserialized by some code.
In the KFP team we try to keep the API surface of the SDK minimal, so we only support automatic serialization of 6 primitive types. Everything else must be serialized by the pipeline code and deserialized by the component code. Remember that KFP runs arbitrary containerized command-line programs. In general the containers do not have the KFP SDK or even Python installed. A Java program won't automatically understand a Python memory object.
each Component could have its own complex type, to make the Pipeline function input signature cleaner.
Sure. The component code can already do whatever it wants, and the component author can specify any structure describing the type. The SDK does not peek inside it, though: type specifications are opaque identifiers and containers are black boxes.
For example, suppose I want to create a component to launch a Katib Hyperparameter tuning job, it seems convenient to have a single pipeline parameter katib-objective, which is an object containing all of the fields outlined in the ObjectiveSpec detailed here and here.
You can easily do that. Declare a single input. Optionally, give it some type like katib.ObjectiveSpec. When building a pipeline, construct an object of that type and serialize it to a string (e.g. as JSON) before passing it to the component. (If using JSON you might call the type {JsonObject: {data_type: katib.ObjectiveSpec}}, but that won't change much.)
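A sketch of that pattern (the component file path and the field values are illustrative):

```python
import json

import kfp.components as comp

# Hypothetical reusable component with a single 'objective_spec' input.
katib_op = comp.load_component_from_file('katib_launcher/component.yaml')


def my_pipeline():
    objective_spec = {
        'type': 'maximize',
        'goal': 0.99,
        'objectiveMetricName': 'roc_auc',
    }
    # Serialize on the submitter's machine; the component receives a plain JSON
    # string on its command line and deserializes it itself.
    katib_op(objective_spec=json.dumps(objective_spec))
```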
it does not look like InputSpec could accept the complex types that I created in the above code snippet.
I think it can. You can convert your complex type specifications to a JSON-like structure and use it directly. Even if InputSpec only supported a type name (a single string), you could still serialize an arbitrary type specification to that string (e.g. using JSON).
Comments on some of the samples:
```python
def component2(arg1: ComplexParam, arg2: SimpleParam):
    return dsl.ContainerOp(
        name="Dummy op",
        image="dummy-image",
        command=["python3", "runner.py", "--input1", arg1, "--input2", arg2],
    )

...

arg = ComplexParam(
    field1="gs://my_bucket/hello_world.txt", field2=0.5, field3=["1", "2"]
)
```
So, what is the resolved command-line supposed to be?
```python
Component1Outputs = NamedTuple("Component1Outputs", [("output", SimpleParam)])


@dsl.component
def component1(arg: ComplexParam) -> Component1Outputs:
    output = "/tmp/python_dummy_op/output1_path"
    return dsl.ContainerOp(
        name="Dummy op",
        image="dummy-image",
        command=["python3", "runner.py", "--input", arg, "--output", output],
        file_outputs={"output": output},
    )
```
I'm writing a Java-based component. How can I read the output of your component?
There are still some open questions: Where the serialization of these complex types should occur Relationship between this work and any work to validate / submit Pipelines from the Web UI (note that Pydantic models do define an OpenAPI Schema that could be shipped elsewhere).
Yes. These are the questions that have shaped the SDK's API surface regarding types. This is why the SDK only supports serialization of 6 primitive types, and everything else is the responsibility of the pipeline and component authors.
Where the serialization of these complex types should occur
Serialization is custom code. There are only two places where custom code is executed - inside the launched containers in the cloud and on the pipeline author's/submitter's machine. Since the serialization must occur before the container can be launched, this only leaves the pipeline author's/submitter's machine. The complex objects must be serialized before the pipeline can be compiled or submitted. And this is what the SDK expects at this moment.
Relationship between this work and any work to validate / submit Pipelines from the Web UI (note that Pydantic models do define an OpenAPI Schema that could be shipped elsewhere).
There were some plans for this. It was the reason why those openapi_schema_validator schemas were added. The SDK makes the whole ComponentSpec available to the UX. However, the feature has not been implemented on the UX side. It would be useful to have.
@sm-hawkfish What do you think?
Hi @Ark-kun , I apologize for the extended delay -- I wanted to make sure I had a chance to review your comments and go through the code-base in more detail before responding.
For one, I have come to agree with you that using func_to_component_spec is not really a "shortcut", since it's a similar amount of work to writing the Component Spec in YAML. In your response, you also mentioned some aspects of the KFP methodology (specifically in regard to components being containerized CLI programs) that I never meant to question, so to clarify: the goal of this Issue is only to discuss possible improvements to the Python SDK that make it easier for Component authors to create new re-usable components using idiomatic Python and for Pipeline authors to get rich, compile-time type validation.
To that end, I found the existing Python decorator for lightweight components to be very inspirational: as you well know, create_component_from_func uses the type annotations in the signature of the component function in order to generate a component.yaml specification. This strikes me as much better than either writing the component.yaml directly or writing a func_to_component_spec, since the specification is created from the component implementation itself.
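For anyone following along, a minimal sketch of that flow (the add function below is just an illustration, not one of our components):

```python
from kfp.components import create_component_from_func


def add(a: float, b: float) -> float:
    """Adds two numbers."""
    return a + b


# The annotations in the signature become typed inputs/outputs in the generated
# spec, and the function body is embedded in the resulting component.yaml.
add_op = create_component_from_func(
    add, base_image="python:3.8", output_component_file="add.component.yaml"
)
```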
For all of its benefits, there are a few drawbacks to create_component_from_func as it pertains to creating re-usable components:

- The function's source code is copied into the component.yaml, limiting the complexity of the component implementation (and requiring users to make all imports within the function)
- The types written to the component.yaml are limited in their ability to validate user input, since they are string names like 'JsonObject', as opposed to OpenAPI schema definitions.

In an attempt to supercharge create_component_from_func, I have made some local modifications to the KFP Python SDK and incorporated the libraries Pydantic and Typer. I will provide some snippets below on how this looks, and am happy to provide additional detail (or contribute) if you are interested in the approach.
I have taken as an example a KFP Component that submits a Katib Hyperparameter Tuning experiment:
We create a file src/katib_specifications.py, with Pydantic models that mirror the Katib specs defined here:
```python
from enum import Enum
from typing import List

from pydantic import BaseModel


class ObjectiveType(str, Enum):
    unknown = ""
    minimize = "minimize"
    maximize = "maximize"


class ObjectiveSpec(BaseModel):
    type: ObjectiveType
    goal: float = None
    objectiveMetricName: str
    additionalMetricNames: List[str] = None
```
There are many more specs, but this should give a feel for the syntax.
From there, we create src/component.py, which contains the function that will be the component entrypoint:
```python
from typing import Dict
from typing import List
from typing import NamedTuple
from typing import Union

import typer

from .katib_specifications import ObjectiveSpec


class Outputs(NamedTuple):
    best_hyperparameters: Dict[str, Union[str, float, int]]


def katib_hyperparameter_tuning(
    data_dir: str = typer.Option(
        ..., help="The GCS directory containing training data"
    ),
    objective_spec: ObjectiveSpec = typer.Option(
        ...,
        help="The Katib Objective to optimize",
    ),
) -> Outputs:
    """
    Distributed Hyperparameter Tuning (and Neural Architecture Search) using Kubeflow Katib. The
    upstream documentation on available hyperparameter search algorithms is available here:
    https://www.kubeflow.org/docs/components/hyperparameter-tuning/experiment/#search-algorithms-in-detail
    """
    # Implementation - `objective_spec` can be used like a namedtuple or turned into a dictionary via objective_spec.dict()
    best_hyperparameters = {"max_depth": 4}
    return Outputs(best_hyperparameters=best_hyperparameters)


if __name__ == "__main__":
    typer.run(katib_hyperparameter_tuning)
```
The idea is that Typer can create a CLI for you automatically using the type annotations in the function. The input provided to the component will be JSON strings, so the KFP infrastructure doesn't need to know or care about this:
python -m src.component --data-dir gs://my-bucket/my-training-data/ --objective-spec '{"type": "maximize", "objectiveMetricName": "roc_auc", "additionalMetricNames": ["accuracy"]}'
And Typer will also do the work of casting the input types into the type declared by the annotation. In the case above, something like this would be run by Typer behind the scenes:
```python
objective_spec = ObjectiveSpec.parse_raw(
    '{"type": "maximize", "objectiveMetricName": "roc_auc", "additionalMetricNames": ["accuracy"]}'
)
```
I am cheating here a little bit, as I did need to make slight modifications to Typer in order to parse these JSON inputs into Pydantic models, which is outlined in this issue.
Since Pydantic Models can output an OpenAPI schema, I just needed to make some adjustments to extract_component_interface and annotation_to_type_struct in the KFP SDK in order to turn the type annotations into OpenAPI schema definitions.
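Roughly speaking (this is just a sketch of the idea, not the exact code), the change teaches annotation_to_type_struct to recognize Pydantic models and attach their schema:

```python
from pydantic import BaseModel


def annotation_to_type_struct(annotation):
    """Sketch: map a Pydantic model annotation to a type spec carrying its OpenAPI schema."""
    if isinstance(annotation, type) and issubclass(annotation, BaseModel):
        type_name = f"{annotation.__module__}.{annotation.__name__}"
        return {type_name: {"openapi_schema_validator": annotation.schema()}}
    # ...otherwise fall back to the existing handling of str, int, float, etc.
    return getattr(annotation, "__name__", str(annotation))
```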
Here's an example of how to get the schema:
```python
import json

from .katib_specifications import ObjectiveSpec

print(json.dumps(ObjectiveSpec.schema(), indent=2))
```
```json
{
  "title": "ObjectiveSpec",
  "type": "object",
  "properties": {
    "type": {
      "$ref": "#/definitions/ObjectiveTypes"
    },
    "goal": {
      "title": "Goal",
      "type": "number"
    },
    "objectiveMetricName": {
      "title": "Objectivemetricname",
      "type": "string"
    },
    "additionalMetricNames": {
      "title": "Additionalmetricnames",
      "type": "array",
      "items": {
        "type": "string"
      }
    }
  },
  "required": [
    "type",
    "objectiveMetricName"
  ],
  "definitions": {
    "ObjectiveTypes": {
      "title": "ObjectiveTypes",
      "description": "An enumeration.",
      "enum": [
        "",
        "minimize",
        "maximize"
      ],
      "type": "string"
    }
  }
}
```
The end result is the following component specification:
```yaml
name: Katib hyperparameter tuning
description: |-
  Distributed Hyperparameter Tuning (and Neural Architecture Search) using Kubeflow Katib. The
  upstream documentation on available hyperparameter search algorithms is available here:
  https://www.kubeflow.org/docs/components/hyperparameter-tuning/experiment/#search-algorithms-in-detail
inputs:
- name: data_dir
  type:
    String:
      openapi_schema_validator: {type: string}
  description: The GCS directory containing training data
- name: objective_spec
  type:
    src.katib_specifications.ObjectiveSpec:
      openapi_schema_validator:
        type: object
        properties:
          type: {$ref: '#/definitions/ObjectiveTypes'}
          goal: {title: Goal, type: number}
          objectiveMetricName: {title: Objectivemetricname, type: string}
          additionalMetricNames:
            title: Additionalmetricnames
            type: array
            items: {type: string}
        required: [type, objectiveMetricName]
        definitions:
          ObjectiveTypes:
            title: ObjectiveTypes
            description: An enumeration.
            enum: ['', minimize, maximize]
            type: string
  description: The Katib Objective to optimize.
outputs:
- name: best_hyperparameters
  type:
    Dict[str, Union[str, float, int]]:
      openapi_schema_validator:
        type: object
        additionalProperties:
          anyOf:
          - {type: string}
          - {type: number}
          - {type: integer}
implementation:
  container:
    image: DUMMY_IMAGE
    args:
    - --data-dir
    - {inputValue: data_dir}
    - --objective-spec
    - {inputValue: objective_spec}
    - '----output-paths'
    - {outputPath: best_hyperparameters}
```
You can see I kept the key openapi_schema_validator similar to kfp.dsl.types, so that existing features like DSL Type Checking would continue to work as expected.
Generating the OpenAPI schema has a couple of nice benefits:
I wrote a very simple script to parse the component.yaml file and generate a static documentation site using redoc. This provides a nice reference for data scientists who are getting familiar with the inputs that each component expects.
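I have not included the script here, but the idea is roughly the following (a sketch only; the use of redoc-cli and the file names are illustrative):

```python
import json
import subprocess

import yaml

# Read the generated spec and lift each input's OpenAPI schema into a single
# OpenAPI document, then render static HTML with redoc-cli.
# (Local $ref targets such as '#/definitions/...' would need remapping in a real script.)
with open("component.yaml") as f:
    component = yaml.safe_load(f)

schemas = {}
for input_spec in component.get("inputs", []):
    type_spec = input_spec.get("type")
    if isinstance(type_spec, dict):
        (_, type_body), = type_spec.items()
        schemas[input_spec["name"]] = type_body["openapi_schema_validator"]

openapi_doc = {
    "openapi": "3.0.0",
    "info": {
        "title": component["name"],
        "description": component.get("description", ""),
        "version": "1.0",
    },
    "paths": {},
    "components": {"schemas": schemas},
}

with open("openapi.json", "w") as f:
    json.dump(openapi_doc, f, indent=2)

subprocess.run(["redoc-cli", "bundle", "openapi.json", "-o", "docs.html"], check=True)
```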
In addition, I wrote a small function validate_pipeline that will validate the inputs against the OpenAPI schema in the Component Spec.
The validation code looks something like:
```python
from typing import Any

from kfp.components.structures import InputSpec
from openapi_schema_validator import OAS30Validator


def validate_component_input(input_spec: InputSpec, input_arg: Any):
    schema = list(input_spec.type.values())[0]["openapi_schema_validator"]
    validator = OAS30Validator(schema)
    validator.validate(input_arg)
```
This allows data scientists to get feedback on their pipeline arguments at compile time (note the objective spec type below is "maximization" instead of "maximize"):
```python
import kfp
from kfp.components import ComponentStore

DATA_DIR = "gs://my-bucket/my-training-data/"  # example value

KATIB_OBJECTIVE_SPEC = {
    "type": "maximization",
    "objectiveMetricName": "roc_auc",
    "additionalMetricNames": ["accuracy"],
}

component_store = ComponentStore()
hyperparameter_op = component_store.load_component("hyperparameter_tuning")


# Define a pipeline and create a task from a component:
@kfp.dsl.pipeline(
    name="Train Model", description="Train model",
)
def my_pipeline(
    data_dir=DATA_DIR,
    katib_objective_spec=KATIB_OBJECTIVE_SPEC,
):
    hyperparameter_op(
        data_dir=data_dir,
        objective_spec=katib_objective_spec,
    )


if __name__ == "__main__":
    # validate_pipeline is the helper described above; it walks the pipeline's
    # arguments and calls validate_component_input for each typed input.
    validate_pipeline(my_pipeline)
    kfp.compiler.Compiler().compile(my_pipeline, "/tmp/pipeline.tar.gz")
```
When the user runs this script, they will get:
```
jsonschema.exceptions.ValidationError: 'maximization' is not one of ['', 'minimize', 'maximize']

Failed validating 'enum' in schema['properties']['type']:
    OrderedDict([('title', 'ObjectiveTypes'),
                 ('description', 'An enumeration.'),
                 ('enum', ['', 'minimize', 'maximize']),
                 ('type', 'string'),
                 ('nullable', False)])

On instance['type']:
    'maximization'
```
And can easily correct the typo before submitting the pipeline run to the cluster.
As stated above, I would be very happy to go into more detail on any of the above steps if you find aspects of this approach interesting.
And congratulations on the 1.0.0 release!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/freeze
/lifecycle frozen
Hi everyone,
Using the kfp.dsl.component decorator to define KFP Components is a great user experience, since it leverages standard features of the Python language like type annotations and docstring inspection. By contrast, the YAML Component specification requires engineers to learn a new syntax and is generally more verbose than the Python equivalent (especially when it comes to writing custom type definitions using the OpenAPI Schema).
That said, one advantage of the YAML spec that was raised in the Slack channel is around the ease of distributing a readily parseable file format to a variety of different client applications.
Since both formats have their pros and cons, the purpose of this issue is to discuss the value and feasibility of enhancing the KFP Python SDK to support generating the YAML specification from the Python DSL component definition. This would give users the best of both worlds by allowing them to define components comfortably in Python and to ship Component specifications to downstream clients in YAML.
By way of example, consider the following (equivalent) specifications of a Dummy op. When compiled into Pipelines, the Component Metadata is nearly identical.
CC'ing @Bobgy and @eterna2 since they were involved in the Slack discussion as well (I was unable to find Lida Li's Github username).