bhlarson opened this issue 5 years ago
hi @bhlarson, the current MLMD API provides high-level template queries, such as listing and filtering by type or id. It also provides low-level graph-traversal primitives, such as getting the artifacts of a context or getting executions by events. These APIs give you full access to the underlying data model.
For example, to "Show a DAG of all related executions and their input and output artifacts of a context": given the context id, start the traversal with `get_executions_by_context` and `get_artifacts_by_context` to get the related nodes of the DAG, then use `get_events_by_execution_ids` and `get_events_by_artifact_ids` to look up the related edges of the DAG. Also see the examples and utility functions that use the MLMD API to power the notebooks in the TFX tutorial (see the github repo).
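As a rough illustration of that traversal, here is a self-contained sketch that assembles the DAG edges from events. The in-memory lists below are made-up stand-ins for what `get_executions_by_context`, `get_artifacts_by_context`, and `get_events_by_execution_ids` would return; this is not MLMD code itself, only the shape of the bookkeeping:

```python
from collections import namedtuple

# Stand-in for an MLMD Event: which artifact an execution read (INPUT)
# or wrote (OUTPUT).
Event = namedtuple("Event", ["artifact_id", "execution_id", "type"])

# Pretend these came from get_executions_by_context /
# get_artifacts_by_context for one context, followed by
# get_events_by_execution_ids for those executions.
events = [
    Event(artifact_id=1, execution_id=10, type="INPUT"),
    Event(artifact_id=2, execution_id=10, type="OUTPUT"),
    Event(artifact_id=2, execution_id=11, type="INPUT"),
    Event(artifact_id=3, execution_id=11, type="OUTPUT"),
]

def build_dag(events):
    """Turn events into directed edges: artifact -> execution for inputs,
    execution -> artifact for outputs."""
    edges = []
    for e in events:
        if e.type == "INPUT":
            edges.append((("artifact", e.artifact_id),
                          ("execution", e.execution_id)))
        else:
            edges.append((("execution", e.execution_id),
                          ("artifact", e.artifact_id)))
    return edges

dag = build_dag(events)  # a1 -> e10 -> a2 -> e11 -> a3, as four edges
```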
There are plans and discussions about adding a declarative query-language layer for MLMD. It would definitely make interactions simpler than using the low-level primitives, but it is not available yet. We also welcome use cases, thoughts, and contributions! :)
It might be very useful to be able to query executions or artifacts based on values of some property.
We need a way to look up Executions by cache keys.
Agreed with OP. This is completely undocumented. Setting aside what @hughmiao says is out of scope, there is simply no documentation or examples on how to view artifacts. I don't want to browse my cloud storage bucket every time I want to manually inspect an output artifact, for example.
Powering TFX is one good thing; being able to actually use it to view artifacts and store one's custom ones is another.
As the Product Manager for MLMD, I agree with OP, in a sense. Right now the MLMD query surface's primary use is during orchestration: if we don't write it down during training, there won't be anything there to query.
I want to get MLMD to the point where there's a lightweight, composable query language.
(Caveat: this is not a roadmap, RFC, or spec, just an example of where we are headed.)
```
{
  query TFXModel(id: "aslkaj34LJ3") {
    name
    # If sub-querying for a type, a default GraphQL does one-hop queries.
    # We would *extend* this to -*> queries along the training DAG.
    Training: {
      duration
      # A recursive GraphQL pattern means the
      # -*> graph query could occur inside the subquery or...
      Dataset: {
        name
        size
      }
    }
    # ... the -*> query could occur outside the Training subquery
    Dataset: {
      name
      size
    }
  }
}
```
Another potential way to think about "lightweight and composable", with a different UX:
```python
pipeline = mlmd.get(type=context, id="my_pipeline_id")
model = mlmd.get(type=Model, in=pipeline)
sibling_models = model.getAncestor(type=Dataset).getDescendants(type=Model)
```
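To make the chaining concrete, here is a toy sketch of how such a traversal object might behave over an in-memory DAG. Everything here is hypothetical (`NodeSet`, `get_ancestors`, `get_descendants`, and the graph encoding are invented for illustration, not an existing MLMD API):

```python
class NodeSet:
    """A set of (kind, id) nodes supporting chained lineage traversal."""

    def __init__(self, graph, nodes):
        self.graph = graph          # forward edges: {node: set(child nodes)}
        self.nodes = set(nodes)

    def get_descendants(self, kind):
        """All nodes of `kind` reachable by following edges forward."""
        found, stack = set(), list(self.nodes)
        while stack:
            node = stack.pop()
            for child in self.graph.get(node, ()):
                if child not in found:
                    found.add(child)
                    stack.append(child)
        return NodeSet(self.graph, {n for n in found if n[0] == kind})

    def get_ancestors(self, kind):
        """Same traversal on the reversed graph; result keeps the forward
        graph so further chaining goes back down the DAG."""
        reverse = {}
        for parent, children in self.graph.items():
            for child in children:
                reverse.setdefault(child, set()).add(parent)
        up = NodeSet(reverse, self.nodes).get_descendants(kind)
        return NodeSet(self.graph, up.nodes)

# One dataset feeds two trainings, producing two "sibling" models.
graph = {
    ("dataset", 1): {("training", 10), ("training", 11)},
    ("training", 10): {("model", 100)},
    ("training", 11): {("model", 101)},
}

model = NodeSet(graph, {("model", 100)})
siblings = model.get_ancestors("dataset").get_descendants("model")
```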
The big puzzles we are iterating through that make this a lot more work than my former-startup-engineer head would quickly assume include (but aren't limited to):
- all the different training DAG structures we have to support
- getting all the ML infrastructures (several internal custom ones) built on us to agree on enough overlap in what "model", "dataset", etc. mean
- supporting multiple backends
- some demanding internal-scale performance requirements

Any update on this? Or does KFP implement querying artifacts/executions by properties or custom properties on top of MLMD?
Thanks.
Any progress or plans on this? Thanks.
The list of "Functionality Enabled by MLMD" implies the ability to query MLMD: for example, "List all Artifacts of a specific type", "query context of workflow runs", and "Show a DAG of all related executions and their input and output artifacts of a context". From the API, I am unclear how to accomplish this functionality without interacting directly with the database, which does not conform to your design.
Could you provide examples of how to achieve this functionality?
Specifically, I would like to identify artifacts, executions, events, or contexts through a query on their properties, and then retrieve the artifacts, executions, and contexts related to the queried results through the directed acyclic graph (DAG).
Is this possible? How can it be achieved through the API?
metadata_store.py provides the ability to retrieve an entire list (e.g. get_executions) or a single item (e.g. get_contexts_by_id), but I am unclear how to achieve the stated ml-metadata functionality through this interface.
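In the meantime, one workaround under that interface is to fetch the full lists and filter client-side on properties, then follow events to related nodes. A sketch of that pattern, with plain dicts as made-up stand-ins for what get_artifacts() and get_events_by_artifact_ids() return (the property names and values here are invented):

```python
# Pretend this is the full list from get_artifacts().
artifacts = [
    {"id": 1, "type": "DataSet", "properties": {"split": "train"}},
    {"id": 2, "type": "DataSet", "properties": {"split": "eval"}},
    {"id": 3, "type": "Model", "properties": {"f1": 0.91}},
]

# Pretend this came from get_events_by_artifact_ids([...]).
events = [
    {"artifact_id": 1, "execution_id": 10, "type": "INPUT"},
    {"artifact_id": 3, "execution_id": 10, "type": "OUTPUT"},
]

def artifacts_with_property(artifacts, key, value):
    """Filter the full artifact list on a property value client-side,
    since no property-based query is exposed by the API."""
    return [a for a in artifacts if a["properties"].get(key) == value]

def executions_touching(events, artifact_ids):
    """Follow events from the matched artifacts to related executions."""
    ids = set(artifact_ids)
    return sorted({e["execution_id"] for e in events
                   if e["artifact_id"] in ids})

train_sets = artifacts_with_property(artifacts, "split", "train")
related_execs = executions_touching(events, [a["id"] for a in train_sets])
```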