google / ml-metadata

For recording and retrieving metadata associated with ML developer and data scientist workflows.
https://www.tensorflow.org/tfx/guide/mlmd
Apache License 2.0

What querying capabilities does ml-metadata provide #21

Open bhlarson opened 4 years ago

bhlarson commented 4 years ago

The list of "Functionality Enabled by MLMD" implies the ability to query MLMD, for example "List all Artifacts of a specific type", "query context of workflow runs", and "Show a DAG of all related executions and their input and output artifacts of a context". From the API, I am unclear how to accomplish this functionality without interacting directly with the database, which does not conform to your design.

Could you provide examples of how to achieve this functionality?

Specifically, I would like to identify artifacts, executions, events, or contexts through a query of their properties, and then retrieve the contexts, executions, and artifacts connected to the queried results through the directed acyclic graph (DAG).

Is this possible? How can it be achieved through the API?

metadata_store.py provides the ability to retrieve an entire list (e.g. get_executions) or a single item (e.g. get_contexts_by_id), but I am unclear how to achieve the stated ml-metadata functionality through this interface.

hughmiao commented 4 years ago

hi @bhlarson, the current MLMD API provides high-level template queries, such as listing and filtering by type or id. It also provides low-level graph traversal primitives, such as getting artifacts from a context or getting executions by events. These APIs give you full access to the underlying data model.
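
For instance, a minimal sketch of those template queries against a local SQLite-backed store (the database filename, type name, and id below are hypothetical):

from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

# Connect to a local SQLite-backed store (hypothetical file path).
connection_config = metadata_store_pb2.ConnectionConfig()
connection_config.sqlite.filename_uri = 'mlmd.sqlite'
connection_config.sqlite.connection_mode = (
    metadata_store_pb2.SqliteMetadataSourceConfig.READWRITE_OPENCREATE)
store = metadata_store.MetadataStore(connection_config)

# High-level template queries:
all_artifacts = store.get_artifacts()          # the entire artifact list
models = store.get_artifacts_by_type('Model')  # "all Artifacts of a specific type"
contexts = store.get_contexts_by_id([42])      # look up a context by id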

For example, to show a DAG of all related executions and their input and output artifacts of a context: given the context id, you can start the traversal with get_executions_by_context and get_artifacts_by_context to get the related nodes of the DAG, and then use get_events_by_execution_ids and get_events_by_artifact_ids to look up the related edges of the DAG. Also see the examples and util functions that use the MLMD API to power the notebooks in the tfx tutorial github repo.
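
Concretely, that traversal looks something like the sketch below, reusing the store connection from the previous sketch (the context id is hypothetical):

from ml_metadata.proto import metadata_store_pb2

context_id = 7  # hypothetical: id of a pipeline-run context

# Nodes of the DAG: executions and artifacts attributed to the context.
executions = store.get_executions_by_context(context_id)
artifacts = store.get_artifacts_by_context(context_id)

# Edges of the DAG: events linking executions to input/output artifacts.
events = store.get_events_by_execution_ids([e.id for e in executions])
for event in events:
    direction = metadata_store_pb2.Event.Type.Name(event.type)
    print(f'execution {event.execution_id} --{direction}--> '
          f'artifact {event.artifact_id}')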

There are plans and discussions to add declarative query language layers to MLMD. That will definitely make the interactions simpler than using low-level primitives, but it is not available yet. We also welcome use cases, thoughts, and contributions! :)

Ark-kun commented 4 years ago

It might be very useful to be able to query executions or artifacts based on values of some property.

We need a way to look up Executions by cache keys.
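
Until property-based filtering exists server-side, one workaround is to filter client-side; a sketch under that assumption, reusing a connected store (the type name, property key, and value are hypothetical):

def find_executions_by_property(store, type_name, key, value):
    # store: a connected metadata_store.MetadataStore.
    # Fetch all executions of a type, then filter on a custom property
    # in Python, since the API has no server-side property filter.
    return [
        execution for execution in store.get_executions_by_type(type_name)
        if key in execution.custom_properties
        and execution.custom_properties[key].string_value == value
    ]

# e.g. look up executions by a cache key (hypothetical names/values):
cached_runs = find_executions_by_property(store, 'Trainer', 'cache_key', 'abc123')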

ntakouris commented 4 years ago

Agreed with OP. This is completely undocumented. What @hughmiao described is out of scope; there is simply no documentation or examples on how to view artifacts, for example. I don't want to browse my cloud storage bucket every time I want to manually inspect an output artifact.

Powering TFX is one thing; being able to actually use it to view artifacts and store one's own custom ones is another.
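
For what it's worth, the recorded artifact URIs can already be listed without browsing the bucket; a small sketch, assuming a connected `store` as above (the type name is hypothetical):

def print_artifacts(store, type_name):
    # store: a connected metadata_store.MetadataStore.
    # Print each artifact's id, storage URI, and custom properties
    # for quick manual inspection.
    for artifact in store.get_artifacts_by_type(type_name):
        print(f'id={artifact.id} uri={artifact.uri}')
        for key, value in artifact.custom_properties.items():
            print(f'  {key}: {str(value).strip()}')

print_artifacts(store, 'Model')  # hypothetical type name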

benmathes commented 3 years ago

As the Product Manager for MLMD, I agree with OP, in a sense. Right now the primary ~user of MLMD's querying is the orchestration layer: if we don't write metadata down during training, there won't be anything there to query.

I want to get MLMD to the point where there's a lightweight, composable query language.

(caveat: this is not a roadmap, RFC, or spec; just an example of where we are headed.)

{
  query TFXModel(id: "aslkaj34LJ3") {
    name
    # If sub-querying for a type, default GraphQL does one-hop queries.
    # We would *extend* this to -*> queries along the training DAG.
    Training: {
      duration
      # a recursive GraphQL pattern means the
      # -*> graph query could occur inside the subquery or...
      Dataset: {
        name
        size
      }
    }
    # … the -*> query could occur outside the Training subquery
    Dataset: {
      name
      size
    }
  }
}

Another potential way to think about "lightweight and composable", with a different UX:

# pseudocode: a possible fluent traversal API, not an existing interface
pipeline = mlmd.get(type=Context, id="my_pipeline_id")
model = mlmd.get(type=Model, in=pipeline)
sibling_models = model.getAncestor(type=Dataset).getDescendants(type=Model)
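
Today, such a traversal can be approximated, verbosely, with the existing event primitives; a rough sketch assuming a connected `store` and single-hop lineage (the function names are hypothetical):

from ml_metadata.proto import metadata_store_pb2

def producer_execution_ids(store, artifact_id):
    # Executions that produced this artifact (OUTPUT events pointing at it).
    return [event.execution_id
            for event in store.get_events_by_artifact_ids([artifact_id])
            if event.type == metadata_store_pb2.Event.OUTPUT]

def input_artifact_ids(store, execution_id):
    # Artifacts this execution consumed (its INPUT events).
    return [event.artifact_id
            for event in store.get_events_by_execution_ids([execution_id])
            if event.type == metadata_store_pb2.Event.INPUT]

def sibling_model_ids(store, model_id):
    # One hop up to the input datasets, one hop back down to the other
    # artifacts produced from those datasets.
    siblings = set()
    for execution_id in producer_execution_ids(store, model_id):
        for dataset_id in input_artifact_ids(store, execution_id):
            for event in store.get_events_by_artifact_ids([dataset_id]):
                if event.type == metadata_store_pb2.Event.INPUT:
                    siblings.update(
                        e.artifact_id
                        for e in store.get_events_by_execution_ids(
                            [event.execution_id])
                        if e.type == metadata_store_pb2.Event.OUTPUT)
    siblings.discard(model_id)
    return siblings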

The big puzzles we are iterating through, which make this a lot more work than my former-startup-engineer head would quickly assume, include (but aren't limited to):

  • all the different training DAG structures we have to support
  • getting all the ML infrastructures built on top of us (several internal custom ones) to agree enough on what "model", "dataset", etc. mean
  • supporting multiple backends
  • some large internal-scale performance requirements

smthpickboy commented 3 years ago

> It might be very useful to be able to query executions or artifacts based on values of some property. We need a way to look up Executions by cache keys.

Any update on this? Or does KFP implement query artifacts/executions by properties or custom properties on top of mlmd?

Thanks.

smthpickboy commented 3 years ago

> I want to get MLMD to the point where there's a lightweight, composable query language.

Any progress or plans on this? Thanks.