kubeflow / metadata

Repository for assets related to Metadata.

K8s Metadata watcher: Support Executions and Events #241

Closed jlewi closed 2 years ago

jlewi commented 3 years ago

/kind feature

I think it would be great if users could do the following

  1. Create a K8s resource (e.g. TFJob, PyTorchJob, KatibJob, K8s Job, Tekton TaskRun, Tekton PipelineRun, Argo, etc...)
  2. Attach annotations to the resource indicating input and output artifacts
  3. Apply the resource
  4. See the lineage graph in KF metadata
    • Artifacts should show up as artifacts
    • The batch resource (TFJob, TaskRun etc....) should be stored in MLMD as an execution
    • Executions and artifacts should be connected via MLMD events to form the graph (see the sketch after this list)
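
For concreteness, here is a minimal sketch of what step 4 amounts to in the MLMD Python client (ml_metadata). The type names and the in-memory connection config are illustrative assumptions, not what the watcher would have to use.

from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

config = metadata_store_pb2.ConnectionConfig()
config.fake_database.SetInParent()  # in-memory store, for illustration only
store = metadata_store.MetadataStore(config)

# Register hypothetical types for the datasets and the batch resource.
dataset_type_id = store.put_artifact_type(
    metadata_store_pb2.ArtifactType(name="kubeflow.org/alpha/data_set"))
job_type_id = store.put_execution_type(
    metadata_store_pb2.ExecutionType(name="kubeflow.org/alpha/k8s_job"))

# The inputs and outputs show up as artifacts.
input_id, output_id = store.put_artifacts([
    metadata_store_pb2.Artifact(type_id=dataset_type_id, uri="file://path/to/input"),
    metadata_store_pb2.Artifact(type_id=dataset_type_id, uri="file://path/to/output"),
])

# The batch resource (TFJob, TaskRun, etc.) is stored as an execution.
[execution_id] = store.put_executions(
    [metadata_store_pb2.Execution(type_id=job_type_id)])

# Events connect the execution to its artifacts, forming the lineage graph.
store.put_events([
    metadata_store_pb2.Event(artifact_id=input_id, execution_id=execution_id,
                             type=metadata_store_pb2.Event.INPUT),
    metadata_store_pb2.Event(artifact_id=output_id, execution_id=execution_id,
                             type=metadata_store_pb2.Event.OUTPUT),
])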

What we have today

What we need to do

@karlschriek, @aronchick, @Swikar, @Ark-kun WDYT? Anyone have cycles to work on this?

issue-label-bot[bot] commented 3 years ago

Issue-Label Bot is automatically applying the labels:

Label Probability
area/front-end 0.55

Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback!

Swikar commented 3 years ago

I will look at it

Swikar commented 3 years ago

'Attach annotations to the resource indicating input and output artifacts'

Do we want to replace artifacts with executions or keep both? I would assume we replace them with executions, so that the input artifact annotation is just used for lookup - correct?

Can I have more detail or an example of how to set the annotation and artifact? Have we used this somewhere I can have a look?

I have already modified the logging to record executions rather than artifacts.

Thanks

jlewi commented 3 years ago

'Can I have more detail or an example of how to set the annotation and artifact? Have we used this somewhere I can have a look?'

For annotations in general, see https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations/

I imagine we would want to define some schema for the annotations to describe the inputs and outputs. These might just be lists of the JSON representations of some of the MLMetadata artifact schemas.
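
For example (a sketch, assuming the metadata.kubeflow.org/input and metadata.kubeflow.org/output annotation keys used in the job example below), the annotation values could simply be json.dumps of a list of artifact dicts:

import json

# Hypothetical input artifact, modeled on the data_set example below.
input_artifacts = [{
    "apiversion": "v1",
    "category": "artifact",
    "kind": "data_set",
    "name": "mytable-dump",
    "uri": "file://path/to/dataset",
}]

annotations = {
    "metadata.kubeflow.org/input": json.dumps(input_artifacts),
    "metadata.kubeflow.org/output": json.dumps([]),  # outputs filled in analogously
}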

'Do we want to replace artifacts with executions or keep both? I would assume we replace them with executions, so that the input artifact annotation is just used for lookup - correct?'

We want to keep both. The inputs and outputs should be stored as artifacts, and then there should also be an execution representing the actual run.

jlewi commented 3 years ago

So here's a more precise example.

The dataset schema is currently defined here https://github.com/kubeflow/metadata/blob/master/schema/alpha/artifacts/data_set.json

And an example is

{
    "annotations": {
        "mylabel": "l1",
        "tag": "data-set"
    },
    "apiversion": "v1",
    "category": "artifact",
    "create_time": "2018-11-13T20:20:39+00:00",
    "description": "an example data set",
    "id": "123",
    "kind": "data_set",
    "name": "mytable-dump",
    "namespace": "kubeflow.org",
    "owner": "owner@my-company.org",
    "uri": "file://path/to/dataset",
    "version": "v1.0.0",
    "query": "SELECT * FROM mytable"
}

So come up with the JSON representations of two datasets as illustrated above, one for the input and one for the output.

Then create a Kubernetes Job which has annotations containing those datasets; e.g.

apiVersion: batch/v1
kind: Job
metadata:
  name: pi
  annotations:
    metadata.kubeflow.org/input: '[{"apiversion": "v1", "category": "artifact", ...}]'
    metadata.kubeflow.org/output: '[{"apiversion": "v1", "category": "artifact", ...}]'
spec:
  ...

When the job runs we should end up creating 2 artifacts in metadata corresponding to the input and output, and 1 execution corresponding to the job.

If I go to the metadata UI I'd like to see a lineage graph connecting the input to the output via the execution.
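
For illustration, here is a sketch of how a watcher could pull those artifact lists back out of the Job's annotations with the Kubernetes Python client; the real watcher may do this quite differently:

import json
from kubernetes import client, config

config.load_kube_config()
batch_v1 = client.BatchV1Api()

# Read the Job from the example above and recover the artifact lists.
job = batch_v1.read_namespaced_job(name="pi", namespace="default")
annotations = job.metadata.annotations or {}
inputs = json.loads(annotations.get("metadata.kubeflow.org/input", "[]"))
outputs = json.loads(annotations.get("metadata.kubeflow.org/output", "[]"))

# From here the watcher would write the artifacts, the execution, and the
# INPUT/OUTPUT events into MLMD, as sketched earlier in this thread.
print(f"{len(inputs)} input artifact(s), {len(outputs)} output artifact(s)")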

jlewi commented 3 years ago

@swikar I hacked together a rudimentary implementation in #246

That PR contains an example job: https://github.com/kubeflow/metadata/blob/96285b257d999715a88befc659d722c895d7bbc5/watcher/examples/simple_job.yaml

The watcher then produces a lineage graph like the following

[image: lineage graph]

ca-scribner commented 3 years ago

Loving this idea - I've wanted to help people use the lineage explorer / kf metadata, but without making them teach their training/predicting/... processes how to do it. This'll be great.

DataSet, Model, and Metrics all have their own artifact types in kf metadata, and all would be really helpful to log. For Models and Metrics, there is also a convention of putting important values (hyperparameters for Models, accuracy/... for Metrics) in the metadata entry itself, in addition to a URI pointing at the actual file. If we can automate that as well, that would be great.
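
As a sketch of that convention (the field names here are assumptions modeled on the data_set example above, not the actual Metrics schema):

# Hypothetical metrics artifact: a URI pointing at the metrics file, plus
# the important values inlined so they show up in kf metadata directly.
metrics_artifact = {
    "apiversion": "v1",
    "category": "artifact",
    "kind": "metrics",
    "name": "eval-run-1-metrics",
    "uri": "file://path/to/metrics.json",
    "values": {"accuracy": 0.93},  # assumed field, for illustration
}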

Swikar commented 3 years ago

This is awesome. Let me dig into it

Jeffwan commented 3 years ago

The way to use MLMD might be different for TFJob, PyTorchJob, and KatibJob.

The batch resource (TFJob, TaskRun etc....) should be stored in MLMD as an execution

Do we want to use all the data type from MLMD or a subset?

We may use it in a different way; for example, create a context for each TFJob definition and an execution for each TFJob run. I think most jobs support backoff, and each retry can be a different execution. We need extra work to create attributions to link artifacts back to the context.
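
Here is a sketch of that layering with the MLMD Python client; the type and entity names are illustrative assumptions:

# Sketch: one context per TFJob definition, one execution per run, plus an
# attribution linking an output artifact back to the context.
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

config = metadata_store_pb2.ConnectionConfig()
config.fake_database.SetInParent()  # in-memory store, for illustration only
store = metadata_store.MetadataStore(config)

context_type_id = store.put_context_type(
    metadata_store_pb2.ContextType(name="kubeflow.org/alpha/tf_job"))
run_type_id = store.put_execution_type(
    metadata_store_pb2.ExecutionType(name="kubeflow.org/alpha/tf_job_run"))
dataset_type_id = store.put_artifact_type(
    metadata_store_pb2.ArtifactType(name="kubeflow.org/alpha/data_set"))

[context_id] = store.put_contexts(
    [metadata_store_pb2.Context(type_id=context_type_id, name="mnist-train")])
[run_id] = store.put_executions(
    [metadata_store_pb2.Execution(type_id=run_type_id)])
[artifact_id] = store.put_artifacts(
    [metadata_store_pb2.Artifact(type_id=dataset_type_id, uri="file://out")])

# Associations tie each run (execution) to the TFJob's context; attributions
# tie artifacts back to it - the extra work mentioned above.
store.put_attributions_and_associations(
    attributions=[metadata_store_pb2.Attribution(
        context_id=context_id, artifact_id=artifact_id)],
    associations=[metadata_store_pb2.Association(
        context_id=context_id, execution_id=run_id)],
)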

For jobs like Katib, it's more complicated; we probably need to come up with a mapping from StudyJob, Trial, and Experiment (more layers) to MLMD concepts. We can start from a coarse-grained implementation.

Another thing I'd like to discuss: I know there's a discussion on Kubeflow native applications. https://docs.google.com/document/d/1ss9pdUODTRq6SbkQz_ctCBVM9I2uMA7eVO7lv7YuKR4/edit I think that would be a better way to integrate input & output artifacts. I know it won't be there in the short term, so I think we can start with annotations.

ca-scribner commented 3 years ago

By providing a watcher that can spot annotations and create real MLMD objects from them, we give developers (or users) of the different products the freedom to make their own decisions. So someone writing the TFJob launch code could automatically add the metadata that creates the relevant MLMD objects, to their liking.