kubeflow / katib

Automated Machine Learning on Kubernetes
https://www.kubeflow.org/docs/components/katib
Apache License 2.0

A metrics collector for Kubeflow Pipeline Metrics artifacts #2019

Open votti opened 1 year ago

votti commented 1 year ago

/kind feature

Describe the solution you'd like:

One current aim is to do parameter tuning over pipelines in Katib (#1914, #1993).

Kubeflow Pipelines allow for dedicated metrics artifacts:
https://kubeflow-pipelines.readthedocs.io/en/master/source/dsl.html?h=metrics#kfp.dsl.Metrics
https://www.kubeflow.org/docs/components/pipelines/v1/sdk/pipelines-metrics/

Having a dedicated Katib sidecar metrics collector that collects the metrics from these artifacts would make Pipelines and Katib work together quite nicely.
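
To illustrate what such a collector would need to read, here is a minimal sketch of parsing a KFP v1 metrics artifact. It assumes the JSON layout described in the KFP v1 pipeline metrics documentation linked above (a top-level `metrics` list whose entries carry `name`, `numberValue` and `format`). The function name `parse_kfp_v1_metrics` is hypothetical and not part of Katib or KFP.

```python
import json


def parse_kfp_v1_metrics(text):
    """Return {metric name: float value} from a KFP v1 metrics JSON document.

    Assumption: the document looks like
    {"metrics": [{"name": ..., "numberValue": ..., "format": ...}, ...]}.
    Entries without a name or a numeric value are skipped.
    """
    doc = json.loads(text)
    result = {}
    for entry in doc.get("metrics", []):
        name = entry.get("name")
        value = entry.get("numberValue")
        if name is None or value is None:
            continue
        result[name] = float(value)
    return result


if __name__ == "__main__":
    # Hypothetical sample content of an mlpipeline_metrics artifact.
    sample = json.dumps({
        "metrics": [
            {"name": "accuracy", "numberValue": 0.93, "format": "PERCENTAGE"},
            {"name": "loss", "numberValue": 0.21, "format": "RAW"},
        ]
    })
    print(parse_kfp_v1_metrics(sample))
```

A sidecar would additionally have to wait until the pipeline step has written the file and then report the values to Katib. Both steps are left out of this sketch.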

The current workaround is to use the stdout collector, but this causes issues with the complex commands in pipeline components (#1914, will add dedicated issue soon).


votti commented 1 year ago

I think I may give this a go - I would try to build this in Python analogous to the tfevent-metricscollector. Does this sound like a reasonable approach? I am also happy for any other suggestion.

votti commented 1 year ago

Small update: I now have a metrics collector for Kubeflow Pipelines v1 (modeled after tfevent-metricscollector) that I think should work and that, according to the logs, already manages to capture the pipeline metrics artifacts.

What I am failing to do is pass the current trial name to the custom collector in the metricsCollectorSpec. Essentially, I am using a configuration very similar to the custom collector example here: https://github.com/kubeflow/katib/blob/master/examples/v1beta1/metrics-collector/custom-metrics-collector.yaml#L13-L35

My CLI metrics collector takes an argument "-t" or "--trial_name" with the trial name to use for reporting (exactly like the tfevent-metricscollector). Does anyone have a hint on how to configure this so that the current trial name is passed as an argument?
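
For reference, the command-line interface described above could look roughly like this sketch. Only the "-t"/"--trial_name" flag comes from the description. The `--metrics_file` flag and the `build_parser` helper are hypothetical additions for illustration.

```python
import argparse


def build_parser():
    """Build the argument parser for a custom metrics collector CLI."""
    parser = argparse.ArgumentParser(
        description="Sketch of a metrics collector command-line interface."
    )
    # Flag taken from the description: the trial name used for reporting.
    parser.add_argument(
        "-t", "--trial_name", required=True,
        help="Name of the Katib trial to report metrics for.",
    )
    # Hypothetical flag: where the pipeline step writes its metrics file.
    parser.add_argument(
        "--metrics_file", default="/tmp/outputs/mlpipeline_metrics/data",
        help="Path of the metrics artifact to read (illustrative default).",
    )
    return parser


if __name__ == "__main__":
    # A real entry point would call parse_args() without arguments so that
    # the container args are read from sys.argv; an explicit list is used
    # here to keep the example self-contained.
    args = build_parser().parse_args(["-t", "example-trial"])
    print(args.trial_name, args.metrics_file)
```

With `required=True`, the collector exits with a usage error if the trial name is missing from the container args, which makes a missing "-t" visible in the sidecar logs.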

votti commented 1 year ago

I am now really a bit confused: reading the source code of the metrics collector sidecar injection (inject_webhook), it looks to me as if the trial name should actually be added to the args: https://github.com/kubeflow/katib/blob/22b740802a06d8926255b204076837d6e344ebb9/pkg/webhook/v1beta1/pod/inject_webhook.go#L302

Yet looking at the pods Katib creates, all these arguments seem to be missing. Is there anything I am missing?

My current section to specify the metrics collector:

  metricsCollectorSpec:
    source:
      fileSystemPath:
        path: "/tmp/outputs/mlpipeline_metrics/data"
        kind: File
    collector:
      customCollector:
        image: votti/kfpv1-metricscollector:v0.0.7
        imagePullPolicy: Always
        name: custom-metrics-logger-and-collector
      kind: Custom

Which creates a specification as:

  - image: votti/kfpv1-metricscollector:v0.0.7
    imagePullPolicy: Always
    name: custom-metrics-logger-and-collector
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /tmp/outputs/mlpipeline_metrics
      name: metrics-volume
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-rnmkw
      readOnly: true

andreyvelich commented 1 year ago

Thank you for working on this @votti! Would it be easier to use push-based metrics collector for such use-cases (ref: https://github.com/kubeflow/katib/issues/577)? Then we don't even need a sidecar to collect metrics.

cc @johnugeorge @gaocegege @tenzen-y

votti commented 1 year ago

I now managed to implement a working metrics collector for Kubeflow Pipeline V1 Metrics artifacts: https://github.com/d-one/katib/tree/feature/kfpv1-metricscollector/cmd/metricscollector/v1beta1/kfpv1-metricscollector

For a full example of how this is used, see: https://github.com/votti/katib-exploration/blob/main/notebooks/mnist_pipeline_v1.ipynb

Regarding the push-based approach: I think it is an interesting idea to build a dedicated Kubeflow Pipelines component that can push metrics to Katib. The challenge I see here is how to pass the current trial_name. Otherwise the component could be built quite similarly to the kfpv1-metricscollector.
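
To make the trial_name dependency concrete, here is a rough sketch of the payload a push-style component would have to assemble. The field names (`trial_name`, `observation_log`, `metric_logs`, `time_stamp`, `metric`) are an assumption modeled on Katib's observation log message and should be checked against the Katib API definition. `build_observation_payload` is a hypothetical helper, and no call to Katib is made.

```python
from datetime import datetime, timezone


def build_observation_payload(trial_name, metrics, timestamp=None):
    """Assemble a dict shaped like an observation log report for one trial.

    Assumption: field names mirror Katib's observation log message; verify
    them against the Katib API before sending anything.
    """
    ts = timestamp or datetime.now(timezone.utc).isoformat()
    return {
        # The trial name is needed here, which is the open question above.
        "trial_name": trial_name,
        "observation_log": {
            "metric_logs": [
                {"time_stamp": ts, "metric": {"name": name, "value": str(value)}}
                for name, value in sorted(metrics.items())
            ]
        },
    }


if __name__ == "__main__":
    payload = build_observation_payload(
        "example-trial", {"accuracy": 0.93, "loss": 0.21}
    )
    print(payload)
```

The sketch shows that everything except `trial_name` can be derived inside the pipeline step itself, so the trial name would have to be injected from outside, for example as a pipeline parameter.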

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

AlexandreBrown commented 1 year ago

Hello, any update for KFP v2?
Cheers!

andreyvelich commented 1 year ago

@AlexandreBrown We've worked on Katib + KFP example in this PR: https://github.com/kubeflow/katib/pull/2118 Any help and review for this PR are appreciated!

AlexandreBrown commented 1 year ago

> @AlexandreBrown We've worked on Katib + KFP example in this PR: https://github.com/kubeflow/katib/pull/2118 Any help and review for this PR are appreciated!

Great to see progress, was this PR made for kfp v2 or only v1?

tenzen-y commented 1 year ago

> Great to see progress, was this PR made for kfp v2 or only v1?

That PR is only for v1.

votti commented 1 year ago

@AlexandreBrown This is based on V1, as I only managed to compile the pipeline as an Argo Workflow manifest in KFP V1. If there is a way to export a KFP V2 pipeline as an Argo Workflow, it should be straightforward to use V2 as well.

github-actions[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

tenzen-y commented 9 months ago

/lifecycle frozen