kubeflow / katib

Automated Machine Learning on Kubernetes
https://www.kubeflow.org/docs/components/katib
Apache License 2.0
1.49k stars 440 forks source link

[GSoC] Add New Parameter in `tune` #2369

Closed Electronic-Waste closed 2 months ago

Electronic-Waste commented 3 months ago

What this PR does / why we need it:

I add a new parameter metrics_collector_config to tune function. Design details: https://github.com/kubeflow/katib/blob/cc95ef03cb25df3d86a1a1c20c9c69bad17fce92/docs/proposals/push-based-metrics-collection.md#add-new-parameter-in-tune

Which issue(s) this PR fixes _(optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged)_: Fixes #

Checklist:

Electronic-Waste commented 3 months ago

ref issue: #2340

Electronic-Waste commented 3 months ago

PTAL👀 @andreyvelich @johnugeorge

Electronic-Waste commented 2 months ago

@andreyvelich I add my code here: https://github.com/kubeflow/katib/blob/db17214cf02c8d4eb9dafc21a15ffb7af4596015/pkg/webhook/v1beta1/pod/inject_webhook.go#L140-L142

// Pass env variable KATIB_TRIAL_NAME to training containers using fieldPath.
for idx := range mutatedPod.Spec.Containers {
    if mutatedPod.Spec.Containers[idx].Env == nil {
        mutatedPod.Spec.Containers[idx].Env = []v1.EnvVar{}
    }
    mutatedPod.Spec.Containers[idx].Env = append(
        mutatedPod.Spec.Containers[idx].Env, 
        v1.EnvVar{
            Name: consts.EnvTrialName,
            ValueFrom: &v1.EnvVarSource{
                FieldRef: &v1.ObjectFieldSelector{
                    FieldPath: fmt.Sprintf("metadata.labels['%s']", consts.LabelTrialName),
                },
            },
        },
    )
}

PTAL👀. Thank you for your review!

Electronic-Waste commented 2 months ago

/area gsoc

andreyvelich commented 2 months ago

I don't think, we should mutate KATIB_TRIAL_NAME env to all containers in the Pod. So maybe, similar as in wrapWorkerContainer function, we should only add this environment to the primary container.

andreyvelich commented 2 months ago

Please rebase your PR as well @Electronic-Waste

andreyvelich commented 2 months ago

@Electronic-Waste Additionally, I think we forgot to add this line to the post-gen script of Katib SDK:

# Import Katib report metrics functions
from kubeflow.katib.api.report_metrics import report_metrics

Similar to how we include other imports to the__init__.py. Since this file is auto-generated by code-gen, we need to update it via post-gen script.

@Electronic-Waste Please update this script, otherwise after we re-generate the SDK, your import will be deleted.

Electronic-Waste commented 2 months ago

@andreyvelich Done! I've rebased the branch, modified the code to pass env variable only to the primary container and added the importing code to the post-gen script of Katib SDK.

Electronic-Waste commented 2 months ago

@andreyvelich Thank you for your detailed review!

Electronic-Waste commented 2 months ago

@andreyvelich I've wrapped the code into mutatePodEnv and added some unit tests for it. PTAL👀

google-oss-prow[bot] commented 2 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/kubeflow/katib/blob/master/OWNERS)~~ [andreyvelich] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment