kubeflow / katib

Automated Machine Learning on Kubernetes
https://www.kubeflow.org/docs/components/katib
Apache License 2.0

Experiment stuck due to hyperparameter suggestion pod getting OOM Killed #2260

Open UrkoAT opened 6 months ago

UrkoAT commented 6 months ago

/kind bug

What steps did you take and what happened:

Every time I try to run an experiment (in this case using Bayesian Optimization), after 18-25 trials the pod that schedules the trials with the suggested hyperparameters gets OOMKilled and the experiment does not continue.
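
For reference, this is roughly what the suggestion pod's container status looks like when it happens (abbreviated kubectl get pod -o yaml output; the container name here is just a placeholder):

status:
  containerStatuses:
    - name: suggestion            # placeholder container name
      lastState:
        terminated:
          reason: OOMKilled
          exitCode: 137           # 137 = SIGKILL, which is what the OOM killer sends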

If you manually increase the memory requests and limits of the suggestion deployment, it works like a charm.
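
A minimal sketch of such a manual bump, assuming the suggestion Deployment follows the usual <experiment-name>-<algorithm-name> naming (the container name and sizes below are placeholders; apply with kubectl patch --patch-file):

# e.g. kubectl -n <namespace> patch deployment <experiment-name>-bayesianoptimization --patch-file suggestion-memory.yaml
spec:
  template:
    spec:
      containers:
        - name: suggestion        # check the actual container name in the Deployment
          resources:
            requests:
              memory: 500Mi       # placeholder sizes
            limits:
              memory: 2Gi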

What did you expect to happen:

It should either use less memory and handle this number of trials without any problem, or, more simply, the default memory limit of the suggestion pod should be raised to a more realistic number.

Anything else you would like to add:

The experiment is the following:

metadata:
  name: [REDACTED]
  namespace:  [REDACTED]
  uid: ee155028-e668-4ce3-a737-87e376268b43
  resourceVersion: '68982612'
  generation: 2
  creationTimestamp: '2024-02-09T09:17:02Z'
  finalizers:
    - update-prometheus-metrics
  managedFields:
    - manager: Go-http-client
      operation: Update
      apiVersion: kubeflow.org/v1beta1
      time: '2024-02-09T09:17:02Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
         ...
    - manager: kubectl-edit
      operation: Update
      apiVersion: kubeflow.org/v1beta1
      time: '2024-02-19T08:13:38Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          f:parallelTrialCount: {}
    - manager: Go-http-client
      operation: Update
      apiVersion: kubeflow.org/v1beta1
      time: '2024-02-19T09:06:05Z'
      fieldsType: FieldsV1
      fieldsV1:
       ...
      subresource: status
spec:
  parameters:
    - name: batch_size
      parameterType: discrete
      feasibleSpace:
        list:
          - '32'
          - '64'
          - '128'
          - '256'
    - name: fnok_weight
      parameterType: discrete
      feasibleSpace:
        list:
          - '2.0'
          - '2.8'
          - '3.2'
    - name: nok_weight
      parameterType: discrete
      feasibleSpace:
        list:
          - '2.0'
          - '2.8'
          - '3.2'
    - name: epochs
      parameterType: discrete
      feasibleSpace:
        list:
          - '30'
          - '50'
          - '100'
          - '150'
          - '200'
          - '250'
    - name: model
      parameterType: categorical
      feasibleSpace:
        list:
          - [REDACTED]
          -  [REDACTED]
    - name: base_lr
      parameterType: discrete
      feasibleSpace:
        list:
          - '0.001'
          - '0.0001'
    - name: fraction
      parameterType: discrete
      feasibleSpace:
        list:
          - '0.05'
          - '0.1'
          - '0.25'
          - '0.5'
    - name: trainable_layers
      parameterType: categorical
      feasibleSpace:
        list:
          - dense_12
          - dense_12;dense_11
          - dense_12;dense_11;dense_10;dense_9
          - dense_12;dense_11;dense_10;dense_9;conv2d_12;conv2d_11
          - all
  objective:
    type: maximize
    objectiveMetricName: test_f2
    additionalMetricNames:
      - test_recall
      - test_specificity
      - test_accuracy
      - test_precision
      - model_runid
    metricStrategies:
      - name: test_f2
        value: max
      - name: test_recall
        value: max
      - name: test_specificity
        value: max
      - name: test_accuracy
        value: max
      - name: test_precision
        value: max
      - name: model_runid
        value: latest
  algorithm:
    algorithmName: bayesianoptimization
    algorithmSettings:
      - name: base_estimator
        value: GP
      - name: n_initial_points
        value: '10'
      - name: acq_func
        value: gp_hedge
      - name: acq_optimizer
        value: auto
      - name: random_state
        value: '12'
  trialTemplate:
    retain: true
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        backoffLimit: 1
        template:
          metadata:
            annotations:
              sidecar.istio.io/inject: 'false'
          spec:
            containers:
              - command:
                  - python3
                  - '-u'
                  - scriptKubetrain.py
                  - '--batch_size=${trialParameters.batch_size}'
                  - '--epochs=${trialParameters.epochs}'
                  - '--model=${trialParameters.model}'
                  - '--base_lr=${trialParameters.base_lr}'
                  - '--fnok_weight=${trialParameters.fnok_weight}'
                  - '--nok_weight=${trialParameters.nok_weight}'
                  - '--fraction=${trialParameters.fraction}'
                  - '--referencias=" [REDACTED]"'
                  - '--images_folder=/rx/L4'
                  - '--patches_folders=/data/patches/ALL/ [REDACTED]/'
                  - '--in_memory=False'
                  - '--save_folder=/data/models/ [REDACTED]/'
                  - '--model_name= [REDACTED]'
                  - '--trainable_layers="${trialParameters.trainable_layers}"'
                image: docker-registry:5000/ [REDACTED]
                imagePullPolicy: Always
                name: training-container
                resources:
                  limits:
                    nvidia.com/mig-2g.10gb: 1
                volumeMounts:
                  - mountPath: /data
                    name: mlops
                  - mountPath: /rx
                    name: rx-mount
            imagePullSecrets:
              - name: registry-access-secret
            restartPolicy: Never
            volumes:
              - name: mlops
                nfs:
                  path: /mlops
                  server: nfs-srv
              - flexVolume:
                  driver: fstab/cifs
                  fsType: cifs
                  options:
                    mountOptions: iocharset=utf8,file_mode=0777,dir_mode=0777,noperm
                    networkPath:  [REDACTED]
                  secretRef:
                    name: cifs-creds
                name: rx-mount
    trialParameters:
      - name: batch_size
        reference: batch_size
      - name: epochs
        reference: epochs
      - name: model
        reference: model
      - name: base_lr
        reference: base_lr
      - name: fraction
        reference: fraction
      - name: fnok_weight
        reference: fnok_weight
      - name: nok_weight
        reference: nok_weight
      - name: trainable_layers
        reference: trainable_layers
    primaryContainerName: training-container
    successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
  parallelTrialCount: 4
  maxTrialCount: 1000
  maxFailedTrialCount: 10
  metricsCollectorSpec:
    collector:
      kind: StdOut
  resumePolicy: LongRunning
status:
  startTime: '2024-02-09T09:16:41Z'
  conditions:
    - type: Created
      status: 'True'
      reason: ExperimentCreated
      message: Experiment is created
      lastUpdateTime: '2024-02-09T09:16:41Z'
      lastTransitionTime: '2024-02-09T09:16:41Z'
    - type: Running
      status: 'True'
      reason: ExperimentRunning
      message: Experiment is running
      lastUpdateTime: '2024-02-09T09:17:24Z'
      lastTransitionTime: '2024-02-09T09:17:24Z'
  currentOptimalTrial:
    ...

Environment:


Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

andreyvelich commented 6 months ago

Thank you for creating this issue @UrkoAT. This is interesting; by default we use the following resources for the Suggestion deployment: https://github.com/kubeflow/katib/blob/master/pkg/apis/config/v1beta1/defaults.go#L28-L34.

I tried to run your example with those HPs and, yes, my Suggestion went OOM after 12 Trials. This is the latest log that I got:


INFO:pkg.suggestion.v1beta1.skopt.base_service:----------------------------------------------------------------------------------------------------
INFO:pkg.suggestion.v1beta1.skopt.base_service:New GetSuggestions call with current request number: 3
INFO:pkg.suggestion.v1beta1.skopt.base_service:Succeeded Trials changed: 12
INFO:pkg.suggestion.v1beta1.skopt.base_service:Running Optimizer tell to record observation
INFO:pkg.suggestion.v1beta1.skopt.base_service:Evaluated parameters: [['256', '2.8', '3.2', '50', 'test', '0.001', '0.1', 'dense_12;dense_11'], ['256', '2.8', '3.2', '30', 'test2', '0.0001', '0.01', 'dense_12']]
INFO:pkg.suggestion.v1beta1.skopt.base_service:Objective values: [-2.347, -3.06]
INFO:pkg.suggestion.v1beta1.skopt.base_service:Optimizer tell method takes 1 seconds
INFO:pkg.suggestion.v1beta1.skopt.base_service:List of recorded Trials names: ['test-memory-6kjct2bf', 'test-memory-nqvprjz6', 'test-memory-jr9cczjb', 'test-memory-cbpx68tw', 'test-memory-7hxnzbng', 'test-memory-fh2kzbf8', 'test-memory-t2mcsc7j', 'test-memory-lcs7phk4', 'test-memory-gc7mcfk7', 'test-memory-xw4klqgh', 'test-memory-697h6ccg', 'test-memory-dx9qt2p7']

We use Skopt for the Bayesian Optimization algorithm: https://github.com/scikit-optimize/scikit-optimize. It seems that as the number of hyperparameters grows, Skopt requires more resources to store them in memory.

What do you think about it @johnugeorge @tenzen-y? Maybe by default we should not specify resources for the Suggestion and Metrics Collector deployments (e.g. instead of asking users to set -1 in the Katib Config, we just remove the resources assignment)?

If users want to limit the resources that the Suggestion can consume, they can always do it via the Katib Config: https://www.kubeflow.org/docs/components/katib/katib-config/
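
For example, an entry in the suggestions list of the katib-config that drops the default memory limit via that -1 convention might look roughly like this (a sketch only; the exact schema depends on the Katib version):

- algorithmName: bayesianoptimization
  image: docker.io/kubeflowkatib/suggestion-skopt:v0.16.0-rc.1
  resources:
    limits:
      memory: -1        # negative value = do not set a memory limit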

UrkoAT commented 6 months ago

Thanks for answering this fast lol. Reading the documentation you just provided, I realize it isn't necessary to manually change the resources in each experiment. While you discuss what should be done, anyone else who hits this problem can, for now, kubectl edit cm -n kubeflow katib-config and, under suggestions:, specify the resources for any algorithm. For example, in my case:

- algorithmName: bayesianoptimization
  image: docker.io/kubeflowkatib/suggestion-skopt:v0.16.0-rc.1
  resources:
    requests:
      memory: 100Mi
    limits:
      memory: 1Gi

I hope it helps.

tenzen-y commented 6 months ago

What do you think about it @johnugeorge @tenzen-y? Maybe by default we should not specify resources for the Suggestion and Metrics Collector deployments (e.g. instead of asking users to set -1 in the Katib Config, we just remove the resources assignment)?

@andreyvelich, I think we should keep setting default resources on the suggestion services regardless of whether we increase the defaults. Pods without resource requests/limits get the BestEffort QoS class, and as a result we would allow the k8s cluster to evict the suggestion pods even though they are mandatory for the Experiments.
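
To illustrate how the QoS class falls out of the resources stanza (a generic Kubernetes sketch; the values are placeholders, not the actual Katib defaults):

resources:
  requests:
    memory: 100Mi       # requests below limits -> Burstable
  limits:
    memory: 1Gi
# omit requests and limits entirely              -> BestEffort (evicted first under node memory pressure)
# set requests equal to limits (all containers)  -> Guaranteed (evicted last)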

Eviction of the suggestion Pods would then happen often and would confuse users, since ML clusters are generally busier than other types of clusters.

andreyvelich commented 6 months ago

When you say Kubernetes can evict the Suggestion pod, can you explain it please @tenzen-y? E.g. we don't set a Pod Priority for our Suggestion pods or other parameters that mark the pod as evictable: https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#pod-selection-for-kubelet-eviction
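
(For clarity, by "Pod Priority" I mean something like the following, which Katib does not set today; this is only an illustration with hypothetical names:)

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: katib-suggestion-priority   # hypothetical name
value: 1000000
globalDefault: false
description: Hypothetical priority class that could protect Suggestion pods from preemption
# referenced from the Suggestion pod spec via:
#   priorityClassName: katib-suggestion-priority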

Also, if you take a look at our Katib Controller Deployment, we don't specify resources there either: https://github.com/kubeflow/katib/blob/master/manifests/v1beta1/components/controller/controller.yaml

Should it be the users' or admins' responsibility to specify appropriate Pod resources for the Katib components?

andreyvelich commented 6 months ago

I guess you mean this BestEffort QoS class: https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/#besteffort. The question for us is which is better:

  1. Users see OOM on their Suggestion pods.
  2. Users see the pod being evicted by Kubernetes if the Node doesn't have enough resources to run the Suggestion pod.

What are your thoughts @johnugeorge?

tenzen-y commented 6 months ago

I guess you mean this BestEffort QoS class: https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/#besteffort. The question for us is which is better:

  1. Users see OOM on their Suggestion pods.
  2. Users see the pod being evicted by Kubernetes if the Node doesn't have enough resources to run the Suggestion pod.

What are your thoughts @johnugeorge?

Yes, I meant what you said.

github-actions[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

andreyvelich commented 3 months ago

/good-first-issue
/help

google-oss-prow[bot] commented 3 months ago

@andreyvelich: This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.

In response to [this](https://github.com/kubeflow/katib/issues/2260):

> /good-first-issue
> /help

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

Souradip121 commented 1 week ago

/assign