kubeflow / katib

Automated Machine Learning on Kubernetes
https://www.kubeflow.org/docs/components/katib
Apache License 2.0

Katib experiment not running, stuck with message "Couldn't find any successful Trial." #2163

Closed: yadavvij closed this issue 10 months ago

yadavvij commented 1 year ago

/kind bug

What steps did you take and what happened:

  1. Opened the "Edit YAML" option to create an experiment.
  2. Edited the YAML with the file below and clicked "Create".
  3. The experiment gets stuck with the message "Couldn't find any successful Trial." (screenshots attached)

The following logs are seen from the Katib controller:

{"level":"info","ts":1686824428.8087966,"logger":"suggestion-controller","msg":"Update suggestion instance status failed, reconciler requeued","Suggestion":"user01/tfjob-mnist-example","err":"Operation cannot be fulfilled on suggestions.kubeflow.org \"tfjob-mnist-example\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":1686824442.3886456,"logger":"suggestion-client","msg":"Algorithm settings are validated","Suggestion":"user01/tfjob-mnist-example"}
{"level":"info","ts":1686824442.3887098,"logger":"suggestion-controller","msg":"Sync assignments","Suggestion":"user01/tfjob-mnist-example","Suggestion Requests":3,"Suggestion Count":0}
{"level":"info","ts":1686824442.3950624,"logger":"suggestion-client","msg":"Getting suggestions","Suggestion":"user01/tfjob-mnist-example","endpoint":"tfjob-mnist-example-random.user01:6789","Number of current request parameters":3,"Number of response parameters":3}
{"level":"info","ts":1686824442.4057593,"logger":"experiment-controller","msg":"Statistics","Experiment":"user01/tfjob-mnist-example","requiredActiveCount":3,"parallelCount":3,"activeCount":0,"completedCount":0}
{"level":"info","ts":1686824442.4057767,"logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":"user01/tfjob-mnist-example","addCount":3}
{"level":"info","ts":1686824442.405782,"logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":"user01/tfjob-mnist-example","name":"tfjob-mnist-example","Suggestion Requests":3}
{"level":"info","ts":1686824442.4058278,"logger":"suggestion-controller","msg":"Sync assignments","Suggestion":"user01/tfjob-mnist-example","Suggestion Requests":3,"Suggestion Count":3}
{"level":"info","ts":1686824442.427896,"logger":"experiment-controller","msg":"Created Trials","Experiment":"user01/tfjob-mnist-example","trialNames":["tfjob-mnist-example-2ff4vxph","tfjob-mnist-example-dcdlnmwl","tfjob-mnist-example-6rfqc8lm"]}
{"level":"info","ts":1686824442.4427319,"logger":"experiment-controller","msg":"Update experiment instance status failed, reconciler requeued","Experiment":"user01/tfjob-mnist-example","err":"Operation cannot be fulfilled on experiments.kubeflow.org \"tfjob-mnist-example\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":1686824442.4563751,"logger":"trial-controller","msg":"Creating Job","Trial":"user01/tfjob-mnist-example-2ff4vxph","kind":"TFJob","name":"tfjob-mnist-example-2ff4vxph"}
{"level":"info","ts":1686824442.462671,"logger":"trial-controller","msg":"Trial status changed to Running","Trial":"user01/tfjob-mnist-example-2ff4vxph"}
{"level":"info","ts":1686824442.4865818,"logger":"experiment-controller","msg":"Update experiment instance status failed, reconciler requeued","Experiment":"user01/tfjob-mnist-example","err":"Operation cannot be fulfilled on experiments.kubeflow.org \"tfjob-mnist-example\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":1686824442.4867158,"logger":"trial-controller","msg":"Creating Job","Trial":"user01/tfjob-mnist-example-dcdlnmwl","kind":"TFJob","name":"tfjob-mnist-example-dcdlnmwl"}
{"level":"info","ts":1686824442.4927397,"logger":"trial-controller","msg":"Trial status changed to Running","Trial":"user01/tfjob-mnist-example-dcdlnmwl"}
{"level":"info","ts":1686824442.5024211,"logger":"experiment-controller","msg":"Update experiment instance status failed, reconciler requeued","Experiment":"user01/tfjob-mnist-example","err":"Operation cannot be fulfilled on experiments.kubeflow.org \"tfjob-mnist-example\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":1686824442.51115,"logger":"trial-controller","msg":"Creating Job","Trial":"user01/tfjob-mnist-example-6rfqc8lm","kind":"TFJob","name":"tfjob-mnist-example-6rfqc8lm"}
{"level":"info","ts":1686824442.5161304,"logger":"experiment-controller","msg":"Update experiment instance status failed, reconciler requeued","Experiment":"user01/tfjob-mnist-example","err":"Operation cannot be fulfilled on experiments.kubeflow.org \"tfjob-mnist-example\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":1686824442.5200617,"logger":"trial-controller","msg":"Trial status changed to Running","Trial":"user01/tfjob-mnist-example-6rfqc8lm"}
{"level":"info","ts":1686824442.5349317,"logger":"experiment-controller","msg":"Update experiment instance status failed, reconciler requeued","Experiment":"user01/tfjob-mnist-example","err":"Operation cannot be fulfilled on experiments.kubeflow.org \"tfjob-mnist-example\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":1686824444.4125206,"logger":"injector-webhook","msg":"Inject metrics collector sidecar container","Pod Name":"tfjob-mnist-example-2ff4vxph-worker-0","Trial":"tfjob-mnist-example-2ff4vxph"}
{"level":"info","ts":1686824446.3909032,"logger":"injector-webhook","msg":"Inject metrics collector sidecar container","Pod Name":"tfjob-mnist-example-2ff4vxph-worker-1","Trial":"tfjob-mnist-example-2ff4vxph"}
{"level":"info","ts":1686824448.3626342,"logger":"injector-webhook","msg":"Inject metrics collector sidecar container","Pod Name":"tfjob-mnist-example-dcdlnmwl-worker-0","Trial":"tfjob-mnist-example-dcdlnmwl"}
{"level":"info","ts":1686824450.3192122,"logger":"injector-webhook","msg":"Inject metrics collector sidecar container","Pod Name":"tfjob-mnist-example-dcdlnmwl-worker-1","Trial":"tfjob-mnist-example-dcdlnmwl"}
{"level":"info","ts":1686824452.2932706,"logger":"injector-webhook","msg":"Inject metrics collector sidecar container","Pod Name":"tfjob-mnist-example-6rfqc8lm-worker-0","Trial":"tfjob-mnist-example-6rfqc8lm"}
{"level":"info","ts":1686824454.251165,"logger":"injector-webhook","msg":"Inject metrics collector sidecar container","Pod Name":"tfjob-mnist-example-6rfqc8lm-worker-1","Trial":"tfjob-mnist-example-6rfqc8lm"}

YAML file used to create this experiment:

apiVersion: kubeflow.org/v1
kind: Experiment
metadata:
  namespace: user01
  name: tfjob-mnist-example
spec:
  parallelTrialCount: 3
  maxTrialCount: 8
  maxFailedTrialCount: 3
  objective:
    type: maximize
    goal: 0.7
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: random
  parameters:

Environment:


Impacted by this bug? Give it a 👍 We prioritize the issues with the most 👍

andreyvelich commented 1 year ago

@yadavvij Can you please check the logs from the Trial TFJob pods?

kubectl logs tfjob-mnist-example-2ff4vxph-worker-0 -n user01

Also try to describe one of the Trials:

kubectl describe trial tfjob-mnist-example-2ff4vxph -n user01
yadavvij commented 1 year ago

@andreyvelich, I have tried checking these logs earlier as well; there was no output because the pods are in a NotReady state. I have attached screenshots for both commands. Please let me know if you need anything else.

tfjob Screenshot
andreyvelich commented 1 year ago

@yadavvij I think the problem is that you didn't properly disable the Istio sidecar for your training TFJob Pods. Please add this annotation sidecar.istio.io/inject: 'false' under trialSpec.spec.template.tfReplicaSpecs.Worker.template.metadata.annotations, similar to this example: https://www.kubeflow.org/docs/components/training/tftraining/#what-is-tfjob
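
For reference, a minimal sketch of the Worker replica portion of a TFJob trial spec with that annotation in place (the container name and image here are placeholders, not taken from this thread):

tfReplicaSpecs:
  Worker:
    replicas: 2
    restartPolicy: OnFailure
    template:
      metadata:
        annotations:
          sidecar.istio.io/inject: 'false'   # disables the Istio sidecar for the training pods
      spec:
        containers:
          - name: tensorflow
            image: <your-training-image>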

yadavvij commented 1 year ago

Thank you @andreyvelich, the above solution worked for TFJob. Could you also help me with XGBoostJob and PyTorchJob? I am facing the same issue with both. Attaching the YAML for each; please guide me on where to change it. xgboost.yaml

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  namespace: user01
  name: xgboost-job-lightgbm
spec:
  objective:
    type: maximize
    goal: 0.7
    objectiveMetricName: valid_1 auc
    additionalMetricNames:

pytorchjob

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  namespace: user01
  name: pytorchjob-mnist
spec:
  parallelTrialCount: 3
  maxTrialCount: 5
  maxFailedTrialCount: 3
  objective:
    type: minimize
    goal: 0.1
    objectiveMetricName: loss
  algorithm:
    algorithmName: random
  parameters:

andreyvelich commented 1 year ago

@yadavvij Similar to TFJob, you should set the Istio annotation to disable the sidecar. For PyTorchJob:

trialSpec.spec.template.pytorchReplicaSpecs.Master.template.metadata.annotations

For XGBoost, you use just a Kubernetes Job and you set the annotation correctly. Did you see any errors?

yadavvij commented 1 year ago

@andreyvelich, I tried creating the PyTorchJob with the below YAML after disabling Istio and am still facing the same issue. Please let me know if I am doing something wrong in the YAML. PyTorch yaml:

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  namespace: user01
  name: pytorchjob-mnist
spec:
  parallelTrialCount: 3
  maxTrialCount: 5
  maxFailedTrialCount: 3
  objective:
    type: minimize
    goal: 0.1
    objectiveMetricName: loss
  algorithm:
    algorithmName: random
  parameters:

yadavvij commented 1 year ago

@andreyvelich I got this error while creating the experiment with XGBoost as a normal Kubernetes Job (screenshot attached).

yadavvij commented 1 year ago

@andreyvelich I closed this by mistake; it is still not resolved.

andreyvelich commented 1 year ago

@yadavvij Can you please show the Trial Template that you are trying to use in the UI?

yadavvij commented 1 year ago

@andreyvelich please let me know what you mean by trial template. I am attaching a screenshot and the XGBoost job YAML.

xgboostjob

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  namespace: user01
  name: xgboost-job-lightgbm
spec:
  objective:
    type: maximize
    goal: 0.7
    objectiveMetricName: valid_1 auc
    additionalMetricNames:

andreyvelich commented 1 year ago

@yadavvij How did you get this error message: https://github.com/kubeflow/katib/issues/2163#issuecomment-1622279900? Did you submit the Experiment YAML that you provided in the Katib UI by clicking edit and submit YAML?

yadavvij commented 1 year ago

@andreyvelich yes, as mentioned in my issue, I am creating the experiment by clicking edit and submit YAML.

andreyvelich commented 1 year ago

@yadavvij Can you show me the formatted YAML that you are trying to submit? (You can paste the formatted YAML inside a ```yaml code block.)

E.g.

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  namespace: kubeflow
  name: random
spec:
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: Validation-accuracy
    additionalMetricNames:
      - Train-accuracy
  algorithm:
    algorithmName: random
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.03"
    - name: num-layers
      parameterType: int
      feasibleSpace:
        min: "2"
        max: "5"
    - name: optimizer
      parameterType: categorical
      feasibleSpace:
        list:
          - sgd
          - adam
          - ftrl
  trialTemplate:
    primaryContainerName: training-container
    trialParameters:
      - name: learningRate
        description: Learning rate for the training model
        reference: lr
      - name: numberLayers
        description: Number of training model layers
        reference: num-layers
      - name: optimizer
        description: Training model optimizer (sdg, adam or ftrl)
        reference: optimizer
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          metadata:
            annotations:
              sidecar.istio.io/inject: 'false'
          spec:
            containers:
              - name: training-container
                image: docker.io/kubeflowkatib/mxnet-mnist:latest
                command:
                  - "python3"
                  - "/opt/mxnet-mnist/mnist.py"
                  - "--batch-size=64"
                  - "--lr=${trialParameters.learningRate}"
                  - "--num-layers=${trialParameters.numberLayers}"
                  - "--optimizer=${trialParameters.optimizer}"
                resources:
                  limits:
                    memory: "1Gi"
                    cpu: "0.5"
            restartPolicy: Never
yadavvij commented 1 year ago

Sure @andreyvelich, I will paste it below. XGBoost job:


apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  namespace: user01
  name: xgboost-job-lightgbm
spec:
  objective:
    type: maximize
    goal: 0.7
    objectiveMetricName: valid_1 auc
    additionalMetricNames:
      - valid_1 binary_logloss
      - training auc
      - training binary_logloss
  metricsCollectorSpec:
    source:
      filter:
        metricsFormat:
          - "(\\w+\\s\\w+)\\s:\\s((-?\\d+)(\\.\\d+)?)"
  algorithm:
    algorithmName: random
  parallelTrialCount: 2
  maxTrialCount: 6
  maxFailedTrialCount: 3
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.1"
    - name: num-leaves
      parameterType: int
      feasibleSpace:
        min: "50"
        max: "60"
        step: "1"
  trialTemplate:
    primaryContainerName: training-container
    trialParameters:
      - name: learningRate
        description: Learning rate for the training model
        reference: lr
      - name: numberLeaves
        description: Number of leaves for one tree
        reference: num-leaves
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          metadata:
            annotations:
              sidecar.istio.io/inject: 'false'
          spec:
          containers:
             - name: training-container
               image: docker.io/kubeflowkatib/xgboost-lightgbm:1.0
               ports:
                 - containerPort: 9991
                   name: xgboostjob-port
                   imagePullPolicy: Always
                   args:
                     - --job_type=Train
                     - --metric=binary_logloss,auc
                     - --learning_rate=${trialParameters.learningRate}
                     - --num_leaves=${trialParameters.numberLeaves}
                     - --num_trees=100
                     - --boosting_type=gbdt
                     - --objective=binary
                     - --metric_freq=1
                     - --is_training_metric=true
                     - --max_bin=255
                     - --data=data/binary.train
                     - --valid_data=data/binary.test
                     - --tree_learner=feature
                     - --feature_fraction=0.8
                     - --bagging_freq=5
                     - --bagging_fraction=0.8
                     - --min_data_in_leaf=50
                     - --min_sum_hessian_in_leaf=50
                     - --is_enable_sparse=true
                     - --use_two_round_loading=false
                     - --is_save_binary_file=false
yadavvij commented 1 year ago

pytorch job


apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  namespace: user01
  name: pytorchjob-mnist
spec:
  parallelTrialCount: 3
  maxTrialCount: 5
  maxFailedTrialCount: 3
  objective:
    type: minimize
    goal: 0.1
    objectiveMetricName: loss
  algorithm:
    algorithmName: random
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.05"
    - name: momentum
      parameterType: double
      feasibleSpace:
        min: "0.5"
        max: "0.9"
  trialTemplate:
    primaryContainerName: pytorch
    primaryPodLabels:
      training.kubeflow.org/replica-type: worker
    trialParameters:
      - name: learningRate
        description: Learning rate for the training model
        reference: lr
      - name: momentum
        description: Momentum for the training model
        reference: momentum
    trialSpec:
      apiVersion: kubeflow.org/v1
      kind: PyTorchJob
      spec:
        template:
          metadata:
            annotations:
              sidecar.istio.io/inject: 'false'
        pytorchReplicaSpecs:
          Worker:
            replicas: 2
            restartPolicy: OnFailure
            template:
              metadata:
                annotations:
                  sidecar.istio.io/inject: 'false'
              spec:
                containers:
                  - name: pytorch
                    image: docker.io/kubeflowkatib/pytorch-mnist-v0.14.0
                    command:
                      - "python3"
                      - "/opt/pytorch-mnist/mnist.py"
                      - "--epochs=1"
                      - "--batch-size=16"
                      - "--lr=${trialParameters.learningRate}"
                      - "--momentum=${trialParameters.momentum}"
andreyvelich commented 1 year ago

@yadavvij In the XGBoost YAML you are missing indentation in trialSpec.spec.template.spec.containers.

yadavvij commented 1 year ago

@andreyvelich I corrected the indentation as suggested above, but I am still getting the below error (screenshot attached).


apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  namespace: user01
  name: xgboost-job-lightgbm
spec:
  objective:
    type: maximize
    goal: 0.7
    objectiveMetricName: valid_1 auc
    additionalMetricNames:
      - valid_1 binary_logloss
      - training auc
      - training binary_logloss
  metricsCollectorSpec:
    source:
      filter:
        metricsFormat:
          - "(\\w+\\s\\w+)\\s:\\s((-?\\d+)(\\.\\d+)?)"
  algorithm:
    algorithmName: random
  parallelTrialCount: 2
  maxTrialCount: 6
  maxFailedTrialCount: 3
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.1"
    - name: num-leaves
      parameterType: int
      feasibleSpace:
        min: "50"
        max: "60"
        step: "1"
  trialTemplate:
    primaryContainerName: training-container
    trialParameters:
      - name: learningRate
        description: Learning rate for the training model
        reference: lr
      - name: numberLeaves
        description: Number of leaves for one tree
        reference: num-leaves
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          metadata:
            annotations:
              sidecar.istio.io/inject: 'false'
          spec:
            containers:
              - name: training-container
                image: docker.io/kubeflowkatib/xgboost-lightgbm:1.0
                ports:
                  - containerPort: 9991
                    name: xgboostjob-port
                    imagePullPolicy: Always
                    args:
                      - --job_type=Train
                      - --metric=binary_logloss,auc
                      - --learning_rate=${trialParameters.learningRate}
                      - --num_leaves=${trialParameters.numberLeaves}
                      - --num_trees=100
                      - --boosting_type=gbdt
                      - --objective=binary
                      - --metric_freq=1
                      - --is_training_metric=true
                      - --max_bin=255
                      - --data=data/binary.train
                      - --valid_data=data/binary.test
                      - --tree_learner=feature
                      - --feature_fraction=0.8
                      - --bagging_freq=5
                      - --bagging_fraction=0.8
                      - --min_data_in_leaf=50
                      - --min_sum_hessian_in_leaf=50
                      - --is_enable_sparse=true
                      - --use_two_round_loading=false
                      - --is_save_binary_file=false
andreyvelich commented 1 year ago

@yadavvij I think imagePullPolicy and args still have incorrect indentation.
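
To illustrate the intended structure, a minimal sketch of the container block with imagePullPolicy, ports, and args as siblings at the container level rather than nested under the ports entry (the args list is abridged; the full list is in the YAML above):

containers:
  - name: training-container
    image: docker.io/kubeflowkatib/xgboost-lightgbm:1.0
    imagePullPolicy: Always        # sibling of image/ports, not a child of ports
    ports:
      - containerPort: 9991
        name: xgboostjob-port
    args:                          # sibling of ports, not a child of the port entry
      - --job_type=Train
      - --learning_rate=${trialParameters.learningRate}
      - --num_leaves=${trialParameters.numberLeaves}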

yadavvij commented 1 year ago

@andreyvelich I corrected the indentation and the experiment got created, but I am still not able to get a successful experiment: "Couldn't find any successful Trial."


apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  namespace: user01
  name: xgboost-job-lightgbm
spec:
  objective:
    type: maximize
    goal: 0.7
    objectiveMetricName: valid_1 auc
    additionalMetricNames:
      - valid_1 binary_logloss
      - training auc
      - training binary_logloss
  metricsCollectorSpec:
    source:
      filter:
        metricsFormat:
          - "(\\w+\\s\\w+)\\s:\\s((-?\\d+)(\\.\\d+)?)"
  algorithm:
    algorithmName: random
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.1"
    - name: num-leaves
      parameterType: int
      feasibleSpace:
        min: "50"
        max: "60"
        step: "1"
  trialTemplate:
    primaryContainerName: training-container
    trialParameters:
      - name: learningRate
        description: Learning rate for the training model
        reference: lr
      - name: numberLeaves
        description: Number of leaves for one tree
        reference: num-leaves
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          metadata:
            annotations:
              sidecar.istio.io/inject: 'false'
          spec:
            containers:
              - args:
                  - --job_type=Train
                  - --metric=binary_logloss,auc
                  - --learning_rate=${trialParameters.learningRate}
                  - --num_leaves=${trialParameters.numberLeaves}
                  - --num_trees=100
                  - --boosting_type=gbdt
                  - --objective=binary
                  - --metric_freq=1
                  - --is_training_metric=true
                  - --max_bin=255
                  - --data=data/binary.train
                  - --valid_data=data/binary.test
                  - --tree_learner=feature
                  - --feature_fraction=0.8
                  - --bagging_freq=5
                  - --bagging_fraction=0.8
                  - --min_data_in_leaf=50
                  - --min_sum_hessian_in_leaf=50
                  - --is_enable_sparse=true
                  - --use_two_round_loading=false
                  - --is_save_binary_file=false
                image: docker.io/kubeflowkatib/xgboost-lightgbm:1.0
                imagePullPolicy: Always
                name: xgboost
                ports:
                  - containerPort: 9991
                    name: xgboostjob-port 
                    protocol: TCP
andreyvelich commented 1 year ago

"Couldn't find any successful Trial."

I think you are also missing restartPolicy for your Trial Job. For a Kubernetes batch Job it is necessary to set this value. Please set the restart policy, similar to this example: https://github.com/kubeflow/katib/blob/master/examples/v1beta1/hp-tuning/hyperband.yaml#L81C13-L81C33
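
Concretely, that means adding restartPolicy at the same level as containers in the trial Job's pod template. A minimal sketch with only the relevant fields shown (the rest of the spec stays as in the YAML above):

trialSpec:
  apiVersion: batch/v1
  kind: Job
  spec:
    template:
      spec:
        restartPolicy: Never        # required for a batch Job trial, as in the linked example
        containers:
          - name: training-container
            image: docker.io/kubeflowkatib/xgboost-lightgbm:1.0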

yadavvij commented 1 year ago

@andreyvelich the above YAML for XGBoost runs fine, but when I try to run it with kind: XGBoostJob instead of Job, it gives the same error "Couldn't find any successful Trial". Same issue with PyTorchJob. Why doesn't it run with kind XGBoostJob or PyTorchJob?

YAML for PyTorchJob:


apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  namespace: user01
  name: pytorchjob-mnist
spec:
  parallelTrialCount: 3
  maxTrialCount: 5
  maxFailedTrialCount: 3
  objective:
    type: minimize
    goal: 0.1
    objectiveMetricName: loss
  algorithm:
    algorithmName: random
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.05"
    - name: momentum
      parameterType: double
      feasibleSpace:
        min: "0.5"
        max: "0.9"
  trialTemplate:
    primaryContainerName: pytorch
    primaryPodLabels:
      training.kubeflow.org/replica-type: worker
    trialParameters:
      - name: learningRate
        description: Learning rate for the training model
        reference: lr
      - name: momentum
        description: Momentum for the training model
        reference: momentum
    trialSpec:
      apiVersion: kubeflow.org/v1
      kind: PyTorchJob
      spec:
        template:
          metadata:
            annotations:
              sidecar.istio.io/inject: 'false'
        pytorchReplicaSpecs:
          Worker:
            replicas: 2
            restartPolicy: OnFailure
            template:
              metadata:
                annotations:
                  sidecar.istio.io/inject: 'false'
              spec:
                containers:
                  - name: pytorch
                    image: docker.io/kubeflowkatib/pytorch-mnist-v0.14.0
                    command:
                      - "python3"
                      - "/opt/pytorch-mnist/mnist.py"
                      - "--epochs=1"
                      - "--batch-size=16"
                      - "--lr=${trialParameters.learningRate}"
                      - "--momentum=${trialParameters.momentum}"
andreyvelich commented 1 year ago

@yadavvij I think you also set incorrect YAML for the PyTorchJob, here: trialSpec.spec.template. The PyTorchJob doesn't have such an API. The Istio annotation should be in only one place: trialSpec.spec.pytorchReplicaSpecs.Worker.template.metadata.annotations

Please refer to this example for how to set up the PyTorchJob correctly: https://github.com/kubeflow/katib/blob/master/examples/v1beta1/kubeflow-training-operator/pytorchjob-mnist.yaml#L38-L71
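
Applied to the YAML posted above, the trialSpec would drop the extra spec.template block and keep the annotation only under the Worker replica template, roughly like this (a sketch based on the spec already shared in this thread, not a verified working manifest):

trialSpec:
  apiVersion: kubeflow.org/v1
  kind: PyTorchJob
  spec:
    pytorchReplicaSpecs:
      Worker:
        replicas: 2
        restartPolicy: OnFailure
        template:
          metadata:
            annotations:
              sidecar.istio.io/inject: 'false'   # the annotation lives only here
          spec:
            containers:
              - name: pytorch
                image: docker.io/kubeflowkatib/pytorch-mnist-v0.14.0
                command:
                  - "python3"
                  - "/opt/pytorch-mnist/mnist.py"
                  - "--epochs=1"
                  - "--batch-size=16"
                  - "--lr=${trialParameters.learningRate}"
                  - "--momentum=${trialParameters.momentum}"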

andreyvelich commented 1 year ago

Hi @yadavvij, any success with modifying the annotation and API spec?

rogeryuchao commented 1 year ago

Hello @andreyvelich, I solved the NotReady issue by force-disabling the istio-proxy sidecar. However, I don't think this is best practice, because from the experiment job pod I can now connect to services in other namespaces, which defeats the purpose of multi-tenancy. Do you think it is possible to fix the traffic-forwarding issue in the near future?

Also, the experiment deployments (suggestion/early-stopping), the Katib DB manager, and the Katib MySQL are without the protection of Istio sidecar mTLS communication. Do you think this could become a serious security issue? Thanks a lot in advance.

andreyvelich commented 1 year ago

Do you think it is possible to fix the traffic forwarding issue in near future?

Katib doesn't block traffic for your Trials. If you create just a PyTorchJob with some test hyperparameters, it still fails because the docker.io/kubeflowkatib/pytorch-mnist-v0.14.0 image downloads the MNIST dataset from the internet. If you configure the Istio proxy to allow external access, or build a Docker image with the dataset pre-loaded, you can make it work with the Istio sidecar.
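
As one illustration of the "allow external access" route, a hedged sketch of an Istio ServiceEntry that would let workloads in the mesh reach an external dataset host over HTTPS. The resource name and hostname are placeholders; whether this is needed at all depends on your mesh's outboundTrafficPolicy:

apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: allow-dataset-download      # placeholder name
  namespace: user01
spec:
  hosts:
    - example-dataset-host.com      # placeholder: host your training image downloads data from
  location: MESH_EXTERNAL
  resolution: DNS
  ports:
    - number: 443
      name: https
      protocol: TLS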

Also, the experiment deployment(suggestion/early-stopping), katib DB manager, and katib mysql are also without the protection of istio sidecar mTLS communication, do you think it will become a severity security issue?

What kind of security issues do you see here? We discussed previously that the Katib DB Manager currently exposes a gRPC API to report/get metrics for Trials: https://github.com/kubeflow/katib/issues/2022#issuecomment-1320200136. That API can be used only if you have access to your Kubernetes cluster. Similarly, the Suggestion Deployment exposes an API to get hyperparameters from the algorithm service. Is there something specific that concerns you?

yadavvij commented 1 year ago

Hi @yadavvij, any success with modifying the annotation and API spec?

Yes, I was able to successfully create the experiments with the above suggestions. Thank you for all the help @andreyvelich :)

yadavvij commented 1 year ago

Hi @andreyvelich, can you please suggest an example repo for deploying an application to a Kubernetes cluster through GitHub Actions runners? Please share a link if one is available. It would be a great help if I could get the YAML configuration for this.

andreyvelich commented 1 year ago

@yadavvij You can take a look at self-hosted runners: https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/about-self-hosted-runners. You can configure your GitHub Actions workflow to deploy the control plane on an existing Kubernetes cluster that your runner is connected to.

For our Katib E2Es, we use minikube to deploy the Katib control plane; then we run a Katib Experiment on that cluster.
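
As a rough illustration of that setup, a minimal workflow sketch that runs on a self-hosted runner which already has kubectl access to the target cluster. The workflow name and the manifests/ directory are hypothetical placeholders, not something from this repo:

# .github/workflows/deploy.yaml (hypothetical example)
name: deploy-to-cluster
on:
  push:
    branches: [main]
jobs:
  deploy:
    # runs on a self-hosted runner that can already reach the cluster
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
      - name: Apply manifests
        run: kubectl apply -f manifests/   # placeholder directory containing your YAML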

github-actions[bot] commented 11 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 10 months ago

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.