Closed: @yadavvij closed this issue 10 months ago.
@yadavvij Please can you check the logs from the Trial TFJob pods?
kubectl logs tfjob-mnist-example-2ff4vxph-worker-0 -n user01
Also try to describe one of the Trials:
kubectl describe trial tfjob-mnist-example-2ff4vxph -n user01
@andreyvelich I have tried checking these logs earlier as well; there was no output because the pods are in a NotReady state. I have attached screenshots for both commands. Please let me know if you need anything else.
@yadavvij I think the problem is that you didn't properly disable the Istio sidecar for your training TFJob pods.
Please add this annotation sidecar.istio.io/inject: 'false' under trialSpec.spec.tfReplicaSpecs.Worker.template.metadata.annotations, similar to this example: https://www.kubeflow.org/docs/components/training/tftraining/#what-is-tfjob
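Concretely, the annotation placement described above looks like this inside the Experiment's trialSpec (a trimmed sketch; the replica count and image are placeholders, not values from this thread):

```yaml
trialTemplate:
  trialSpec:
    apiVersion: kubeflow.org/v1
    kind: TFJob
    spec:
      tfReplicaSpecs:
        Worker:
          replicas: 2
          restartPolicy: OnFailure
          template:
            metadata:
              annotations:
                # Disable Istio sidecar injection for the training pods
                sidecar.istio.io/inject: "false"
            spec:
              containers:
                - name: tensorflow
                  image: <your-training-image>
```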
Thank you @andreyvelich, the above solution worked for TFJob. Could you also help me with XGBoostJob and PyTorchJob? I am facing the same issue with both. I am attaching the YAML for each; please guide me on where to make the change. xgboost.yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  namespace: user01
  name: xgboost-job-lightgbm
spec:
  objective:
    type: maximize
    goal: 0.7
    objectiveMetricName: valid_1 auc
    additionalMetricNames:
  trialTemplate:
    trialParameters:
    - name: numberLeaves
      description: Number of leaves for one tree
      reference: num-leaves
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          metadata:
            annotations:
              sidecar.istio.io/inject: 'false'
spec:
containers:
- name: training-container
image: docker.io/kubeflowkatib/xgboost-lightgbm:1.0
ports:
- containerPort: 9991
name: xgboostjob-port
imagePullPolicy: Always
args:
- --job_type=Train
- --metric=binary_logloss,auc
- --learning_rate=${trialParameters.learningRate}
- --num_leaves=${trialParameters.numberLeaves}
- --num_trees=100
- --boosting_type=gbdt
- --objective=binary
- --metric_freq=1
- --is_training_metric=true
- --max_bin=255
- --data=data/binary.train
- --valid_data=data/binary.test
- --tree_learner=feature
- --feature_fraction=0.8
- --bagging_freq=5
- --bagging_fraction=0.8
- --min_data_in_leaf=50
- --min_sum_hessian_in_leaf=50
- --is_enable_sparse=true
- --use_two_round_loading=false
- --is_save_binary_file=false
pytorchjob
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  namespace: user01
  name: pytorchjob-mnist
spec:
  parallelTrialCount: 3
  maxTrialCount: 5
  maxFailedTrialCount: 3
  objective:
    type: minimize
    goal: 0.1
    objectiveMetricName: loss
  algorithm:
    algorithmName: random
  parameters:
  - name: momentum
    parameterType: double
    feasibleSpace:
      min: "0.5"
      max: "0.9"
  trialTemplate:
    primaryContainerName: pytorch
    trialParameters:
    - name: momentum
      description: Momentum for the training model
      reference: momentum
    trialSpec:
      apiVersion: kubeflow.org/v1
      kind: PyTorchJob
      spec:
        template:
          metadata:
            annotations:
              sidecar.istio.io/inject: 'false'
        pytorchReplicaSpecs:
          spec:
            containers:
@yadavvij Similar to TFJob, you should disable Istio sidecar injection via the annotation. For PyTorchJob:
trialSpec.spec.pytorchReplicaSpecs.Master.template.metadata.annotations
For XGBoost you use just a Kubernetes Job, and you set the annotation correctly. Did you see any errors?
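For the PyTorchJob case, the placement suggested above would look roughly like this (a trimmed sketch; replica count and image are placeholders):

```yaml
trialTemplate:
  trialSpec:
    apiVersion: kubeflow.org/v1
    kind: PyTorchJob
    spec:
      pytorchReplicaSpecs:
        Master:
          replicas: 1
          restartPolicy: OnFailure
          template:
            metadata:
              annotations:
                # Disable Istio sidecar injection for the training pods
                sidecar.istio.io/inject: "false"
            spec:
              containers:
                - name: pytorch
                  image: <your-training-image>
```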
@andreyvelich I tried creating the PyTorchJob with the below YAML after disabling Istio, but I am still facing the same issue. Please let me know if I am doing something wrong in the YAML. PyTorch yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  namespace: user01
  name: pytorchjob-mnist
spec:
  parallelTrialCount: 3
  maxTrialCount: 5
  maxFailedTrialCount: 3
  objective:
    type: minimize
    goal: 0.1
    objectiveMetricName: loss
  algorithm:
    algorithmName: random
  parameters:
@andreyvelich I got this error while creating the experiment with XGBoost as a normal Kubernetes Job.
@andreyvelich I closed it by mistake; it is still not completed.
@yadavvij Please can you show the Trial Template that you are trying to use in the UI?
@andreyvelich Please let me know what you mean by trial template. I am attaching some screenshots and the XGBoost job YAML.
xgboostjob
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  namespace: user01
  name: xgboost-job-lightgbm
spec:
  objective:
    type: maximize
    goal: 0.7
    objectiveMetricName: valid_1 auc
    additionalMetricNames:
@yadavvij How did you get this error message: https://github.com/kubeflow/katib/issues/2163#issuecomment-1622279900? Did you submit the Experiment YAML that you provided in the Katib UI by clicking edit and submit YAML?
@andreyvelich Yes, as mentioned in my issue, I am creating the experiment by clicking edit and submit YAML.
@yadavvij Can you show me the formatted YAML that you are trying to submit? (You can paste the formatted YAML with ```yaml.)
E.g.
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
namespace: kubeflow
name: random
spec:
objective:
type: maximize
goal: 0.99
objectiveMetricName: Validation-accuracy
additionalMetricNames:
- Train-accuracy
algorithm:
algorithmName: random
parallelTrialCount: 3
maxTrialCount: 12
maxFailedTrialCount: 3
parameters:
- name: lr
parameterType: double
feasibleSpace:
min: "0.01"
max: "0.03"
- name: num-layers
parameterType: int
feasibleSpace:
min: "2"
max: "5"
- name: optimizer
parameterType: categorical
feasibleSpace:
list:
- sgd
- adam
- ftrl
trialTemplate:
primaryContainerName: training-container
trialParameters:
- name: learningRate
description: Learning rate for the training model
reference: lr
- name: numberLayers
description: Number of training model layers
reference: num-layers
- name: optimizer
description: Training model optimizer (sdg, adam or ftrl)
reference: optimizer
trialSpec:
apiVersion: batch/v1
kind: Job
spec:
template:
metadata:
annotations:
sidecar.istio.io/inject: 'false'
spec:
containers:
- name: training-container
image: docker.io/kubeflowkatib/mxnet-mnist:latest
command:
- "python3"
- "/opt/mxnet-mnist/mnist.py"
- "--batch-size=64"
- "--lr=${trialParameters.learningRate}"
- "--num-layers=${trialParameters.numberLayers}"
- "--optimizer=${trialParameters.optimizer}"
resources:
limits:
memory: "1Gi"
cpu: "0.5"
restartPolicy: Never
Sure, I will paste it below. XGBoost job
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
namespace: user01
name: xgboost-job-lightgbm
spec:
objective:
type: maximize
goal: 0.7
objectiveMetricName: valid_1 auc
additionalMetricNames:
- valid_1 binary_logloss
- training auc
- training binary_logloss
metricsCollectorSpec:
source:
filter:
metricsFormat:
- "(\\w+\\s\\w+)\\s:\\s((-?\\d+)(\\.\\d+)?)"
algorithm:
algorithmName: random
parallelTrialCount: 2
maxTrialCount: 6
maxFailedTrialCount: 3
parameters:
- name: lr
parameterType: double
feasibleSpace:
min: "0.01"
max: "0.1"
- name: num-leaves
parameterType: int
feasibleSpace:
min: "50"
max: "60"
step: "1"
trialTemplate:
primaryContainerName: training-container
trialParameters:
- name: learningRate
description: Learning rate for the training model
reference: lr
- name: numberLeaves
description: Number of leaves for one tree
reference: num-leaves
trialSpec:
apiVersion: batch/v1
kind: Job
spec:
template:
metadata:
annotations:
sidecar.istio.io/inject: 'false'
spec:
containers:
- name: training-container
image: docker.io/kubeflowkatib/xgboost-lightgbm:1.0
ports:
- containerPort: 9991
name: xgboostjob-port
imagePullPolicy: Always
args:
- --job_type=Train
- --metric=binary_logloss,auc
- --learning_rate=${trialParameters.learningRate}
- --num_leaves=${trialParameters.numberLeaves}
- --num_trees=100
- --boosting_type=gbdt
- --objective=binary
- --metric_freq=1
- --is_training_metric=true
- --max_bin=255
- --data=data/binary.train
- --valid_data=data/binary.test
- --tree_learner=feature
- --feature_fraction=0.8
- --bagging_freq=5
- --bagging_fraction=0.8
- --min_data_in_leaf=50
- --min_sum_hessian_in_leaf=50
- --is_enable_sparse=true
- --use_two_round_loading=false
- --is_save_binary_file=false
pytorch job
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
namespace: user01
name: pytorchjob-mnist
spec:
parallelTrialCount: 3
maxTrialCount: 5
maxFailedTrialCount: 3
objective:
type: minimize
goal: 0.1
objectiveMetricName: loss
algorithm:
algorithmName: random
parameters:
- name: lr
parameterType: double
feasibleSpace:
min: "0.01"
max: "0.05"
- name: momentum
parameterType: double
feasibleSpace:
min: "0.5"
max: "0.9"
trialTemplate:
primaryContainerName: pytorch
primaryPodLabels:
training.kubeflow.org/replica-type: worker
trialParameters:
- name: learningRate
description: Learning rate for the training model
reference: lr
- name: momentum
description: Momentum for the training model
reference: momentum
trialSpec:
apiVersion: kubeflow.org/v1
kind: PyTorchJob
spec:
template:
metadata:
annotations:
sidecar.istio.io/inject: 'false'
pytorchReplicaSpecs:
Worker:
replicas: 2
restartPolicy: OnFailure
template:
metadata:
annotations:
sidecar.istio.io/inject: 'false'
spec:
containers:
- name: pytorch
image: docker.io/kubeflowkatib/pytorch-mnist-v0.14.0
command:
- "python3"
- "/opt/pytorch-mnist/mnist.py"
- "--epochs=1"
- "--batch-size=16"
- "--lr=${trialParameters.learningRate}"
- "--momentum=${trialParameters.momentum}"
@yadavvij In XGBoost you are missing indentation in trialSpec.spec.template.spec.containers
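For reference, with that indentation fixed the section would look roughly like this (trimmed; the remaining args are as in the full YAML above):

```yaml
trialSpec:
  apiVersion: batch/v1
  kind: Job
  spec:
    template:
      metadata:
        annotations:
          sidecar.istio.io/inject: "false"
      spec:
        containers:
          - name: training-container
            image: docker.io/kubeflowkatib/xgboost-lightgbm:1.0
            imagePullPolicy: Always
            args:
              - --job_type=Train
              # ...remaining LightGBM args, indented at this level...
```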
@andreyvelich I corrected the indentation as suggested above, but I am still getting the below error.
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
namespace: user01
name: xgboost-job-lightgbm
spec:
objective:
type: maximize
goal: 0.7
objectiveMetricName: valid_1 auc
additionalMetricNames:
- valid_1 binary_logloss
- training auc
- training binary_logloss
metricsCollectorSpec:
source:
filter:
metricsFormat:
- "(\\w+\\s\\w+)\\s:\\s((-?\\d+)(\\.\\d+)?)"
algorithm:
algorithmName: random
parallelTrialCount: 2
maxTrialCount: 6
maxFailedTrialCount: 3
parameters:
- name: lr
parameterType: double
feasibleSpace:
min: "0.01"
max: "0.1"
- name: num-leaves
parameterType: int
feasibleSpace:
min: "50"
max: "60"
step: "1"
trialTemplate:
primaryContainerName: training-container
trialParameters:
- name: learningRate
description: Learning rate for the training model
reference: lr
- name: numberLeaves
description: Number of leaves for one tree
reference: num-leaves
trialSpec:
apiVersion: batch/v1
kind: Job
spec:
template:
metadata:
annotations:
sidecar.istio.io/inject: 'false'
spec:
containers:
- name: training-container
image: docker.io/kubeflowkatib/xgboost-lightgbm:1.0
ports:
- containerPort: 9991
name: xgboostjob-port
imagePullPolicy: Always
args:
- --job_type=Train
- --metric=binary_logloss,auc
- --learning_rate=${trialParameters.learningRate}
- --num_leaves=${trialParameters.numberLeaves}
- --num_trees=100
- --boosting_type=gbdt
- --objective=binary
- --metric_freq=1
- --is_training_metric=true
- --max_bin=255
- --data=data/binary.train
- --valid_data=data/binary.test
- --tree_learner=feature
- --feature_fraction=0.8
- --bagging_freq=5
- --bagging_fraction=0.8
- --min_data_in_leaf=50
- --min_sum_hessian_in_leaf=50
- --is_enable_sparse=true
- --use_two_round_loading=false
- --is_save_binary_file=false
@yadavvij I think imagePullPolicy and args still have incorrect indentation.
@andreyvelich I corrected the indentation and the experiment got created, but I am still not able to get a successful experiment: "Couldn't find any successful Trial."
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
namespace: user01
name: xgboost-job-lightgbm
spec:
objective:
type: maximize
goal: 0.7
objectiveMetricName: valid_1 auc
additionalMetricNames:
- valid_1 binary_logloss
- training auc
- training binary_logloss
metricsCollectorSpec:
source:
filter:
metricsFormat:
- "(\\w+\\s\\w+)\\s:\\s((-?\\d+)(\\.\\d+)?)"
algorithm:
algorithmName: random
parallelTrialCount: 3
maxTrialCount: 12
maxFailedTrialCount: 3
parameters:
- name: lr
parameterType: double
feasibleSpace:
min: "0.01"
max: "0.1"
- name: num-leaves
parameterType: int
feasibleSpace:
min: "50"
max: "60"
step: "1"
trialTemplate:
primaryContainerName: training-container
trialParameters:
- name: learningRate
description: Learning rate for the training model
reference: lr
- name: numberLeaves
description: Number of leaves for one tree
reference: num-leaves
trialSpec:
apiVersion: batch/v1
kind: Job
spec:
template:
metadata:
annotations:
sidecar.istio.io/inject: 'false'
spec:
containers:
- args:
- --job_type=Train
- --metric=binary_logloss,auc
- --learning_rate=${trialParameters.learningRate}
- --num_leaves=${trialParameters.numberLeaves}
- --num_trees=100
- --boosting_type=gbdt
- --objective=binary
- --metric_freq=1
- --is_training_metric=true
- --max_bin=255
- --data=data/binary.train
- --valid_data=data/binary.test
- --tree_learner=feature
- --feature_fraction=0.8
- --bagging_freq=5
- --bagging_fraction=0.8
- --min_data_in_leaf=50
- --min_sum_hessian_in_leaf=50
- --is_enable_sparse=true
- --use_two_round_loading=false
- --is_save_binary_file=false
image: docker.io/kubeflowkatib/xgboost-lightgbm:1.0
imagePullPolicy: Always
name: xgboost
ports:
- containerPort: 9991
name: xgboostjob-port
protocol: TCP
"Couldn't find any successful Trial."
I think you are also missing restartPolicy for your Trial Job. For a Kubernetes batch Job it is necessary to set this value.
Please set the restart policy, similar to this example: https://github.com/kubeflow/katib/blob/master/examples/v1beta1/hp-tuning/hyperband.yaml#L81C13-L81C33
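For a batch/v1 Job trial, the policy sits at the pod-template level, alongside containers (a trimmed sketch of the manifest discussed above):

```yaml
trialSpec:
  apiVersion: batch/v1
  kind: Job
  spec:
    template:
      spec:
        containers:
          - name: training-container
            image: docker.io/kubeflowkatib/xgboost-lightgbm:1.0
        # Required for batch Jobs: pods must not restart in place
        restartPolicy: Never
```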
@andreyvelich The above YAML for XGBoost runs fine, but when I try to run it with kind: XGBoostJob instead of Job, it gives the same error: "Couldn't find any successful Trial". The same issue occurs with PyTorchJob. Why doesn't it run with kind XGBoostJob or PyTorchJob?
YAML for PyTorchJob
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
namespace: user01
name: pytorchjob-mnist
spec:
parallelTrialCount: 3
maxTrialCount: 5
maxFailedTrialCount: 3
objective:
type: minimize
goal: 0.1
objectiveMetricName: loss
algorithm:
algorithmName: random
parameters:
- name: lr
parameterType: double
feasibleSpace:
min: "0.01"
max: "0.05"
- name: momentum
parameterType: double
feasibleSpace:
min: "0.5"
max: "0.9"
trialTemplate:
primaryContainerName: pytorch
primaryPodLabels:
training.kubeflow.org/replica-type: worker
trialParameters:
- name: learningRate
description: Learning rate for the training model
reference: lr
- name: momentum
description: Momentum for the training model
reference: momentum
trialSpec:
apiVersion: kubeflow.org/v1
kind: PyTorchJob
spec:
template:
metadata:
annotations:
sidecar.istio.io/inject: 'false'
pytorchReplicaSpecs:
Worker:
replicas: 2
restartPolicy: OnFailure
template:
metadata:
annotations:
sidecar.istio.io/inject: 'false'
spec:
containers:
- name: pytorch
image: docker.io/kubeflowkatib/pytorch-mnist-v0.14.0
command:
- "python3"
- "/opt/pytorch-mnist/mnist.py"
- "--epochs=1"
- "--batch-size=16"
- "--lr=${trialParameters.learningRate}"
- "--momentum=${trialParameters.momentum}"
@yadavvij I think you also set incorrect YAML for PyTorchJob, here: trialSpec.spec.template. The PyTorchJob doesn't have such an API.
The Istio annotation should be in only one place: trialSpec.spec.pytorchReplicaSpecs.Worker.template.metadata.annotations
Please refer to this example for how to set PyTorchJob correct: https://github.com/kubeflow/katib/blob/master/examples/v1beta1/kubeflow-training-operator/pytorchjob-mnist.yaml#L38-L71
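In other words, the trialSpec should have no top-level spec.template block at all; only the replica spec's pod template carries the annotation. A trimmed sketch:

```yaml
trialSpec:
  apiVersion: kubeflow.org/v1
  kind: PyTorchJob
  spec:
    pytorchReplicaSpecs:
      Worker:
        replicas: 2
        restartPolicy: OnFailure
        template:
          metadata:
            annotations:
              # The only place the Istio annotation is needed
              sidecar.istio.io/inject: "false"
          spec:
            containers:
              - name: pytorch
                image: docker.io/kubeflowkatib/pytorch-mnist-v0.14.0
```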
Hi @yadavvij, any success with modifying the annotation and API spec ?
Hello @andreyvelich, I solved the NotReady issue by force-disabling the istio-proxy sidecar. I think this is not best practice, though, because from the experiment job pod I can connect to services in other namespaces, which makes multi-tenancy meaningless. Do you think it is possible to fix the traffic-forwarding issue in the near future?
Also, the experiment deployments (suggestion/early-stopping), the Katib DB manager, and Katib MySQL run without the protection of Istio sidecar mTLS communication. Do you think this could become a severe security issue? Thanks a lot in advance.
Do you think it is possible to fix the traffic-forwarding issue in the near future?
Katib doesn't block traffic for your Trials. If you create just a PyTorchJob with some test hyperparameters, it still fails, because the docker.io/kubeflowkatib/pytorch-mnist-v0.14.0 image downloads the MNIST dataset from the internet.
If you set up the Istio proxy to allow external access, or build a Docker image with the dataset pre-loaded, you can make it work with the Istio sidecar.
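One way to allow such external access through the mesh is an Istio ServiceEntry. This is only a sketch: the host name is a placeholder, and the exact host the training image contacts would need to be determined from the image itself:

```yaml
# Hypothetical ServiceEntry letting trial pods in user01 reach an
# external dataset host while keeping the Istio sidecar enabled.
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: allow-dataset-download
  namespace: user01
spec:
  hosts:
    - <dataset-host.example.com>   # placeholder, not a real host
  location: MESH_EXTERNAL
  ports:
    - number: 443
      name: https
      protocol: TLS
  resolution: DNS
```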
Also, the experiment deployments (suggestion/early-stopping), the Katib DB manager, and Katib MySQL are without the protection of Istio sidecar mTLS communication; do you think it will become a severe security issue?
What kind of security issues do you see here? We discussed previously that the Katib DB Manager currently exposes a gRPC API to report/get metrics for Trials: https://github.com/kubeflow/katib/issues/2022#issuecomment-1320200136. That API can be used if you have access to your Kubernetes cluster. Similarly, the Suggestion Deployment exposes an API to get hyperparameters from the Algorithm Service. Is there something specific that concerns you?
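For reference, if one wanted to enforce mTLS for a namespace anyway, Istio's PeerAuthentication resource is the standard mechanism. This is a generic sketch, not a tested Katib configuration; whether it would break the metrics-collector and suggestion gRPC traffic discussed above would need verification:

```yaml
# Sketch: require strict mTLS for all workloads in the kubeflow namespace.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: kubeflow
spec:
  mtls:
    mode: STRICT
```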
Hi @yadavvij, any success with modifying the annotation and API spec ?
Yes, I was able to successfully create the experiments with the above suggestions. Thank you for all the help @andreyvelich :) .
Hi @andreyvelich, can you please suggest an example repo for deploying an application to a Kubernetes cluster through GitHub Actions runners? Please share a link if one is available. It would be a great help if I could get the YAML configuration for the same.
@yadavvij You can take a look at self-hosted runners: https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/about-self-hosted-runners. You can configure your GitHub Actions to deploy a control plane on an existing Kubernetes cluster that your runner is connected to.
For our Katib E2Es, we use minikube to deploy the Katib Control Plane. Then we run the Katib Experiment on that cluster.
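A minimal workflow along those lines could look like the following sketch. The action names/versions and the manifests path are assumptions for illustration, not from this thread; adjust them to your repository:

```yaml
# Hypothetical GitHub Actions workflow: start a minikube cluster on the
# runner, then apply application manifests to it.
name: deploy-to-minikube
on: [push]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Start minikube
        uses: medyagh/setup-minikube@v2
      - name: Deploy manifests
        run: kubectl apply -f manifests/
      - name: Wait for rollout
        run: kubectl wait --for=condition=available --timeout=300s deployment --all
```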
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
/kind bug
What steps did you take and what happened:
The following logs are seen from the Katib controller:
{"level":"info","ts":1686824428.8087966,"logger":"suggestion-controller","msg":"Update suggestion instance status failed, reconciler requeued","Suggestion":"user01/tfjob-mnist-example","err":"Operation cannot be fulfilled on suggestions.kubeflow.org \"tfjob-mnist-example\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":1686824442.3886456,"logger":"suggestion-client","msg":"Algorithm settings are validated","Suggestion":"user01/tfjob-mnist-example"}
{"level":"info","ts":1686824442.3887098,"logger":"suggestion-controller","msg":"Sync assignments","Suggestion":"user01/tfjob-mnist-example","Suggestion Requests":3,"Suggestion Count":0}
{"level":"info","ts":1686824442.3950624,"logger":"suggestion-client","msg":"Getting suggestions","Suggestion":"user01/tfjob-mnist-example","endpoint":"tfjob-mnist-example-random.user01:6789","Number of current request parameters":3,"Number of response parameters":3}
{"level":"info","ts":1686824442.4057593,"logger":"experiment-controller","msg":"Statistics","Experiment":"user01/tfjob-mnist-example","requiredActiveCount":3,"parallelCount":3,"activeCount":0,"completedCount":0}
{"level":"info","ts":1686824442.4057767,"logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":"user01/tfjob-mnist-example","addCount":3}
{"level":"info","ts":1686824442.405782,"logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":"user01/tfjob-mnist-example","name":"tfjob-mnist-example","Suggestion Requests":3}
{"level":"info","ts":1686824442.4058278,"logger":"suggestion-controller","msg":"Sync assignments","Suggestion":"user01/tfjob-mnist-example","Suggestion Requests":3,"Suggestion Count":3}
{"level":"info","ts":1686824442.427896,"logger":"experiment-controller","msg":"Created Trials","Experiment":"user01/tfjob-mnist-example","trialNames":["tfjob-mnist-example-2ff4vxph","tfjob-mnist-example-dcdlnmwl","tfjob-mnist-example-6rfqc8lm"]}
{"level":"info","ts":1686824442.4427319,"logger":"experiment-controller","msg":"Update experiment instance status failed, reconciler requeued","Experiment":"user01/tfjob-mnist-example","err":"Operation cannot be fulfilled on experiments.kubeflow.org \"tfjob-mnist-example\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":1686824442.4563751,"logger":"trial-controller","msg":"Creating Job","Trial":"user01/tfjob-mnist-example-2ff4vxph","kind":"TFJob","name":"tfjob-mnist-example-2ff4vxph"}
{"level":"info","ts":1686824442.462671,"logger":"trial-controller","msg":"Trial status changed to Running","Trial":"user01/tfjob-mnist-example-2ff4vxph"}
{"level":"info","ts":1686824442.4865818,"logger":"experiment-controller","msg":"Update experiment instance status failed, reconciler requeued","Experiment":"user01/tfjob-mnist-example","err":"Operation cannot be fulfilled on experiments.kubeflow.org \"tfjob-mnist-example\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":1686824442.4867158,"logger":"trial-controller","msg":"Creating Job","Trial":"user01/tfjob-mnist-example-dcdlnmwl","kind":"TFJob","name":"tfjob-mnist-example-dcdlnmwl"}
{"level":"info","ts":1686824442.4927397,"logger":"trial-controller","msg":"Trial status changed to Running","Trial":"user01/tfjob-mnist-example-dcdlnmwl"}
{"level":"info","ts":1686824442.5024211,"logger":"experiment-controller","msg":"Update experiment instance status failed, reconciler requeued","Experiment":"user01/tfjob-mnist-example","err":"Operation cannot be fulfilled on experiments.kubeflow.org \"tfjob-mnist-example\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":1686824442.51115,"logger":"trial-controller","msg":"Creating Job","Trial":"user01/tfjob-mnist-example-6rfqc8lm","kind":"TFJob","name":"tfjob-mnist-example-6rfqc8lm"}
{"level":"info","ts":1686824442.5161304,"logger":"experiment-controller","msg":"Update experiment instance status failed, reconciler requeued","Experiment":"user01/tfjob-mnist-example","err":"Operation cannot be fulfilled on experiments.kubeflow.org \"tfjob-mnist-example\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":1686824442.5200617,"logger":"trial-controller","msg":"Trial status changed to Running","Trial":"user01/tfjob-mnist-example-6rfqc8lm"}
{"level":"info","ts":1686824442.5349317,"logger":"experiment-controller","msg":"Update experiment instance status failed, reconciler requeued","Experiment":"user01/tfjob-mnist-example","err":"Operation cannot be fulfilled on experiments.kubeflow.org \"tfjob-mnist-example\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":1686824444.4125206,"logger":"injector-webhook","msg":"Inject metrics collector sidecar container","Pod Name":"tfjob-mnist-example-2ff4vxph-worker-0","Trial":"tfjob-mnist-example-2ff4vxph"}
{"level":"info","ts":1686824446.3909032,"logger":"injector-webhook","msg":"Inject metrics collector sidecar container","Pod Name":"tfjob-mnist-example-2ff4vxph-worker-1","Trial":"tfjob-mnist-example-2ff4vxph"}
{"level":"info","ts":1686824448.3626342,"logger":"injector-webhook","msg":"Inject metrics collector sidecar container","Pod Name":"tfjob-mnist-example-dcdlnmwl-worker-0","Trial":"tfjob-mnist-example-dcdlnmwl"}
{"level":"info","ts":1686824450.3192122,"logger":"injector-webhook","msg":"Inject metrics collector sidecar container","Pod Name":"tfjob-mnist-example-dcdlnmwl-worker-1","Trial":"tfjob-mnist-example-dcdlnmwl"}
{"level":"info","ts":1686824452.2932706,"logger":"injector-webhook","msg":"Inject metrics collector sidecar container","Pod Name":"tfjob-mnist-example-6rfqc8lm-worker-0","Trial":"tfjob-mnist-example-6rfqc8lm"}
{"level":"info","ts":1686824454.251165,"logger":"injector-webhook","msg":"Inject metrics collector sidecar container","Pod Name":"tfjob-mnist-example-6rfqc8lm-worker-1","Trial":"tfjob-mnist-example-6rfqc8lm"}
YAML file used to create this experiment:
apiVersion: kubeflow.org/v1
kind: Experiment
metadata:
  namespace: user01
  name: tfjob-mnist-example
spec:
  parallelTrialCount: 3
  maxTrialCount: 8
  maxFailedTrialCount: 3
  objective:
    type: maximize
    goal: 0.7
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: random
  parameters:
  trialTemplate:
    # In this example we can collect metrics only from the Worker pods.
    primaryPodLabels:
      training.kubeflow.org/replica-type: worker
    trialParameters:
    trialSpec:
      spec:
        tfReplicaSpecs:
          Worker:
            replicas: 2
            restartPolicy: OnFailure
            template:
              spec:
                containers:
Environment:
Impacted by this bug? Give it a 👍 We prioritize the issues with the most 👍