Closed farisfirenze closed 1 year ago
The problem is with the image that you have created. It is not with Katib. Did you use GPU drivers in the image?
I am able to execute "nvidia-smi" in the image and get the correct output. For this to happen, shouldn't the drivers be installed in the image? Just to be sure, can you provide me with details on how to use GPU drivers in the image?
You can use Nvidia NGC containers based on your framework https://catalog.ngc.nvidia.com/containers
I have tried using the Nvidia NGC containers, as shown in the Dockerfile below:
FROM nvcr.io/nvidia/tensorflow:22.06-tf2-py3
RUN mkdir -p /opt/trainer
RUN pip show tensorflow
RUN pip install pandas
RUN pip install scikit-learn
RUN pip install google-cloud-storage
# CREDENTIAL Authentication
COPY /prj-vertex-ai-2c390f7e8fec.json /prj-vertex-ai-2c390f7e8fec.json
ENV GOOGLE_APPLICATION_CREDENTIALS="/prj-vertex-ai-2c390f7e8fec.json"
COPY *.py /opt/trainer/
ENTRYPOINT ["python", "/opt/trainer/task.py"]
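As a sanity check beyond nvidia-smi, something like the following could be run inside the built image to confirm that TensorFlow itself sees the GPU (a minimal sketch; visible_gpus is just an illustrative helper name, not part of the image above):

```python
# Hypothetical sanity check to run inside the built image: confirms that
# TensorFlow can see the GPU, which is a stronger test than nvidia-smi alone.
def visible_gpus():
    """Return the GPU devices TensorFlow can see ([] if TF is absent)."""
    try:
        import tensorflow as tf  # provided by the NGC base image
    except ImportError:
        return []
    return tf.config.list_physical_devices("GPU")

if __name__ == "__main__":
    gpus = visible_gpus()
    print(f"TensorFlow sees {len(gpus)} GPU(s): {gpus}")
```

If this prints an empty list while nvidia-smi works, the problem is usually scheduling (no GPU allocated to the pod) rather than the image itself.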
PS: I have pulled the same image into both containers of my pipeline, but I am still getting this problem.
Also, I have a question. I am setting the GPU limit on my pipeline component using .set_gpu_limit(1), as shown below.
hp_tune = dsl.ContainerOp(
    name='hp-tune-katib',
    image=hyper_image_uri,
    command=["python3", "/hp_tune/task.py"],
    arguments=[
        '--experiment_name', experiment_name,
        '--experiment_namespace', experiment_namespace,
        '--experiment_timeout_minutes', experiment_timeout_minutes,
        '--delete_after_done', True,
        '--hyper_image_uri', hyper_image_uri_train,
        '--time_loc', time_loc,
        '--model_uri', model_uri
    ],
    file_outputs={'best-params': '/output.txt'}
).set_gpu_limit(1)
and the ARGO_CONTAINER is showing nvidia.com/gpu: 1.
So my question is: do I need to specify a GPU request in my Katib trial spec as well, like below?
trial_spec = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "spec": {
        "tfReplicaSpecs": {
            "PS": {
                "replicas": 1,
                "restartPolicy": "Never",
                "template": {
                    "metadata": {
                        "annotations": {
                            "sidecar.istio.io/inject": "false"
                        }
                    },
                    "spec": {
                        "containers": [
                            {
                                "name": "tensorflow",
                                "image": args.hyper_image_uri,
                                "command": [
                                    "python",
                                    "/opt/trainer/task.py",
                                    "--model_uri=" + args.model_uri,
                                    "--batch_size=${trialParameters.batchSize}",
                                    "--learning_rate=${trialParameters.learningRate}"
                                ],
                                "ports": [
                                    {
                                        "containerPort": 2222,
                                        "name": "tfjob-port"
                                    }
                                ]
                            }
                        ]
                    }
                }
            },
            "Worker": {
                "replicas": 1,
                "restartPolicy": "Never",
                "template": {
                    "metadata": {
                        "annotations": {
                            "sidecar.istio.io/inject": "false"
                        }
                    },
                    "spec": {
                        "containers": [
                            {
                                "name": "tensorflow",
                                "image": args.hyper_image_uri,
                                "command": [
                                    "python",
                                    "/opt/trainer/task.py",
                                    "--model_uri=" + args.model_uri,
                                    "--batch_size=${trialParameters.batchSize}",
                                    "--learning_rate=${trialParameters.learningRate}"
                                ],
                                "ports": [
                                    {
                                        "containerPort": 2222,
                                        "name": "tfjob-port"
                                    }
                                ],
                                "resources": {
                                    "limits": {
                                        "nvidia.com/gpu": 1
                                    }
                                }
                            }
                        ]
                    }
                }
            }
        }
    }
}
Also, I kindly request you to help me solve this GPU usage problem.
I haven't tried a GPU limit with Pipelines.
The easiest way is to check the experiment YAML using kubectl. The trial spec needs a GPU limit if the trial pod needs to access a GPU.
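If the trial spec is built as a Python dict, patching in the GPU limit can be done programmatically. A minimal sketch (add_gpu_limit is an illustrative helper name, not a Katib API):

```python
# Sketch: add an nvidia.com/gpu limit to every container of a given TFJob
# replica in a trial-spec dict, mirroring the "resources.limits" block above.
def add_gpu_limit(trial_spec, replica="Worker", count=1):
    """Add an nvidia.com/gpu limit to each container of the given replica."""
    containers = (trial_spec["spec"]["tfReplicaSpecs"][replica]
                  ["template"]["spec"]["containers"])
    for c in containers:
        limits = c.setdefault("resources", {}).setdefault("limits", {})
        limits["nvidia.com/gpu"] = count
    return trial_spec

# Skeleton spec for illustration (real specs carry image, command, etc.).
spec = {"spec": {"tfReplicaSpecs": {"Worker": {"template": {"spec": {
    "containers": [{"name": "tensorflow"}]}}}}}}
add_gpu_limit(spec)
```

After the call, the worker container carries {"nvidia.com/gpu": 1} under resources.limits, which is what the scheduler uses to place the trial pod on a GPU node.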
This is what happens when I specify GPU request in the trial spec but not in the pipeline component.
This step is in Pending state with this message: Unschedulable: 0/2 nodes are available: 1 Insufficient cpu, 1 Insufficient memory, 1 node(s) had taint {nvidia.com/gpu: present}, that the pod didn't tolerate.
This is my kubectl describe node
(base) jupyter@tensorflow-2-6-new:~/katib/dbpedia$ kubectl describe node gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j
Name: gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=n1-highmem-8
beta.kubernetes.io/os=linux
cloud.google.com/gke-accelerator=nvidia-tesla-k80
cloud.google.com/gke-boot-disk=pd-standard
cloud.google.com/gke-container-runtime=containerd
cloud.google.com/gke-cpu-scaling-level=8
cloud.google.com/gke-max-pods-per-node=110
cloud.google.com/gke-nodepool=gpu-pool1
cloud.google.com/gke-os-distribution=cos
cloud.google.com/machine-family=n1
failure-domain.beta.kubernetes.io/region=us-central1
failure-domain.beta.kubernetes.io/zone=us-central1-a
kubernetes.io/arch=amd64
kubernetes.io/hostname=gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j
kubernetes.io/os=linux
node.kubernetes.io/instance-type=n1-highmem-8
topology.gke.io/zone=us-central1-a
topology.kubernetes.io/region=us-central1
topology.kubernetes.io/zone=us-central1-a
Annotations: container.googleapis.com/instance_id: 609271750101604849
csi.volume.kubernetes.io/nodeid:
{"pd.csi.storage.gke.io":"projects/prj-vertex-ai/zones/us-central1-a/instances/gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j"}
node.alpha.kubernetes.io/ttl: 0
node.gke.io/last-applied-node-labels:
cloud.google.com/gke-accelerator=nvidia-tesla-k80,cloud.google.com/gke-boot-disk=pd-standard,cloud.google.com/gke-container-runtime=contai...
node.gke.io/last-applied-node-taints: nvidia.com/gpu=present:NoSchedule
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Fri, 15 Jul 2022 08:37:52 +0000
Taints: nvidia.com/gpu=present:NoSchedule
Unschedulable: false
Lease:
HolderIdentity: gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j
AcquireTime: <unset>
RenewTime: Fri, 15 Jul 2022 08:52:28 +0000
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
CorruptDockerOverlay2 False Fri, 15 Jul 2022 08:48:00 +0000 Fri, 15 Jul 2022 08:37:57 +0000 NoCorruptDockerOverlay2 docker overlay2 is functioning properly
FrequentUnregisterNetDevice False Fri, 15 Jul 2022 08:48:00 +0000 Fri, 15 Jul 2022 08:37:57 +0000 NoFrequentUnregisterNetDevice node is functioning properly
FrequentKubeletRestart False Fri, 15 Jul 2022 08:48:00 +0000 Fri, 15 Jul 2022 08:37:57 +0000 NoFrequentKubeletRestart kubelet is functioning properly
FrequentDockerRestart False Fri, 15 Jul 2022 08:48:00 +0000 Fri, 15 Jul 2022 08:37:57 +0000 NoFrequentDockerRestart docker is functioning properly
FrequentContainerdRestart False Fri, 15 Jul 2022 08:48:00 +0000 Fri, 15 Jul 2022 08:37:57 +0000 NoFrequentContainerdRestart containerd is functioning properly
KernelDeadlock False Fri, 15 Jul 2022 08:48:00 +0000 Fri, 15 Jul 2022 08:37:57 +0000 KernelHasNoDeadlock kernel has no deadlock
ReadonlyFilesystem False Fri, 15 Jul 2022 08:48:00 +0000 Fri, 15 Jul 2022 08:37:57 +0000 FilesystemIsNotReadOnly Filesystem is not read-only
NetworkUnavailable False Fri, 15 Jul 2022 08:37:52 +0000 Fri, 15 Jul 2022 08:37:52 +0000 RouteCreated NodeController create implicit route
MemoryPressure False Fri, 15 Jul 2022 08:49:24 +0000 Fri, 15 Jul 2022 08:37:49 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 15 Jul 2022 08:49:24 +0000 Fri, 15 Jul 2022 08:37:49 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Fri, 15 Jul 2022 08:49:24 +0000 Fri, 15 Jul 2022 08:37:49 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Fri, 15 Jul 2022 08:49:24 +0000 Fri, 15 Jul 2022 08:37:52 +0000 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 10.128.0.14
ExternalIP: 34.171.4.196
InternalDNS: gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j.us-central1-a.c.prj-vertex-ai.internal
Hostname: gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j.us-central1-a.c.prj-vertex-ai.internal
Capacity:
attachable-volumes-gce-pd: 127
cpu: 8
ephemeral-storage: 98868448Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 53477620Ki
nvidia.com/gpu: 1
pods: 110
Allocatable:
attachable-volumes-gce-pd: 127
cpu: 7910m
ephemeral-storage: 47093746742
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 48425204Ki
nvidia.com/gpu: 1
pods: 110
System Info:
Machine ID: 27109359572b62f3c535daadb9e9c398
System UUID: 27109359-572b-62f3-c535-daadb9e9c398
Boot ID: cb1e0e37-2556-4f81-b0a8-b93a5105f484
Kernel Version: 5.10.90+
OS Image: Container-Optimized OS from Google
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.5.4
Kubelet Version: v1.22.8-gke.202
Kube-Proxy Version: v1.22.8-gke.202
PodCIDR: 10.8.1.0/24
PodCIDRs: 10.8.1.0/24
ProviderID: gce://prj-vertex-ai/us-central1-a/gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j
Non-terminated Pods: (6 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system fluentbit-gke-kjmds 100m (1%) 0 (0%) 200Mi (0%) 500Mi (1%) 14m
kube-system gke-metrics-agent-zqm94 3m (0%) 0 (0%) 50Mi (0%) 50Mi (0%) 14m
kube-system kube-proxy-gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j 100m (1%) 0 (0%) 0 (0%) 0 (0%) 14m
kube-system nvidia-driver-installer-hw2lx 150m (1%) 0 (0%) 0 (0%) 0 (0%) 14m
kube-system nvidia-gpu-device-plugin-ln587 50m (0%) 0 (0%) 50Mi (0%) 50Mi (0%) 14m
kube-system pdcsi-node-2nlmc 10m (0%) 0 (0%) 20Mi (0%) 100Mi (0%) 14m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 413m (5%) 0 (0%)
memory 320Mi (0%) 700Mi (1%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
attachable-volumes-gce-pd 0 0
nvidia.com/gpu 0 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 14m kube-proxy
Normal Starting 14m kubelet Starting kubelet.
Normal NodeHasSufficientMemory 14m (x4 over 14m) kubelet Node gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 14m (x4 over 14m) kubelet Node gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 14m (x4 over 14m) kubelet Node gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 14m kubelet Updated Node Allocatable limit across pods
Warning InvalidDiskCapacity 14m kubelet invalid capacity 0 on image filesystem
Normal NodeReady 14m kubelet Node gke-kubeflow-pipelines-stan-gpu-pool1-a65c281b-4r3j status is now: NodeReady
Warning ContainerdStart 14m (x2 over 14m) systemd-monitor Starting containerd container runtime...
Warning DockerStart 14m (x3 over 14m) systemd-monitor Starting Docker Application Container Engine...
Warning KubeletStart 14m systemd-monitor Started Kubernetes kubelet.
Any idea how I can add a toleration for this taint and make the pod allocate a GPU?
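For what it's worth, GKE typically adds this toleration automatically for pods that request nvidia.com/gpu in resources.limits. If it does need to be added by hand in a dict-based spec, a minimal sketch (tolerate_gpu_taint is an illustrative name):

```python
# Sketch: append a toleration for the nvidia.com/gpu=present:NoSchedule taint
# to a pod-spec dict, so the trial pod can land on the tainted GPU node.
def tolerate_gpu_taint(pod_spec):
    """Add a toleration matching the nvidia.com/gpu:NoSchedule taint."""
    pod_spec.setdefault("tolerations", []).append({
        "key": "nvidia.com/gpu",
        "operator": "Exists",
        "effect": "NoSchedule",
    })
    return pod_spec

# Skeleton pod spec for illustration.
pod_spec = {"containers": [{"name": "tensorflow"}]}
tolerate_gpu_taint(pod_spec)
```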
This is my pod yaml
(base) jupyter@tensorflow-2-6-new:~/katib/dbpedia/hp_tune$ kubectl get pod dbpedia-exp-8-g4pvh4fc-worker-0 -o yaml -n kubeflow
apiVersion: v1
items:
- apiVersion: v1
kind: Pod
metadata:
annotations:
sidecar.istio.io/inject: "false"
creationTimestamp: "2022-07-15T09:57:26Z"
labels:
group-name: kubeflow.org
job-name: dbpedia-exp-8-g4pvh4fc
replica-index: "0"
replica-type: worker
training.kubeflow.org/job-name: dbpedia-exp-8-g4pvh4fc
training.kubeflow.org/job-role: master
training.kubeflow.org/operator-name: tfjob-controller
training.kubeflow.org/replica-index: "0"
training.kubeflow.org/replica-type: worker
name: dbpedia-exp-8-g4pvh4fc-worker-0
namespace: kubeflow
ownerReferences:
- apiVersion: kubeflow.org/v1
blockOwnerDeletion: true
controller: true
kind: TFJob
name: dbpedia-exp-8-g4pvh4fc
uid: 7401591a-e7f3-4036-823e-b63437fed795
resourceVersion: "39305"
uid: 5b974f29-4379-41ff-90dd-b51c6d04d189
spec:
containers:
- args:
- python /opt/trainer/task.py --model_uri=gs://faris_bucket_us_central/Pipeline_data/dbpedia_hyper_models/GKE_Katib/
--batch_size=32 --learning_rate=0.004570666890885507 1>/var/log/katib/metrics.log
2>&1 && echo completed > /var/log/katib/$$$$.pid
command:
- sh
- -c
env:
- name: TF_CONFIG
value: '{"cluster":{"ps":["dbpedia-exp-8-g4pvh4fc-ps-0.kubeflow.svc:2222"],"worker":["dbpedia-exp-8-g4pvh4fc-worker-0.kubeflow.svc:2222"]},"task":{"type":"worker","index":0},"environment":"cloud"}'
image: gcr.io/........./hptunekatib:v14
imagePullPolicy: IfNotPresent
name: tensorflow
ports:
- containerPort: 2222
name: tfjob-port
protocol: TCP
resources:
limits:
nvidia.com/gpu: "1"
requests:
nvidia.com/gpu: "1"
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-xvtgc
readOnly: true
- mountPath: /var/log/katib
name: metrics-volume
- args:
- -t
- dbpedia-exp-8-g4pvh4fc
- -m
- accuracy
- -o-type
- maximize
- -s-db
- katib-db-manager.kubeflow:6789
- -path
- /var/log/katib/metrics.log
image: docker.io/kubeflowkatib/file-metrics-collector:v0.13.0
imagePullPolicy: IfNotPresent
name: metrics-logger-and-collector
resources:
limits:
cpu: 500m
ephemeral-storage: 5Gi
memory: 100Mi
requests:
cpu: 50m
ephemeral-storage: 500Mi
memory: 10Mi
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/log/katib
name: metrics-volume
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-xvtgc
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
preemptionPolicy: PreemptLowerPriority
priority: 0
restartPolicy: Never
schedulerName: default-scheduler
securityContext: {}
serviceAccount: default
serviceAccountName: default
shareProcessNamespace: true
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoSchedule
key: example-key
operator: Exists
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
- effect: NoSchedule
key: nvidia.com/gpu
operator: Exists
volumes:
- name: kube-api-access-xvtgc
projected:
defaultMode: 420
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
items:
- key: ca.crt
path: ca.crt
name: kube-root-ca.crt
- downwardAPI:
items:
- fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
- emptyDir: {}
name: metrics-volume
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2022-07-15T09:57:26Z"
message: '0/2 nodes are available: 2 Insufficient nvidia.com/gpu.'
reason: Unschedulable
status: "False"
type: PodScheduled
phase: Pending
qosClass: Burstable
and this is my katib experiment yaml
(base) jupyter@tensorflow-2-6-new:~/katib/dbpedia/hp_tune$ kubectl get experiment dbpedia-exp-8 -o yaml -n kubeflow
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
creationTimestamp: "2022-07-15T09:57:05Z"
finalizers:
- update-prometheus-metrics
generation: 1
name: dbpedia-exp-8
namespace: kubeflow
resourceVersion: "39293"
uid: ded49060-e00e-4b57-8fd1-f40af2ec162e
spec:
algorithm:
algorithmName: random
maxFailedTrialCount: 2
maxTrialCount: 2
metricsCollectorSpec:
collector:
kind: StdOut
objective:
metricStrategies:
- name: accuracy
value: max
objectiveMetricName: accuracy
type: maximize
parallelTrialCount: 1
parameters:
- feasibleSpace:
list:
- "32"
- "42"
- "52"
- "62"
- "64"
name: batch_size
parameterType: discrete
- feasibleSpace:
max: "0.005"
min: "0.001"
name: learning_rate
parameterType: double
resumePolicy: LongRunning
trialTemplate:
failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
primaryContainerName: tensorflow
primaryPodLabels:
training.kubeflow.org/job-role: master
successCondition: status.conditions.#(type=="Succeeded")#|#(status=="True")#
trialParameters:
- description: batch size
name: batchSize
reference: batch_size
- description: Learning rate
name: learningRate
reference: learning_rate
trialSpec:
apiVersion: kubeflow.org/v1
kind: TFJob
spec:
tfReplicaSpecs:
PS:
replicas: 1
restartPolicy: Never
template:
metadata:
annotations:
sidecar.istio.io/inject: "false"
spec:
containers:
- command:
- python
- /opt/trainer/task.py
- --model_uri=gs://faris_bucket_us_central/Pipeline_data/dbpedia_hyper_models/GKE_Katib/
- --batch_size=${trialParameters.batchSize}
- --learning_rate=${trialParameters.learningRate}
image: gcr.io/............/hptunekatib:v14
name: tensorflow
ports:
- containerPort: 2222
name: tfjob-port
Worker:
replicas: 1
restartPolicy: Never
template:
metadata:
annotations:
sidecar.istio.io/inject: "false"
spec:
containers:
- command:
- python
- /opt/trainer/task.py
- --model_uri=gs://faris_bucket_us_central/Pipeline_data/dbpedia_hyper_models/GKE_Katib/
- --batch_size=${trialParameters.batchSize}
- --learning_rate=${trialParameters.learningRate}
image: gcr.io/........./hptunekatib:v14
name: tensorflow
ports:
- containerPort: 2222
name: tfjob-port
resources:
limits:
nvidia.com/gpu: 1
tolerations:
- effect: NoSchedule
key: example-key
operator: Exists
status:
conditions:
- lastTransitionTime: "2022-07-15T09:57:05Z"
lastUpdateTime: "2022-07-15T09:57:05Z"
message: Experiment is created
reason: ExperimentCreated
status: "True"
type: Created
- lastTransitionTime: "2022-07-15T09:57:26Z"
lastUpdateTime: "2022-07-15T09:57:26Z"
message: Experiment is running
reason: ExperimentRunning
status: "True"
type: Running
currentOptimalTrial:
observation: {}
runningTrialList:
- dbpedia-exp-8-g4pvh4fc
startTime: "2022-07-15T09:57:05Z"
trials: 1
trialsRunning: 1
Even though it shows Running, it will time out eventually.
What am I missing here?
This is not specific to Katib. It means that the trial could not find a node which satisfies these resource requirements to start the pod. One thing to note: when you add resource requirements to the trial spec, every trial pod will request the same set of resources when run in parallel. E.g., if the trialSpec has a 1 GPU requirement and the experimentSpec allows 3 parallelTrials, then each trial pod will request 1 GPU (a total of 3 GPUs).
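The arithmetic above can be sketched as a quick capacity check (illustrative values taken from the example, not from your cluster):

```python
# Peak GPU demand is gpus-per-trial times parallelTrialCount: all parallel
# trials must be schedulable at the same time for the experiment to progress.
gpus_per_trial = 1
parallel_trial_count = 3
peak_gpus = gpus_per_trial * parallel_trial_count
print(peak_gpus)  # 3 GPUs must be allocatable across the cluster
```

With a single-GPU node pool like the one described above, parallelTrialCount effectively has to stay at 1 (or trials will queue, pending).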
Here is the gist of my working sample; you can ignore the node-selector stuff, it just helps to schedule the pod on the GPU node I want (dedicated for training in my case):
trial_spec = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "spec": {
        "template": {
            "metadata": {
                "annotations": {
                    "sidecar.istio.io/inject": "false"
                }
            },
            "spec": {
                "affinity": {
                    "nodeAffinity": {
                        "requiredDuringSchedulingIgnoredDuringExecution": {
                            "nodeSelectorTerms": [
                                {
                                    "matchExpressions": [
                                        {
                                            "key": "k8s.amazonaws.com/accelerator",
                                            "operator": "In",
                                            "values": ["nvidia-tesla-v100"]
                                        },
                                        {
                                            "key": "ai-gpu-2",
                                            "operator": "In",
                                            "values": ["true"]
                                        }
                                    ]
                                }
                            ]
                        }
                    }
                },
                "containers": [
                    {
                        "resources": {
                            "limits": {
                                "nvidia.com/gpu": 1
                            }
                        },
                        "name": training_container_name,
                        "image": "xxxxxxxxxxxxxxxxxxxxx__YOUR_IMAGE_HERE_xxxxxxxxxxxxxx",
                        "imagePullPolicy": "Always",
                        "command": train_params + [
                            "--learning_rate=${trialParameters.learning_rate}",
                            "--optimizer=${trialParameters.optimizer}",
                            "--batch_size=${trialParameters.batch_size}",
                            "--max_epochs=${trialParameters.max_epochs}"
                        ]
                    }
                ],
                "restartPolicy": "Never",
                "serviceAccountName": "default-editor"
            }
        }
    }
}
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Feel free to re-open an issue if you have any followup problems.
/kind bug
What steps did you take and what happened: I am trying to create a Kubeflow pipeline that tunes the hyperparameters of a text classification model in TensorFlow using Katib on GKE clusters. I created a cluster using the below commands
I then created a Kubeflow pipeline:
These are my two containers.
gcr.io/.............../hptunekatibclient:v7
Dockerfile
gcr.io/.............../hptunekatib:v7
Dockerfile
The pipeline runs, but it does not use the GPU, and this piece of code
gives an empty list and an empty string. It is as if the GPU does not exist. I am attaching the logs of the container.
What did you expect to happen:
I expected the pipeline stage to use the GPU and run the text classification on the GPU, but it doesn't.
Anything else you would like to add: [Miscellaneous information that will assist in solving the issue.]
Environment:
Kubernetes version (kubectl version): 1.22.8-gke.202
OS (uname -a): Linux / COS in containers
Impacted by this bug? Give it a 👍 We prioritize the issues with the most 👍