Closed lengrongfu closed 1 year ago
@archlitchi Can you help me? After tracing through the code, I found that the pod patch succeeds, but the data is not written to the pod.
$ kubectl get pods vgpu-deploy2-758579bd97-7cwbv -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduling.k8s.io/group-name: podgroup-4731d0a9-c9c8-4f95-b1ac-89918d660385
    volcano.sh/resource-group: cpu
  creationTimestamp: "2023-06-16T09:10:46Z"
  generateName: vgpu-deploy2-758579bd97-
  labels:
    app: vgpu2
    pod-template-hash: 758579bd97
  name: vgpu-deploy2-758579bd97-7cwbv
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: vgpu-deploy2-758579bd97
    uid: 4731d0a9-c9c8-4f95-b1ac-89918d660385
  resourceVersion: "535286"
  uid: 35197671-900b-45d0-8fed-a7732308130e
spec:
  containers:
  - args:
    - "6000"
    image: chrstnhntschl/gpu_burn
    imagePullPolicy: Always
    name: vgpu2
    resources:
      limits:
        volcano.sh/vgpu-memory: "1024"
        volcano.sh/vgpu-number: "1"
      requests:
        volcano.sh/vgpu-memory: "1024"
        volcano.sh/vgpu-number: "1"
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-dfc6z
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: 10-29-4-48
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: volcano
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: kube-api-access-dfc6z
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  message: 'Pod Allocate failed due to no containers return in allocation response
    &AllocateResponse{ContainerResponses:[]*ContainerAllocateResponse{},}, which is
    unexpected'
  phase: Failed
  reason: UnexpectedAdmissionError
  startTime: "2023-06-16T09:12:13Z"
The Deployment I applied:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vgpu-deploy2
  annotations:
    volcano.sh/resource-group: cpu
  labels:
    app: vgpu2
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vgpu2
  template:
    metadata:
      annotations:
        volcano.sh/resource-group: cpu
      labels:
        app: vgpu2
    spec:
      #schedulerName: volcano
      containers:
      - name: vgpu2
        image: chrstnhntschl/gpu_burn
        args:
        - "6000"
        resources:
          limits:
            volcano.sh/vgpu-number: 1
            volcano.sh/vgpu-memory: 1024
Can you run the example in vGPU mode correctly?
I applied it with the following YAML:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod12
spec:
  schedulerName: volcano
  containers:
  - name: ubuntu-container
    image: ubuntu:18.04
    command: ["bash", "-c", "sleep 86400"]
    resources:
      limits:
        volcano.sh/vgpu-number: 1 # requesting 1 vGPU
        volcano.sh/vgpu-memory: 2000
        #volcano.sh/vgpu-memory-percentage: 50 # Each vGPU gets 50% of that GPU's device memory. Cannot be used with nvidia.com/gpumem
  - name: ubuntu-container0
    image: ubuntu:18.04
    command: ["bash", "-c", "sleep 86400"]
  - name: ubuntu-container1
    image: ubuntu:18.04
    command: ["bash", "-c", "sleep 86400"]
    resources:
      limits:
        volcano.sh/vgpu-number: 1 # requesting 1 vGPU
        volcano.sh/vgpu-memory: 3000
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
gpu-pod12 0/3 UnexpectedAdmissionError 0 7s
That's weird; if the annotation patch step is unsuccessful, it should return an error and fail the scheduling.
Is it possible that the patch action submitted to kube-apiserver failed to execute?
Can you show me the log of "vgpu-device-plugin" for the example you just submitted?
I get the same errors while using the vgpu plugin. Are there any actions that can help?
Check whether you have a "NodeName" field in your task. If you do, please use "nodeSelector" instead; if you don't, please show me the log of the device-plugin.
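Setting nodeName bypasses the scheduler, so the vGPU annotation patch never runs. A minimal sketch of the suggested change, using the standard Kubernetes pod spec fields (the `kubernetes.io/hostname` label is one common way to target a specific node; adapt to your own labels):

```yaml
# Instead of pinning the pod directly:
#   spec:
#     nodeName: 10-29-4-48
# let the volcano scheduler place it via a nodeSelector:
spec:
  schedulerName: volcano
  nodeSelector:
    kubernetes.io/hostname: 10-29-4-48
```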
This is the vcjob YAML:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: tensorflow-dist-mnist
spec:
  minAvailable: 3
  schedulerName: volcano
  plugins:
    env: []
    svc: []
  policies:
  - event: PodEvicted
    action: RestartJob
  queue: ai
  tasks:
  - replicas: 1
    name: ps
    template:
      spec:
        containers:
        - command:
          - sh
          - -c
          - |
            PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
            WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
            export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"ps\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"};
            python /var/tf_dist_mnist/dist_mnist.py
          image: volcanosh/dist-mnist-tf-example:0.0.1
          name: tensorflow
          ports:
          - containerPort: 2222
            name: tfjob-port
        restartPolicy: Never
  - replicas: 2
    name: worker
    policies:
    - event: TaskCompleted
      action: CompleteJob
    template:
      spec:
        containers:
        - command:
          - sh
          - -c
          - |
            PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
            WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
            export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"worker\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"};
            python /var/tf_dist_mnist/dist_mnist.py
          image: volcanosh/dist-mnist-tf-example:0.0.1
          name: tensorflow
          ports:
          - containerPort: 2222
            name: tfjob-port
          resources:
            limits:
              volcano.sh/vgpu-number: 2
        tolerations:
        - effect: NoSchedule
          key: volcano.sh/gpu
          operator: Exists
        restartPolicy: Never
And I don't find GPU memory info in the node describe output:
Capacity:
  cpu:                           8
  ephemeral-storage:             51539404Ki
  hugepages-1Gi:                 0
  hugepages-2Mi:                 0
  memory:                        40823904Ki
  nvidia.com/gpu:                1
  pods:                          95
  tke.cloud.tencent.com/eip:     2
  tke.cloud.tencent.com/eni-ip:  95
  volcano.sh/vgpu-number:        10
Allocatable:
  cpu:                           7800m
  ephemeral-storage:             47498714648
  hugepages-1Gi:                 0
  hugepages-2Mi:                 0
  memory:                        36486240Ki
  nvidia.com/gpu:                1
  pods:                          95
  tke.cloud.tencent.com/eip:     2
  tke.cloud.tencent.com/eni-ip:  6
  volcano.sh/vgpu-number:        10
Here is the stdout log of the node's volcano-device-plugin pod:
I0706 08:05:20.344025 1 main.go:77] Loading NVML
I0706 08:05:20.347260 1 main.go:91] Starting FS watcher.
I0706 08:05:20.347441 1 main.go:98] Starting OS watcher.
I0706 08:05:20.354087 1 main.go:116] Retreiving plugins.
I0706 08:05:20.354152 1 register.go:101] into WatchAndRegister
2023/07/06 08:05:20 Starting GRPC server for 'volcano.sh/vgpu-number'
2023/07/06 08:05:20 Starting to serve 'volcano.sh/vgpu-number' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2023/07/06 08:05:20 Registered device plugin for 'volcano.sh/vgpu-number' with Kubelet
I0706 08:05:20.371750 1 register.go:89] Reporting devices GPU-f0803cfd-7b91-b063-5b9d-50c146197a89,10,32510,NVIDIA-Tesla V100-SXM2-32GB,false: in 2023-07-06 08:05:20.371740679 +0000 UTC m=+0.033841141
I0706 08:05:50.404241 1 register.go:89] Reporting devices GPU-f0803cfd-7b91-b063-5b9d-50c146197a89,10,32510,NVIDIA-Tesla V100-SXM2-32GB,false: in 2023-07-06 08:05:50.404232127 +0000 UTC m=+30.066332579
I0706 08:06:20.438079 1 register.go:89] Reporting devices GPU-f0803cfd-7b91-b063-5b9d-50c146197a89,10,32510,NVIDIA-Tesla V100-SXM2-32GB,false: in 2023-07-06 08:06:20.438070059 +0000 UTC m=+60.100170526
I0706 08:06:50.469563 1 register.go:89] Reporting devices GPU-f0803cfd-7b91-b063-5b9d-50c146197a89,10,32510,NVIDIA-Tesla V100-SXM2-32GB,false: in 2023-07-06 08:06:50.469554195 +0000 UTC m=+90.131654661
When I apply a vcjob that allocates vgpu-number, the volcano-device-plugin pod restarts and logs this error:
I0706 08:03:26.139587 1 main.go:77] Loading NVML
I0706 08:03:26.143063 1 main.go:91] Starting FS watcher.
I0706 08:03:26.143258 1 main.go:98] Starting OS watcher.
I0706 08:03:26.149805 1 main.go:116] Retreiving plugins.
I0706 08:03:26.150118 1 register.go:101] into WatchAndRegister
2023/07/06 08:03:26 Starting GRPC server for 'volcano.sh/vgpu-number'
2023/07/06 08:03:26 Starting to serve 'volcano.sh/vgpu-number' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2023/07/06 08:03:26 Registered device plugin for 'volcano.sh/vgpu-number' with Kubelet
I0706 08:03:26.178483 1 register.go:89] Reporting devices GPU-f0803cfd-7b91-b063-5b9d-50c146197a89,10,32510,NVIDIA-Tesla V100-SXM2-32GB,false: in 2023-07-06 08:03:26.178475867 +0000 UTC m=+0.046181881
I0706 08:03:56.214284 1 register.go:89] Reporting devices GPU-f0803cfd-7b91-b063-5b9d-50c146197a89,10,32510,NVIDIA-Tesla V100-SXM2-32GB,false: in 2023-07-06 08:03:56.214274187 +0000 UTC m=+30.081980204
I0706 08:04:26.261105 1 register.go:89] Reporting devices GPU-f0803cfd-7b91-b063-5b9d-50c146197a89,10,32510,NVIDIA-Tesla V100-SXM2-32GB,false: in 2023-07-06 08:04:26.261095335 +0000 UTC m=+60.128801381
I0706 08:04:56.293621 1 register.go:89] Reporting devices GPU-f0803cfd-7b91-b063-5b9d-50c146197a89,10,32510,NVIDIA-Tesla V100-SXM2-32GB,false: in 2023-07-06 08:04:56.293612407 +0000 UTC m=+90.161318431
I0706 08:05:19.926804 1 plugin.go:309] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-f0803cfd-7b91-b063-5b9d-50c146197a89-4 GPU-f0803cfd-7b91-b063-5b9d-50c146197a89-0],}]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1092953]
goroutine 15 [running]:
volcano.sh/k8s-device-plugin/pkg/plugin/vgpu4pd.(*NvidiaDevicePlugin).Allocate(0xc0001420a0, {0x14cfee0, 0xc0003311a0}, 0xc000040be0)
/go/src/volcano.sh/devices/pkg/plugin/vgpu4pd/plugin.go:326 +0x353
k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1._DevicePlugin_Allocate_Handler({0x12920a0?, 0xc0001420a0}, {0x14cfee0, 0xc0003311a0}, 0xc0002eac60, 0x0)
/go/pkg/mod/k8s.io/kubelet@v0.18.2/pkg/apis/deviceplugin/v1beta1/api.pb.go:1192 +0x170
google.golang.org/grpc.(*Server).processUnaryRPC(0xc000103040, {0x14d4df8, 0xc00053c000}, 0xc000443200, 0xc000128180, 0x1ce57f8, 0x0)
/go/pkg/mod/google.golang.org/grpc@v1.29.0/server.go:1082 +0xcab
google.golang.org/grpc.(*Server).handleStream(0xc000103040, {0x14d4df8, 0xc00053c000}, 0xc000443200, 0x0)
/go/pkg/mod/google.golang.org/grpc@v1.29.0/server.go:1405 +0xa13
google.golang.org/grpc.(*Server).serveStreams.func1.1()
/go/pkg/mod/google.golang.org/grpc@v1.29.0/server.go:746 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
/go/pkg/mod/google.golang.org/grpc@v1.29.0/server.go:744 +0xea
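The panic above (a nil pointer dereference inside `Allocate` at `vgpu4pd/plugin.go`) suggests the plugin dereferences a device record that was never populated for a requested device ID. Crashing inside `Allocate` is what produces the kubelet-side "error reading from server: EOF" admission error. The following is a hedged, stand-alone sketch, not the actual vgpu4pd code: a hypothetical `allocateDevices` helper that validates every requested ID against its cache and returns an error (which gRPC surfaces as an Allocate failure) instead of panicking and killing the plugin:

```go
package main

import (
	"errors"
	"fmt"
)

// Device is a stand-in for the plugin's cached per-device record
// (hypothetical; the real type lives in the vgpu4pd plugin).
type Device struct {
	ID string
}

// allocateDevices looks up each requested device ID in the cache and
// returns an error for missing or nil entries instead of dereferencing
// them, so a bad request fails the RPC rather than crashing the server.
func allocateDevices(cache map[string]*Device, ids []string) ([]*Device, error) {
	out := make([]*Device, 0, len(ids))
	for _, id := range ids {
		dev, ok := cache[id]
		if !ok || dev == nil {
			return nil, errors.New("unknown device ID: " + id)
		}
		out = append(out, dev)
	}
	return out, nil
}

func main() {
	cache := map[string]*Device{
		"GPU-f0803cfd-0": {ID: "GPU-f0803cfd-0"},
	}
	// Requesting an ID that is not in the cache now yields an error,
	// where an unguarded lookup followed by a field access would panic.
	if _, err := allocateDevices(cache, []string{"GPU-f0803cfd-4"}); err != nil {
		fmt.Println("allocate failed:", err)
	}
}
```

With a guard like this, a stale or malformed device ID in the kubelet's AllocateRequest would show up as an Allocate error in the pod events instead of a device-plugin restart loop.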
@nabanbaba Can you restart the device-plugin and try example.yaml? See if it works.
After restarting the plugin DaemonSet and running a pod with the example, I still get the same error. Pod YAML:
apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  restartPolicy: OnFailure
  schedulerName: volcano
  tolerations:
  - effect: NoSchedule
    key: volcano.sh/gpu
    operator: Exists
  containers:
  - image: nvidia/cuda:10.1-base-ubuntu18.04
    name: pod1-ctr
    command: ["sleep"]
    args: ["100000"]
    resources:
      limits:
        volcano.sh/vgpu-number: 1
And ... the device plugin does not report a volcano.sh/vgpu-memory resource; the node info does not have this resource.
That is normal, because GPU memory is more like a parameter of "vgpu" than a standalone device resource, and it should be ignored by the scheduler.
Please describe this pod and check whether the annotations were properly modified by the scheduler.
describe info:
+ kubectl describe pods pod1
Name: pod1
Namespace: volcano-system
Priority: 0
Service Account: default
Node: 10.122.2.3/
Start Time: Thu, 06 Jul 2023 16:59:05 +0800
Labels: <none>
Annotations: scheduling.k8s.io/group-name: podgroup-623f0e8a-364a-4dd1-8da3-0c0e9ef55239
Status: Failed
Reason: UnexpectedAdmissionError
Message: Pod was rejected: Allocate failed due to rpc error: code = Unavailable desc = error reading from server: EOF, which is unexpected
IP:
IPs: <none>
Containers:
pod1-ctr:
Image: nvidia/cuda:10.1-base-ubuntu18.04
Port: <none>
Host Port: <none>
Command:
sleep
Args:
100000
Limits:
tke.cloud.tencent.com/eni-ip: 1
volcano.sh/vgpu-number: 2
Requests:
tke.cloud.tencent.com/eni-ip: 1
volcano.sh/vgpu-number: 2
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-r89v6 (ro)
Volumes:
kube-api-access-r89v6:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
volcano.sh/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 6s volcano Successfully assigned volcano-system/pod1 to 10.122.2.3
Warning UnexpectedAdmissionError 6s kubelet Allocate failed due to rpc error: code = Unavailable desc = error reading from server: EOF, which is unexpected
So, can you advise me: should I go back to the old version of the GPU-sharing plugin?
Can you add my wechat "xuanzong4493" for further inspection?
Invitation sent.
Solved. You need to use vc-scheduler:latest instead of vc-scheduler:1.7.0, because vGPU is a new feature in v1.8 and is not included in v1.7.
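For anyone else hitting this, the fix amounts to changing the scheduler image tag. A hypothetical fragment of the scheduler Deployment is shown below; the Deployment name, container name, and image repository are assumptions based on a typical Volcano install in the volcano-system namespace, so check your own manifests:

```yaml
# Assumed fragment of the volcano-scheduler Deployment (volcano-system):
spec:
  template:
    spec:
      containers:
      - name: volcano-scheduler
        image: volcanosh/vc-scheduler:latest   # was a v1.7.0 tag without vGPU support
```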