Closed. pokerfaceSad closed this issue 4 years ago.
I find that the program hangs when calling tf.Session().
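A minimal script of this shape is enough to reproduce it (just a sketch, not my full training code; plain TF 1.13 API):

import tensorflow as tf

# The GPU device is initialized when the session is created; the process
# prints "Created TensorFlow device ..." and then hangs here.
with tf.Session() as sess:
    print(sess.run(tf.constant("session created")))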
Can you provide the process's stack from /proc/<pid>/stack?
Here it is
$ sudo cat /proc/19384/stack
[<ffffffff81104de2>] futex_wait_queue_me+0xc2/0x120
[<ffffffff81105936>] futex_wait+0x116/0x280
[<ffffffff81107dd6>] do_futex+0x126/0x540
[<ffffffff81108271>] SyS_futex+0x81/0x180
[<ffffffff8185328e>] entry_SYSCALL_64_fastpath+0x22/0xc1
[<ffffffffffffffff>] 0xffffffffffffffff
What's the version of your gpu-manager? Did you run your program in non-root mode?
- The gpu-manager-daemonset image is tkestack/gpu-manager:1.1.0.
- I use the tensorflow/tensorflow:1.13.1-gpu-py3 image to run the TF code; it runs as root by default. Will that matter?
It's weird. I've run your example Python script and it works. Can you provide this information?
- Container runtime
- ps aux|grep "gpu-"
- gpu-manager.yaml
Container runtime: nvidia-container-runtime
root 38186 3.2 0.0 1420780 34864 ? Sl 10:10 0:00 /usr/bin/gpu-manager --extra-config=/etc/gpu-manager/extra-config.json --v=5 --hostname-override=gpu42 --share-mode=true --volume-config=/etc/gpu-manager/volume.conf --log-dir=/var/log/gpu-manager --query-addr=0.0.0.0 --logtostderr=false
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-manager-daemonset
  namespace: kube-system
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      name: gpu-manager-ds
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: gpu-manager-ds
    spec:
      serviceAccount: gpu-manager
      tolerations:
        # This toleration is deprecated. Kept here for backward compatibility
        # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
        - key: CriticalAddonsOnly
          operator: Exists
        - key: tencent.com/vcuda-core
          operator: Exists
          effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      # only run on nodes that have GPU devices
      nodeSelector:
        nvidia-device-enable: enable
      hostPID: true
      containers:
        - image: tkestack/gpu-manager:1.1.0
          imagePullPolicy: IfNotPresent
          name: gpu-manager
          securityContext:
            privileged: true
          ports:
            - containerPort: 5678
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: vdriver
              mountPath: /etc/gpu-manager/vdriver
            - name: vmdata
              mountPath: /etc/gpu-manager/vm
            - name: log
              mountPath: /var/log/gpu-manager
            - name: run-dir
              mountPath: /var/run
            - name: cgroup
              mountPath: /sys/fs/cgroup
              readOnly: true
            - name: usr-directory
              mountPath: /usr/local/host
              readOnly: true
          env:
            - name: LOG_LEVEL
              value: "5"
            - name: EXTRA_FLAGS
              value: "--logtostderr=false"
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
      volumes:
        - name: device-plugin
          hostPath:
            type: Directory
            path: /var/lib/kubelet/device-plugins
        - name: vmdata
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/vm
        - name: vdriver
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/vdriver
        - name: log
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/log
        # We have to mount the whole /var/run directory into the container, because the bind-mounted
        # docker.sock inode changes after the host docker is restarted
        - name: run-dir
          hostPath:
            type: Directory
            path: /var/run
        - name: cgroup
          hostPath:
            type: Directory
            path: /sys/fs/cgroup
        # We have to mount the /usr directory instead of a specific library path, because the path
        # differs (or is missing) across distros
        - name: usr-directory
          hostPath:
            type: Directory
            path: /usr
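(The nodeSelector above means the DaemonSet only schedules onto nodes labeled nvidia-device-enable=enable, e.g. via kubectl label node <node-name> nvidia-device-enable=enable.)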
@mYmNeo I tried the same config on another host and it works well. The difference is the GPU type and GPU driver version. The Tesla K80 host (Driver Version: 418.39, CUDA Version: 10.1) doesn't work (it works well if vcuda-controller is not used), while the Tesla V100 host (Driver Version: 418.87.01, CUDA Version: 10.1) works well. I'm not sure whether the GPU type and driver version cause this problem.
@mYmNeo I think it is stuck in this loop because it can't get the correct GPU utilization for some reason.
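One way to sanity-check what the driver reports on that card is to query NVML directly from Python, for example with the pynvml bindings. Treat this only as a rough probe under assumptions: it uses the device-level utilization call, while gpu-manager may rely on a different per-process NVML utilization API.

import pynvml

# Probe whether the driver returns utilization data for GPU 0.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    print("gpu util: %d%%, mem util: %d%%" % (util.gpu, util.memory))
except pynvml.NVMLError as err:
    # An unsupported query shows up here instead of returning rates.
    print("utilization query failed:", err)
finally:
    pynvml.nvmlShutdown()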
Sorry, I didn't notice that your card is a K series. The driver API that NVIDIA provides for utilization doesn't work on K series cards. I'll add a notice to the README to mention this scenario.
Great, thanks for your replies over the past two weeks; we finally found the cause of the problem 🤣🤣. By the way, do gpu-manager and vcuda-controller accept code contributions from third-party developers? I think they are good projects; maybe I can do something.
Definitely yes. Any fascinating ideas and PRs are welcome.
Great, I will close this issue.
pod.yaml
Here is my .yaml file for creating the pod.
training code
Here is my TensorFlow code, just a simple CNN.
problem
The program hung up after outputting "Created TensorFlow device".
log
Here is the log; it repeatedly outputs "Hijacking nvml...".