Closed user-ZJ closed 4 years ago
@user-ZJ Are nvidia-smi and nvcc installed correctly?
Maybe your GPU is not compatible with tensorflow-gpu 2.2.0.
Can you try a different version of tensorflow-gpu? This should work with 2.0:
pip uninstall -y tensorflow-gpu && pip install tensorflow-gpu==2.0
Also try 2.1:
pip uninstall -y tensorflow-gpu && pip install tensorflow-gpu==2.1
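The first question in this comment, whether nvidia-smi and nvcc are installed, can be checked quickly from Python. This is only a sketch using the standard library; the helper name find_cuda_tools is made up for illustration, and finding a binary on PATH does not prove that the driver or toolkit actually works.

```python
import shutil


def find_cuda_tools(tools=("nvidia-smi", "nvcc")):
    """Return a dict mapping each tool name to its path on PATH, or None."""
    # shutil.which follows the same lookup as the shell and returns None
    # when the executable cannot be found on PATH.
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in find_cuda_tools().items():
        print(f"{tool}: {path or 'not found on PATH'}")
```

A missing nvcc usually means the CUDA toolkit is not installed or its bin directory is not on PATH, even when the driver (and nvidia-smi) is present.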
@philipperemy I have reinstalled tensorflow-gpu==2.0:
pip list | grep tensorflow
tensorflow-estimator 2.0.1
tensorflow-gpu 2.0.0
The training process now runs on the GPU.
But I have another problem: GPU utilization is only 10%. Is that normal?
nvidia-smi
Sat Jun 13 11:28:16 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   37C    P8    11W / 250W |    606MiB / 11016MiB |     10%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1094      G   /usr/lib/xorg/Xorg                            14MiB |
|    0      2024      G   /usr/lib/xorg/Xorg                           149MiB |
|    0      2170      G   /usr/bin/gnome-shell                         158MiB |
|    0      8094      C   python                                       153MiB |
|    0     19153      G   ...AAAAAAAAAAAACAAAAAAAAAA= --shared-files   125MiB |
+-----------------------------------------------------------------------------+
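nvidia-smi can also report utilization in a machine-readable form, which is easier to log during a training run than reading the table. A minimal sketch, assuming nvidia-smi is on PATH; the function names are hypothetical, and the parser assumes numeric fields (some GPUs report a field as unsupported, which this sketch does not handle).

```python
import subprocess


def parse_gpu_stats(csv_text):
    """Parse 'utilization.gpu, memory.used' CSV rows into a list of dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, mem = (field.strip() for field in line.split(","))
        stats.append({"util_percent": int(util), "memory_used_mib": int(mem)})
    return stats


def query_gpu_stats():
    """Run nvidia-smi and return one dict per GPU (needs an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_stats(out)
```

For the values in the table above, parse_gpu_stats("10, 606") yields one entry with 10 percent utilization and 606 MiB of memory used. Sampling query_gpu_stats() every few seconds during training gives a better picture than a single snapshot.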
Nice GPU! @user-ZJ how much system memory do you have? I had 32GB of memory. It's possible that you're running out of memory and that this slows down the training. This part is not very optimized. The RTX 2080 is a great GPU, so it is unlikely to be the bottleneck.
@philipperemy I cloned the 20200611 code.
This is my hardware usage when I run ./deepspeaker train_triplet.
CPU (Intel i7, 8 cores):
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8094 zack 20 0 20.224g 0.010t 101492 S 734.4 16.3 6767:56 python
GPU: only 10% utilization
MEMORY:
free -h
              total        used        free      shared  buff/cache   available
Mem:            62G         13G        1.1G        542M         48G         48G
Swap:           63G        214M         63G
train log:
Train for 2000 steps, validate for 200 steps
Epoch 1/1000
2000/2000 [==============================] - 11822s 6s/step - loss: 0.6875 - val_loss: 0.6616
Epoch 2/1000
2000/2000 [==============================] - 11786s 6s/step - loss: 0.6556 - val_loss: 0.6421
Epoch 3/1000
2000/2000 [==============================] - 11718s 6s/step - loss: 0.6432 - val_loss: 0.6334
Epoch 4/1000
2000/2000 [==============================] - 11669s 6s/step - loss: 0.6348 - val_loss: 0.6253
Epoch 5/1000
2000/2000 [==============================] - 11582s 6s/step - loss: 0.6288 - val_loss: 0.6193
Epoch 6/1000
 121/2000 [>.............................] - ETA: 2:59:06 - loss: 0.6251
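For scale, the numbers in this log work out as follows. The figures are copied from epoch 1; the projection assumes every epoch takes about as long as the first, so treat it as a rough estimate only.

```python
# Figures copied from epoch 1 of the training log.
epoch_seconds = 11822
steps_per_epoch = 2000
total_epochs = 1000

seconds_per_step = epoch_seconds / steps_per_epoch          # roughly 5.9 s
hours_per_epoch = epoch_seconds / 3600                      # roughly 3.3 h
days_for_all_epochs = epoch_seconds * total_epochs / 86400  # roughly 137 days

print(f"{seconds_per_step:.2f} s/step")
print(f"{hours_per_epoch:.2f} h/epoch")
print(f"{days_for_all_epochs:.0f} days for {total_epochs} epochs")
```

At this rate the configured 1000 epochs would take months, which is why the 10% GPU utilization matters.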
Why is the GPU usage so low? Is the CPU limiting the GPU usage?
@user-ZJ yeah, so triplet training is heavily dependent on the speed of your disk. Do you have a fast SSD?
All the triplets are read from the disk and formed at each batch.
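One way to test whether batch preparation, rather than the GPU, is the limiting factor is to time the data generator on its own, without the model. A minimal sketch in plain Python: time_batches is a hypothetical helper, and fake_batches is a stand-in for whatever generator the training script actually uses, which is not shown in this thread.

```python
import itertools
import time


def time_batches(batch_iter, num_batches=20):
    """Return the mean seconds spent producing each of the first batches."""
    start = time.perf_counter()
    produced = 0
    for _ in itertools.islice(batch_iter, num_batches):
        produced += 1
    elapsed = time.perf_counter() - start
    if produced == 0:
        raise ValueError("the iterator produced no batches")
    return elapsed / produced


def fake_batches(delay=0.01):
    """Stand-in generator: sleeps to mimic disk reads and preprocessing."""
    while True:
        time.sleep(delay)
        yield [0.0] * 8  # placeholder batch


if __name__ == "__main__":
    print(f"{time_batches(fake_batches(), 10):.4f} s per batch")
```

If the per-batch time measured this way is close to the step time reported by the training run, the input pipeline is the bottleneck and the GPU is mostly waiting for data.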
@philipperemy thanks for the reply. All my data is on an SSD, but GPU utilization is still low.
Triplet training is slow, as shown above. Running ./deepspeaker train_softmax is also slow: it needs about 8 hours per epoch.
Epoch 1/1000 2208/552576 [..............................] - ETA: 8:35:30
Do you have any suggestions to speed up the training? Thanks!
Sorry @user-ZJ I don't have any idea why it's slow on your machine. Are you using exactly the same dataset (LibriSpeech)?
You can see my training logs for information:
@philipperemy I also use the LibriSpeech dataset. I will try to debug this, thanks!
Good luck :)
For info, it worked great with my desktop: i7-8770K, 32GB memory and GTX 1070 with CUDA 10.
I uninstalled tensorflow and built it from source. Then I ran ./deepspeaker train_softmax and it works well. I guess the problem was caused by the CUDA version (mine is 10.2). Thanks.
@user-ZJ awesome!!! Yes, your version is probably not CUDA 10.0, so it required a custom tensorflow build (from source). Now I remember I had this problem when I installed CUDA 10.1. I eventually reverted to 10.0 to avoid the hassle of building my own wheels.
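The version pairing discussed here can be written down as a small lookup. The pairs below are taken from the tested build configurations table in the TensorFlow install documentation and should be verified there before relying on them; the names TESTED_CUDA, expected_cuda and matches are hypothetical.

```python
# CUDA toolkit versions that the official Linux GPU builds were tested with,
# according to the TensorFlow install docs. Check that table before relying
# on these values.
TESTED_CUDA = {
    "2.0": "10.0",
    "2.1": "10.1",
    "2.2": "10.1",
}


def expected_cuda(tf_version):
    """Return the CUDA version paired with a TensorFlow release, or None."""
    major_minor = ".".join(tf_version.split(".")[:2])
    return TESTED_CUDA.get(major_minor)


def matches(tf_version, installed_cuda):
    """True when the installed CUDA toolkit matches the tested pairing."""
    return expected_cuda(tf_version) == installed_cuda
```

With CUDA 10.2 installed, matches returns False for 2.0, 2.1 and 2.2 alike, which fits the experience in this thread that a build from source was what finally worked.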
I have installed tensorflow-gpu, but when I run './deepspeaker train_softmax' it only uses the CPU. I don't know why. Can you give me some pointers?
pip list | grep tensorflow
tensorflow-estimator 2.2.0
tensorflow-gpu 2.2.0
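Having tensorflow-gpu in pip list does not by itself mean TensorFlow can use the GPU. A quick check is to ask TensorFlow which GPUs it sees. This is a sketch rather than a full diagnosis; visible_gpus is a hypothetical helper name, and it returns None when TensorFlow cannot be imported.

```python
def visible_gpus():
    """Return TensorFlow's list of visible GPUs, or None if TF is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # tf.config.list_physical_devices is available from TF 2.1 onward;
    # TF 2.0 only has the tf.config.experimental variant, so fall back to it.
    list_devices = getattr(tf.config, "list_physical_devices", None)
    if list_devices is None:
        list_devices = tf.config.experimental.list_physical_devices
    return list_devices("GPU")


if __name__ == "__main__":
    print(visible_gpus())
```

An empty list means TensorFlow found no usable GPU. In that case the messages TensorFlow prints at import time usually name the CUDA or cuDNN library it failed to load.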