philipperemy / deep-speaker

Deep Speaker: an End-to-End Neural Speaker Embedding System.
MIT License

how to use gpu #58

Closed user-ZJ closed 4 years ago

user-ZJ commented 4 years ago

I have installed tensorflow-gpu, but when I run `./deepspeaker train_softmax`, I find it only uses the CPU. I don't know why. Can you give me some pointers?

```
pip list | grep tensorflow
tensorflow-estimator  2.2.0
tensorflow-gpu        2.2.0
```

philipperemy commented 4 years ago

@user-ZJ are nvidia-smi and nvcc installed correctly? Maybe your GPU is not compatible with tensorflow-gpu 2.2.0. Can you try different versions of tensorflow-gpu, such as:

This should work with 2.0:

pip uninstall -y tensorflow-gpu && pip install tensorflow-gpu==2.0

Also try 2.1:

pip uninstall -y tensorflow-gpu && pip install tensorflow-gpu==2.1
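After reinstalling, it may help to confirm that TensorFlow can actually see the GPU before launching training. A minimal check (an illustration, not part of the repo; `tf.config.list_physical_devices` is the TF 2.1+ API, and on TF 2.0 it lives under `tf.config.experimental`):

```python
def visible_gpus():
    """Return the GPU device names TensorFlow can see, or None if
    TensorFlow is not installed in this environment."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # Fall back to the experimental namespace on TF 2.0.
    list_devices = getattr(tf.config, "list_physical_devices",
                           tf.config.experimental.list_physical_devices)
    return [d.name for d in list_devices("GPU")]

print(visible_gpus())
```

If this prints an empty list, the wheel is CPU-only or incompatible with the installed CUDA, and training will silently fall back to the CPU.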
user-ZJ commented 4 years ago

@philipperemy I have reinstalled tensorflow-gpu==2.0:

```
pip list | grep tensorflow
tensorflow-estimator  2.0.1
tensorflow-gpu        2.0.0
```

The training process is now running on the GPU.

But I have another problem: GPU utilization is only 10%. Is that normal?

```
nvidia-smi
Sat Jun 13 11:28:16 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   37C    P8    11W / 250W |    606MiB / 11016MiB |     10%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1094      G   /usr/lib/xorg/Xorg                            14MiB |
|    0      2024      G   /usr/lib/xorg/Xorg                           149MiB |
|    0      2170      G   /usr/bin/gnome-shell                         158MiB |
|    0      8094      C   python                                       153MiB |
|    0     19153      G   ...AAAAAAAAAAAACAAAAAAAAAA= --shared-files   125MiB |
+-----------------------------------------------------------------------------+
```
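For monitoring utilization over time, nvidia-smi also has a machine-readable mode (`nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader` is a documented flag). A small sketch that parses that output, here fed a canned sample rather than a live query:

```python
def parse_gpu_util(csv_text):
    """Parse lines like '10 %' (one per GPU) from
    nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader
    and return the utilization percentages as integers."""
    return [int(line.split()[0]) for line in csv_text.strip().splitlines()]

sample = "10 %\n"  # matches the 10% shown in the table above
print(parse_gpu_util(sample))  # prints [10]
```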

philipperemy commented 4 years ago

Nice GPU! @user-ZJ how much system memory do you have? I had 32GB. It's possible that you're running out of memory and that's slowing down the training; this part is not very optimized. The RTX 2080 is a great GPU, so it's unlikely to be the bottleneck.

user-ZJ commented 4 years ago

@philipperemy

I clone the 20200611 code.

This is my hardware usage when I run `./deepspeaker train_triplet` (CPU: Intel i7, 8 cores):

```
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 8094 zack      20   0 20.224g 0.010t 101492 S 734.4 16.3   6767:56 python
```

GPU: only 10% utilization.

MEMORY:

```
free -h
              total        used        free      shared  buff/cache   available
Mem:            62G         13G        1.1G        542M         48G         48G
Swap:           63G        214M         63G
```

Training log:

```
Train for 2000 steps, validate for 200 steps
Epoch 1/1000
2000/2000 [==============================] - 11822s 6s/step - loss: 0.6875 - val_loss: 0.6616
Epoch 2/1000
2000/2000 [==============================] - 11786s 6s/step - loss: 0.6556 - val_loss: 0.6421
Epoch 3/1000
2000/2000 [==============================] - 11718s 6s/step - loss: 0.6432 - val_loss: 0.6334
Epoch 4/1000
2000/2000 [==============================] - 11669s 6s/step - loss: 0.6348 - val_loss: 0.6253
Epoch 5/1000
2000/2000 [==============================] - 11582s 6s/step - loss: 0.6288 - val_loss: 0.6193
Epoch 6/1000
 121/2000 [>.............................] - ETA: 2:59:06 - loss: 0.6251
```
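The per-step time in the log can be sanity-checked with simple arithmetic (numbers taken directly from the log above):

```python
# Epoch 1 from the log: 2000 steps completed in 11822 seconds.
steps = 2000
epoch_seconds = 11822
per_step = epoch_seconds / steps
print(round(per_step, 1))  # prints 5.9
```

Roughly 5.9 s/step, consistent with the "6s/step" Keras reports; at ~10% GPU utilization, most of that time is spent outside GPU compute.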

Why is the GPU usage so low? Is the CPU limiting the GPU?

philipperemy commented 4 years ago

@user-ZJ yeah so the train triplet is heavily dependent on the speed of your disk. Do you have a fast SSD?

philipperemy commented 4 years ago

All the triplets are read from the disk and formed at each batch.
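If every batch re-reads utterances from disk, one way to cut the I/O cost is an in-memory cache for the raw reads. A minimal sketch, assuming repeated reads of the same files; the helper name and cache size are illustrative, not the repo's actual API:

```python
import functools

@functools.lru_cache(maxsize=4096)
def read_utterance(path):
    """Cache raw file bytes so triplets that reuse an utterance
    do not touch the disk a second time."""
    with open(path, "rb") as f:
        return f.read()
```

The second read of any cached path is served from memory, so the effectiveness depends on how often triplet sampling revisits the same utterances within the cache window.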

user-ZJ commented 4 years ago

@philipperemy thanks for the reply. All my data is on an SSD, but GPU utilization is still low.

Training with train_triplet is slow, as shown above, and running `./deepspeaker train_softmax` is also slow: it needs about 8 hours per epoch.

```
Epoch 1/1000
2208/552576 [..............................] - ETA: 8:35:30
```

Do you have any suggestion to speed up the training? thanks!

philipperemy commented 4 years ago

Sorry @user-ZJ, I have no idea why it's slow on your machine. Are you using exactly the same dataset (LibriSpeech)?

You can see my training logs for reference.

user-ZJ commented 4 years ago

@philipperemy I also use the LibriSpeech dataset. I will try to debug this, thanks!

philipperemy commented 4 years ago

Good luck :)

For info, it worked great on my desktop: i7-8770K, 32GB of memory, and a GTX 1070 with CUDA 10.

user-ZJ commented 4 years ago

I uninstalled TensorFlow and built it from source. Then I ran `./deepspeaker train_softmax` and it works well. I guess it was caused by the CUDA version (my CUDA version is 10.2). Thanks.

philipperemy commented 4 years ago

@user-ZJ awesome!!! Yes, probably your CUDA version is not 10.0, so it required a custom TensorFlow build (from source). Now I remember I had this problem when I installed CUDA 10.1. I eventually reverted to 10.0 to avoid the hassle of building my own wheels.
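For reference, the CUDA versions the official tensorflow-gpu pip wheels are built against, per TensorFlow's published tested build configurations (anything else, such as the CUDA 10.2 in this thread, needs a source build):

```python
# tensorflow-gpu wheel -> CUDA version it was built against,
# per tensorflow.org's tested build configurations.
WHEEL_CUDA = {
    "2.0": "10.0",
    "2.1": "10.1",
    "2.2": "10.1",
}

print(WHEEL_CUDA["2.0"])  # prints 10.0
```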