I had experienced similarly poor GPU loading. In my experiments, four things improved my GPU load, but even so, the load stayed around 40%.
The things that helped me were: a) using more CPU workers, which improved file IO; 4 CPUs worked best for this DCASE data, b) setting `use_multiprocessing=False`, which gave roughly a 4x improvement over setting it to `True`, c) increasing the number of epochs per `fit_generator()` call, and d) increasing the batch size. Even after all this, the GPU was waiting for data most of the time, and when it got the data, it processed it in no time. I just couldn't find the right configuration to overcome the file IO bottleneck. In case you do find a good balance, let me know and I will update this code accordingly.
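For reference, the configuration that worked best for me looks roughly like the sketch below. It is only a sketch: `model` and `train_generator` are placeholders for your own model and generator, not this repository's actual names, and the batch size (point d) is set inside the generator itself.

```python
# Sketch of the fit_generator() settings discussed above.
# `model` and `train_generator` are placeholders defined elsewhere.
model.fit_generator(
    generator=train_generator,
    steps_per_epoch=len(train_generator),
    epochs=50,                  # more epochs per fit_generator() call (point c)
    workers=4,                  # 4 CPU workers helped file IO the most (point a)
    use_multiprocessing=False,  # threads were ~4x faster than processes here (point b)
    max_queue_size=10,          # batches pre-fetched while the GPU is busy
)
```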
Thank you for the fast reply! I noticed a performance drop using multiprocessing, too. I think there is a lot of confusion about this in the community: in various GitHub issues, most users reported no performance gains. I'll try again with the Sequential API, but I suspect this won't change anything. Others recommended using h5py. I'll try that too and report results!
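In case it helps others, the h5py suggestion amounts to reading pre-computed features from a single HDF5 file instead of many small files. A minimal sketch of what I plan to try (the dataset names `features` and `labels` and the class name are my own placeholders):

```python
import h5py
from keras.utils import Sequence

class H5Generator(Sequence):
    """Serves batches from one HDF5 file instead of many small files."""

    def __init__(self, h5_path, batch_size):
        self.h5_path = h5_path
        self.batch_size = batch_size
        with h5py.File(h5_path, 'r') as f:
            self.n_samples = f['features'].shape[0]
        self.file = None  # opened lazily, once per worker thread

    def __len__(self):
        return self.n_samples // self.batch_size

    def __getitem__(self, idx):
        if self.file is None:
            self.file = h5py.File(self.h5_path, 'r')
        start = idx * self.batch_size
        stop = start + self.batch_size
        # h5py reads only the requested slice from disk
        return self.file['features'][start:stop], self.file['labels'][start:stop]
```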
I investigated further: switching to the Sequential API and using h5py did not improve the training speed. However, I noticed a significant performance gain using CuDNNGRU in place of GRU. The downside is that recurrent dropout is not implemented in the cuDNN RNN ops (see this issue), apart from experimental implementations.
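The swap itself is a one-line change (sketch, assuming the TensorFlow backend with a CUDA-capable GPU):

```python
from keras.layers import GRU, CuDNNGRU

# Plain GRU: supports recurrent_dropout, but uses the generic kernel.
rnn = GRU(128, return_sequences=True, recurrent_dropout=0.25)

# CuDNNGRU: fused cuDNN kernel, much faster, but recurrent_dropout
# is not available and a GPU is required.
rnn = CuDNNGRU(128, return_sequences=True)
```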
Thanks for your experiments, @leanderme. They will help others trying similar approaches.
Hi, I have a dataset of ~3000 audio files (all zero-padded to 60 seconds). The data didn't fit into memory, so I've adapted your sed-crnn repository to use the data generator from this project. I only care about the SED labels, so I've removed everything related to sound event localization.
With `seq_len = 512` and `batch_size = 512`, it still takes roughly 130 seconds per epoch (~3 s/step). I'm using a 1080 Ti, and the system has 64 GB of RAM. With these parameters, the GPU usage is around 30%.

My questions: Did you experience similar training times? Did you manage to increase the load on your GPU while training? I'm wondering if the data generator is the issue here. Did you experiment with multiprocessing?
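For what it's worth, I check whether the generator alone is the bottleneck by timing it in isolation, without the model (sketch; `train_generator` stands in for my adapted generator):

```python
import time

# If this is close to the ~3 s/step seen during training, the input
# pipeline, not the GPU, is the bottleneck.
t0 = time.time()
for i in range(len(train_generator)):
    x, y = train_generator[i]
print('%.2f s per batch on average' % ((time.time() - t0) / len(train_generator)))
```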