I tried using dropout and it slows down training a lot for me (like 10x slower). Looking at net/bilstm-parallel-layer.h:177, the dropout code there is the only difference compared to training the network without dropout.
I can see that the dropout code calls SetRandUniform(), which in turn creates a new CuRand tmp; on every call. Might this be slowing down training because the GPU is reseeded on every call? I also suspect that the seeding is done on the host, as this comment in gpucompute/cuda-rand.cc suggests:
// optionally re-seed the inner state
// (this is done in host, for good performance it is better to avoid re-seeding)
int32 tgt_size = probs.num_rows_ * probs.stride_;
if (tgt_size != state_size_) SeedGpu(tgt_size);
I will try to reuse the CuRand object and see what happens... meanwhile, if someone else has encountered the same problem, let me know.
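Roughly, what I have in mind is keeping the generator as a persistent member instead of a per-call temporary. This is just a sketch, not the actual bilstm-parallel-layer.h code: the class and member names are made up, I'm assuming the RandUniform(&matrix) method from gpucompute/cuda-rand.cc, and the header paths are guessed (namespace qualifiers omitted).

// Sketch only: keep one CuRand per layer so its GPU state is set up once
// and reused across calls, instead of building a temporary CuRand inside
// SetRandUniform() every time.
#include "gpucompute/cuda-matrix.h"   // header paths assumed
#include "gpucompute/cuda-rand.h"

class MyDropoutLayer {                 // hypothetical layer, for illustration
 public:
  void ApplyDropout(CuMatrix<BaseFloat> *out, BaseFloat dropout_rate) {
    dropout_mask_.Resize(out->NumRows(), out->NumCols());
    // Reuses rand_'s existing state; SeedGpu() is only re-run when the
    // matrix size changes (the tgt_size != state_size_ check above),
    // not on every call.
    rand_.RandUniform(&dropout_mask_);
    // ... threshold dropout_mask_ against dropout_rate and multiply into *out ...
  }
 private:
  CuRand<BaseFloat> rand_;             // persistent across calls
  CuMatrix<BaseFloat> dropout_mask_;
};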
Reusing the CuRand object does help! I'm guessing that seeding the CuRand object was happening on the host, not on the GPU. However, my quick fix will only work if all the layers that use dropout have the same size.
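If someone needs this with layers of different sizes, one possible extension (untested; RandForSize() and FillUniform() below are not existing EESEN code, just an illustration) is to keep a separate generator per matrix size, so the tgt_size != state_size_ check never forces a re-seed after the first call for a given size:

#include <map>
#include "gpucompute/cuda-matrix.h"   // header paths assumed, as above
#include "gpucompute/cuda-rand.h"

// One CuRand per distinct matrix size, created on first use and kept for
// the lifetime of the process (deliberately never freed).  Not thread-safe
// as written; this only shows the idea.
static CuRand<BaseFloat>& RandForSize(int32 tgt_size) {
  static std::map<int32, CuRand<BaseFloat>*> generators;
  CuRand<BaseFloat> *&gen = generators[tgt_size];
  if (gen == NULL) gen = new CuRand<BaseFloat>();
  return *gen;
}

void FillUniform(CuMatrix<BaseFloat> *mat) {
  int32 tgt_size = mat->NumRows() * mat->Stride();
  // Each size has its own state, so SeedGpu() runs at most once per size
  // rather than every time two differently-sized layers alternate.
  RandForSize(tgt_size).RandUniform(mat);
}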