srvk / eesen

The official repository of the Eesen project
http://arxiv.org/abs/1507.08240
Apache License 2.0

Using dropout makes training very slow #111

Open bmilde opened 7 years ago

bmilde commented 7 years ago

I tried using dropout and it slows training down dramatically for me (roughly 10x slower).

Looking at net/bilstm-parallel-layer.h:177, this code is the only difference compared to training the network without dropout:

  if (drop_factor_ != 0.0) {
    drop_mask_.Resize(T*S, 2 * cell_dim_, kUndefined);
    drop_mask_.SetRandUniform();
    drop_mask_.Add(-drop_factor_);
    drop_mask_.ApplyHeaviside();
    YR_RB.RowRange(S,T*S).MulElements(drop_mask_);
  }

I can see that the dropout code calls SetRandUniform(), which in turn creates a new CuRand tmp; on every call. Might this slow down training because the GPU generator is reseeded every time? I also suspect that the seeding is done on the host, as these comments in gpucompute/cuda-rand.cc suggest:

// optionally re-seed the inner state 
// (this is done in host, for good performance it is better to avoid re-seeding)
int32 tgt_size = probs.num_rows_ * probs.stride_;
if (tgt_size != state_size_) SeedGpu(tgt_size);

I will try to reuse the CuRand object and see what happens... meanwhile, if someone else encountered the same problem, let me know.

bmilde commented 7 years ago

Reusing the CuRand object does help! I'm guessing that seeding the CuRand object was happening on the host, not on the GPU. However, my quick fix only works if all the layers that use dropout have the same size.
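One way around the same-size limitation might be to cache one generator per state size, so the expensive seeding (the `SeedGpu` path in cuda-rand.cc) happens at most once per distinct layer size rather than on every call. A rough sketch, with `std::mt19937` standing in for the GPU-side CuRand state and all names hypothetical:

```cpp
#include <map>
#include <random>

// Sketch of generalizing the quick fix: instead of one shared CuRand
// object (which only works when every dropout layer has the same size),
// keep one generator per state size. Re-seeding then happens at most
// once per distinct size, no matter how many minibatches are processed.
class RandCache {
 public:
  // Returns a generator seeded once for this state size and reused
  // on every subsequent call with the same size.
  std::mt19937& GetGenerator(int state_size) {
    auto it = cache_.find(state_size);
    if (it == cache_.end()) {
      // Expensive path (the analogue of SeedGpu): taken once per size.
      it = cache_.emplace(state_size, std::mt19937(state_size)).first;
      ++seed_count_;
    }
    return it->second;
  }
  int seed_count() const { return seed_count_; }

 private:
  std::map<int, std::mt19937> cache_;
  int seed_count_ = 0;
};
```

With this, a network whose dropout layers have, say, two distinct sizes would trigger exactly two seedings total over the whole training run.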