srvk / eesen

The official repository of the Eesen project
http://arxiv.org/abs/1507.08240
Apache License 2.0

Using dropout makes training very slow #111

Open bmilde opened 7 years ago

bmilde commented 7 years ago

I tried using dropout and it slows training down dramatically for me (roughly 10x slower).

Looking at net/bilstm-parallel-layer.h:177, this code is the only difference compared to training the network without dropout:

  if (drop_factor_ != 0.0) {
    drop_mask_.Resize(T*S, 2 * cell_dim_, kUndefined);
    drop_mask_.SetRandUniform();
    drop_mask_.Add(-drop_factor_);
    drop_mask_.ApplyHeaviside();
    YR_RB.RowRange(S,T*S).MulElements(drop_mask_);
  }

I can see that the dropout code calls SetRandUniform(), which in turn creates a new CuRand tmp; on every call. Might this slow down training because the GPU generator is reseeded every time? I also suspect that the seeding is done on the host, as these comments in gpucompute/cuda-rand.cc suggest:

// optionally re-seed the inner state 
// (this is done in host, for good performance it is better to avoid re-seeding)
int32 tgt_size = probs.num_rows_ * probs.stride_;
if (tgt_size != state_size_) SeedGpu(tgt_size);

I will try to reuse the CuRand object and see what happens... meanwhile, if someone else encountered the same problem, let me know.

bmilde commented 7 years ago

Reusing the CuRand object does help! I'm guessing that seeding the CuRand object was happening on the host, not on the GPU. However, my quick fix only works if all the layers that use dropout have the same size.
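One way around the same-size limitation might be to cache one generator per state size, so the expensive seeding (the `SeedGpu` path in cuda-rand.cc) happens at most once per distinct layer size rather than on every call. A rough sketch, with `std::mt19937` standing in for the GPU-side CuRand state and all names hypothetical:

```cpp
#include <map>
#include <random>

// Sketch of generalizing the quick fix: instead of one shared CuRand
// object (which only works when every dropout layer has the same size),
// keep one generator per state size. Re-seeding then happens at most
// once per distinct size, no matter how many minibatches are processed.
class RandCache {
 public:
  // Returns a generator seeded once for this state size and reused
  // on every subsequent call with the same size.
  std::mt19937& GetGenerator(int state_size) {
    auto it = cache_.find(state_size);
    if (it == cache_.end()) {
      // Expensive path (the analogue of SeedGpu): taken once per size.
      it = cache_.emplace(state_size, std::mt19937(state_size)).first;
      ++seed_count_;
    }
    return it->second;
  }
  int seed_count() const { return seed_count_; }

 private:
  std::map<int, std::mt19937> cache_;
  int seed_count_ = 0;
};
```

With this, a network whose dropout layers have, say, two distinct sizes would trigger exactly two seedings total over the whole training run.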