This will probably be a biggish issue... Even the base RNNs are taking quite a while (> 1 day), largely because training just the classifiers on QQP, QNLI, and MNLI is slow on its own.
While I wait on other things, I'll look into this. Worst case, drop down to 1000D?
Yeah, our early results indicate only a small difference between 1000D and 1500D.
Other knobs to fiddle with:
Will think of better speedups beyond shrinking the model...
Ran cProfile. Nothing seems like low-hanging fruit for improvement.
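For reference, this is roughly how I ran it, profiling a short, representative slice of training rather than a full run. `trainer`, `train_step`, and `batches` are stand-ins, not names from this repo:

```python
import cProfile
import pstats

def run_steps(n_steps, trainer, batches):
    # Hypothetical driver: run a handful of training steps under the profiler.
    for _, batch in zip(range(n_steps), batches):
        trainer.train_step(batch)

profiler = cProfile.Profile()
profiler.enable()
run_steps(100, trainer, batches)
profiler.disable()

# Show the 20 most expensive calls by cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```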
Smart batching + smart unrolling could be fairly big if we're not doing it. 2-3x?
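To be concrete about what I mean by smart batching: group examples of similar length so each batch only pads (and unrolls) to the longest sequence in that batch, not in the whole dataset. A minimal sketch, with illustrative names:

```python
import random

def length_bucketed_batches(examples, batch_size, key=len):
    """Sort examples by length, slice into batches, then shuffle the batches.
    Padding/unrolling cost per batch is set by that batch's longest example."""
    examples = sorted(examples, key=key)
    batches = [examples[i:i + batch_size]
               for i in range(0, len(examples), batch_size)]
    random.shuffle(batches)  # keep some randomness across training steps
    return batches
```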
I can get a 1.75x speedup by moving to 1000D for the main RNN and 256D for the attn RNN, but I'm on the fence. Just shrinking the main RNN alone doesn't speed things up as much (1.4x), and I worry that making the attn RNN too tiny will start to change the behavior of the model. The big char CNN puts a pretty low upper bound on speed.
One layer will definitely change behavior. I vote no.
How big is the charCNN? We can cache its outputs, like Ian had mentioned.
What are you voting for / on?
A dynamic in-memory cache of char encodings wouldn't be impossible, if we want to avoid precomputing and dumping to disk. Could use a PyTorch expert for help there :)
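Roughly what I had in mind, as a sketch only (the class and `char_cnn` interface are hypothetical, and it sidesteps the gradient question by detaching, which is exactly where the bugs would live if the char CNN is still being trained):

```python
import torch

class CharEncodingCache:
    """Lazily caches char-CNN outputs per token string. Only safe if the char
    CNN is frozen or if detached/stale encodings are acceptable."""

    def __init__(self, char_cnn, max_entries=500_000):
        self.char_cnn = char_cnn  # hypothetical module: token -> encoding tensor
        self.max_entries = max_entries
        self._cache = {}

    def encode(self, token):
        cached = self._cache.get(token)
        if cached is None:
            with torch.no_grad():
                cached = self.char_cnn(token).detach()
            if len(self._cache) < self.max_entries:
                self._cache[token] = cached
        return cached
```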
I'm a soft no on char caching—it sounds like the kind of thing that could be a biggish source of bugs in exchange for a fairly small speedup.
Clarifying the above, I don't support moving to a one-layer LSTM.
Switched to 1024D. Not sure there's that much else we should do.
Not top priority, but the largest model gets about 150 steps per minute, so a large training run (500k steps) could take two or three days. If anyone has spare bandwidth, do some CPU profiling and make sure we're not wasting time on anything. If you're very bored, try some GPU profiling too, though I doubt there's much to optimize there.
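For whoever picks this up, one possible starting point besides cProfile is PyTorch's built-in autograd profiler, which covers both CPU and CUDA time. A minimal sketch; `model` and `batch` are placeholders for whatever the trainer actually uses, and the forward pass is assumed to return a scalar loss:

```python
import torch
from torch.autograd import profiler

with profiler.profile(use_cuda=torch.cuda.is_available()) as prof:
    for _ in range(10):          # a handful of steps is usually representative
        loss = model(batch)      # placeholder forward pass returning a scalar
        loss.backward()

sort_key = "cuda_time_total" if torch.cuda.is_available() else "cpu_time_total"
print(prof.key_averages().table(sort_by=sort_key))
```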