Issue by sleepinyourhat Thursday Jun 28, 2018 at 21:36 GMT
Originally opened as https://github.com/nyu-mll/jiant/issues/64
Not top priority, but the largest model gets about 150 steps per minute, so a large training run (500k steps) could take two or three days. If anyone has spare bandwidth, do some CPU profiling and make sure we're not wasting time on anything. If you're very bored, try some GPU profiling too, though I doubt there's much to optimize there.
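A minimal sketch of the kind of CPU profiling being suggested here, using Python's built-in cProfile around a few training steps. `model`, `optimizer`, and `batch_iterator` are generic placeholders for whatever the real training loop uses, not jiant names:

```python
# Sketch: CPU-profile a handful of training steps with cProfile and print the
# functions with the largest cumulative time. All names below are placeholders.
import cProfile
import pstats

def profile_training(model, optimizer, batch_iterator, n_steps=50):
    profiler = cProfile.Profile()
    profiler.enable()
    for step, batch in enumerate(batch_iterator):
        if step >= n_steps:
            break
        optimizer.zero_grad()
        loss = model(batch)          # assumes the model returns a scalar loss
        loss.backward()
        optimizer.step()
    profiler.disable()
    stats = pstats.Stats(profiler).sort_stats("cumulative")
    stats.print_stats(25)            # top 25 entries by cumulative time
```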
Comment by W4ngatang Friday Jun 29, 2018 at 15:23 GMT
This will probably be a biggish issue... even the base RNNs are taking quite a while (> 1 day), largely because training just the classifiers on QQP, QNLI, and MNLI is slow on its own.
Comment by sleepinyourhat Friday Jun 29, 2018 at 15:28 GMT
While I wait on other things, I'll look into this. Worst case, drop down to 1000D?
Comment by W4ngatang Friday Jun 29, 2018 at 15:39 GMT
Yeah, our early results indicate only a small difference between 1000D and 1500D.
Other knobs to fiddle with:
Will think of better speedups beyond shrinking the model...
Comment by W4ngatang Friday Jun 29, 2018 at 15:50 GMT
Comment by sleepinyourhat Friday Jun 29, 2018 at 16:35 GMT
Ran cProfile. Nothing seems like low-hanging fruit for improvement.
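When cProfile doesn't surface anything obvious, a per-operator breakdown from PyTorch's autograd profiler can show whether time is going to the char CNN, the LSTMs, or data prep, and it also tracks CUDA kernels that cProfile can't see. A rough sketch, with `model` and `batch` as placeholders:

```python
# Sketch: per-operator timing (CPU + CUDA) for one forward/backward pass.
# `model` and `batch` are placeholders for the real objects.
import torch

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    loss = model(batch)
    loss.backward()
print(prof.key_averages().table(sort_by="cuda_time_total"))
```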
Comment by sleepinyourhat Friday Jun 29, 2018 at 17:08 GMT
Smart batching + smart unrolling could be fairly big if we're not doing it. 2-3x?
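For reference, "smart batching" here means grouping examples by length so each batch only pads, and the RNN only unrolls, to that batch's own max length rather than the longest example in the dataset. A minimal sketch of the idea, with `examples` standing in for the real data structure:

```python
# Sketch of length-bucketed ("smart") batching: sort by length, slice into
# batches, then shuffle batch order so training order isn't purely by length.
import random

def bucketed_batches(examples, batch_size, shuffle=True):
    order = sorted(range(len(examples)), key=lambda i: len(examples[i]))
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    if shuffle:
        random.shuffle(batches)      # shuffle batch order, keep lengths grouped
    for batch_idx in batches:
        yield [examples[i] for i in batch_idx]
```

Combined with something like `torch.nn.utils.rnn.pack_padded_sequence`, this keeps the recurrent layers from spending time on padding at all.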
Comment by sleepinyourhat Friday Jun 29, 2018 at 17:38 GMT
I can get a 1.75x speedup by moving to 1000D for the main RNN and 256D for the attn RNN, but I'm on the fence. Just shrinking the main RNN alone doesn't speed things up as much (1.4x), and I worry that making the attn RNN too tiny will start to change the behavior of the model. The big char CNN puts a pretty low upper bound on speed.
One layer will definitely change behavior. I vote no.
Comment by W4ngatang Friday Jun 29, 2018 at 18:08 GMT
How big is the charCNN? We could cache its outputs, like Ian mentioned.
What are you voting for / on?
Comment by iftenney Friday Jun 29, 2018 at 18:14 GMT
A dynamic in-memory cache of char encodings wouldn't be impossible, if we want to avoid precomputing & dumping to disk. Could use a PyTorch expert for help there :)
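For concreteness, the dynamic cache idea might look roughly like the sketch below: memoize the char-CNN output per token type instead of recomputing it every batch. This is only valid if the char CNN is frozen; if its weights are being updated, the cached vectors go stale after every step (which is part of why it's bug-prone). `char_cnn` and `to_char_ids` are placeholders, not actual jiant modules:

```python
# Sketch of an in-memory cache of char-CNN token encodings, keyed by token type.
# Assumes the char CNN is frozen, so cached vectors never need invalidation.
import torch

class CachedCharEncoder:
    def __init__(self, char_cnn, to_char_ids):
        self.char_cnn = char_cnn          # frozen char-CNN module (placeholder)
        self.to_char_ids = to_char_ids    # token string -> char-id tensor (placeholder)
        self._cache = {}                  # token string -> encoding tensor

    def encode(self, token):
        if token not in self._cache:
            with torch.no_grad():         # no gradients through the frozen encoder
                char_ids = self.to_char_ids(token).unsqueeze(0)
                self._cache[token] = self.char_cnn(char_ids).squeeze(0)
        return self._cache[token]

    def encode_batch(self, tokens):
        # Stack per-token encodings for a flat list of tokens.
        return torch.stack([self.encode(t) for t in tokens])
```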
Comment by sleepinyourhat Friday Jun 29, 2018 at 18:30 GMT
I'm a soft no on char caching—it sounds like the kind of thing that could be a biggish source of bugs in exchange for a fairly small speedup.
Comment by sleepinyourhat Friday Jun 29, 2018 at 18:31 GMT
Clarifying the above, I don't support moving to a one-layer LSTM.
Comment by sleepinyourhat Friday Jun 29, 2018 at 18:36 GMT
Switched to 1024D. Not sure there's that much else we should do.