nyu-mll / jiant-v1-legacy

The jiant toolkit for general-purpose text understanding models
MIT License

[CLOSED] Worry about speed #64

Closed jeswan closed 4 years ago

jeswan commented 4 years ago

Issue by sleepinyourhat Thursday Jun 28, 2018 at 21:36 GMT Originally opened as https://github.com/nyu-mll/jiant/issues/64


Not top priority, but the largest model gets about 150 steps per minute, so a large training run (500k steps) could take two or three days. If anyone has spare bandwidth, do some CPU profiling and make sure we're not wasting time on anything. If you're very bored, try some GPU profiling too, though I doubt there's much to optimize there.
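For anyone picking this up, here is a minimal sketch of how those CPU numbers could be collected with Python's built-in cProfile. `trainer`, `batch_iterator`, and `train_step` are hypothetical stand-ins for the actual training-loop objects in the run script, not jiant APIs:

```python
import cProfile
import pstats

def profile_training(trainer, n_steps=200):
    """Run a handful of training steps under cProfile and dump the hot spots."""
    profiler = cProfile.Profile()
    profiler.enable()
    # `trainer.batch_iterator()` and `trainer.train_step()` are placeholders
    # for however the run script actually iterates batches and takes a step.
    for _, batch in zip(range(n_steps), trainer.batch_iterator()):
        trainer.train_step(batch)
    profiler.disable()

    # Show the 25 most expensive calls by cumulative CPU time.
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(25)
```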

jeswan commented 4 years ago

Comment by W4ngatang Friday Jun 29, 2018 at 15:23 GMT


This will probably be a biggish issue... Even the base RNNs are taking quite a while (> 1 day), largely because just training the classifiers on QQP, QNLI, and MNLI is slow on its own.

jeswan commented 4 years ago

Comment by sleepinyourhat Friday Jun 29, 2018 at 15:28 GMT


While I wait on other things, I'll look into this. Worst case, drop down to 1000D?

jeswan commented 4 years ago

Comment by W4ngatang Friday Jun 29, 2018 at 15:39 GMT


Yeah, our early results indicate only a small difference between 1000D and 1500D.

Other knobs to fiddle with:

Will think of better speedups beyond shrinking the model...

jeswan commented 4 years ago

Comment by W4ngatang Friday Jun 29, 2018 at 15:50 GMT


jeswan commented 4 years ago

Comment by sleepinyourhat Friday Jun 29, 2018 at 16:35 GMT


Ran cProfile. Nothing seems like low-hanging fruit for improvement.

jeswan commented 4 years ago

Comment by sleepinyourhat Friday Jun 29, 2018 at 17:08 GMT


Smart batching + smart unrolling could be a fairly big win if we're not already doing them. 2-3x?
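For concreteness, a rough sketch of what "smart batching" (bucketing by length) plus "smart unrolling" (padding only to the per-batch max) could look like; the helper names below are made up for illustration and don't correspond to jiant functions:

```python
import random

def bucket_batches(examples, batch_size):
    """Smart batching: group sequences of similar length so padding is minimal.

    `examples` is any list of token-id lists; nothing here is jiant-specific.
    """
    # Sort by length so each batch contains sequences of similar length.
    ordered = sorted(examples, key=len)
    batches = [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
    # Shuffle the batch order so training still sees lengths in a random order.
    random.shuffle(batches)
    return batches

def pad_to_batch_max(batch, pad_id=0):
    """Smart unrolling: pad (and unroll the RNN) only to this batch's max
    length, rather than to a global maximum."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
```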

jeswan commented 4 years ago

Comment by sleepinyourhat Friday Jun 29, 2018 at 17:38 GMT


I can get a 1.75x speedup by moving to 1000D for the main RNN and 256D for the attn RNN, but I'm on the fence. Just shrinking the main RNN alone doesn't speed things up as much (1.4x), and I worry that making the attn RNN too tiny will start to change the behavior of the model. The big char CNN puts a pretty low upper bound on speed.

One layer will definitely change behavior. I vote no.

jeswan commented 4 years ago

Comment by W4ngatang Friday Jun 29, 2018 at 18:08 GMT


How big is the char CNN? We can cache its outputs, like Ian mentioned.

What are you voting for / on?

jeswan commented 4 years ago

Comment by iftenney Friday Jun 29, 2018 at 18:14 GMT


A dynamic in-memory cache of char encodings wouldn't be impossible, if we want to avoid precomputing and dumping to disk. Could use a PyTorch expert for help there :)
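As a rough illustration of that idea, such a cache could look something like the sketch below. `CachedCharEncoder` and its interface are hypothetical, and the approach only makes sense if the char encoder's weights are frozen:

```python
import torch

class CachedCharEncoder(torch.nn.Module):
    """Memoizes per-token outputs of a frozen character encoder in memory.

    `char_encoder` is a placeholder for whatever module maps character ids
    to a token vector; if its weights are still being trained, the cached
    vectors go stale, so this only works for a frozen char CNN.
    """

    def __init__(self, char_encoder):
        super().__init__()
        self.char_encoder = char_encoder
        self._cache = {}  # token key -> detached output tensor

    def forward(self, char_ids, token_keys):
        # char_ids: [num_tokens, max_chars] character ids, one row per token.
        # token_keys: hashable keys (e.g. the token strings) used for lookup.
        outputs = []
        for i, key in enumerate(token_keys):
            if key not in self._cache:
                with torch.no_grad():
                    self._cache[key] = self.char_encoder(char_ids[i:i + 1]).detach()
            outputs.append(self._cache[key])
        return torch.cat(outputs, dim=0)
```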

jeswan commented 4 years ago

Comment by sleepinyourhat Friday Jun 29, 2018 at 18:30 GMT


I'm a soft no on char caching—it sounds like the kind of thing that could be a biggish source of bugs in exchange for a fairly small speedup.

jeswan commented 4 years ago

Comment by sleepinyourhat Friday Jun 29, 2018 at 18:31 GMT


Clarifying the above, I don't support moving to a one-layer LSTM.

jeswan commented 4 years ago

Comment by sleepinyourhat Friday Jun 29, 2018 at 18:36 GMT


Switched to 1024D. Not sure there's that much else we should do.