nyu-mll / jiant

jiant is an NLP toolkit
https://jiant.info
MIT License

Worry about speed #64

Closed by sleepinyourhat 6 years ago

sleepinyourhat commented 6 years ago

Not top priority, but the largest model gets about 150 steps per minute, so a large training run (500k steps) could take two or three days. If anyone has spare bandwidth, do some CPU profiling and make sure we're not wasting time on anything. If you're very bored, try some GPU profiling too, though I doubt there's much to optimize there.
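
A minimal sketch of the kind of CPU profiling this would involve, assuming a PyTorch-style training loop (`train_step`, `model`, and `batches` are hypothetical placeholders, not jiant APIs):

```python
import cProfile
import io
import pstats

def train_step(model, batch):
    # Hypothetical stand-in for one training step: forward pass
    # producing a scalar loss, then backward.
    loss = model(batch)
    loss.backward()
    return loss.item()

def profile_steps(model, batches, n_steps=50):
    """Profile a handful of training steps and print the hottest calls."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _, batch in zip(range(n_steps), batches):
        train_step(model, batch)
    profiler.disable()

    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(25)  # top 25 calls by cumulative time
    print(stream.getvalue())
```

Sorting by cumulative time should make it obvious whether the time is going into the forward pass, the backward pass, or data preprocessing.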

W4ngatang commented 6 years ago

This will probably be a biggish issue... even the base RNNs are taking more than a day, largely because training just the classifiers on QQP, QNLI, and MNLI takes quite a while.

sleepinyourhat commented 6 years ago

While I wait on other things, I'll look into this. Worst case, drop down to 1000D?

W4ngatang commented 6 years ago

Yeah, our early results indicate a small difference between 1000D and 1500D.

Other knobs to fiddle with:

Will think of better speedups beyond shrinking the model...

sleepinyourhat commented 6 years ago

Ran cProfile. Nothing seems like low-hanging fruit for improvement.

sleepinyourhat commented 6 years ago

Smart batching + smart unrolling could be fairly big if we're not doing it. 2-3x?
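
Roughly, a sketch of what smart batching + smart unrolling would look like, in case we aren't doing it: bucket examples by length so each batch pads (and the RNN unrolls) only to that batch's longest sequence. Names below are illustrative, not existing jiant code.

```python
import random

def bucket_batches(examples, batch_size, get_len=len):
    """Group similar-length examples so per-batch padding (and RNN
    unrolling) stays close to the true max length in each batch."""
    ordered = sorted(examples, key=get_len)
    batches = [ordered[i:i + batch_size]
               for i in range(0, len(ordered), batch_size)]
    random.shuffle(batches)  # keep batch order random for training
    return batches

def pad_batch(batch, pad_id=0):
    """Pad only to the longest sequence in this batch, not a global max."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
```

The gain scales with how much padding we currently waste, which is where a 2-3x number could come from.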

sleepinyourhat commented 6 years ago

I can get a 1.75x speedup by moving to 1000D for the main RNN and 256D for the attn RNN, but I'm on the fence. Just shrinking the main RNN alone doesn't speed things up as much (1.4x), and I worry that making the attn RNN too tiny will start to change the behavior of the model. The big char CNN puts a pretty low upper bound on speed.

One layer will definitely change behavior. I vote no.

W4ngatang commented 6 years ago

How big is the char CNN? We can cache its outputs, like Ian mentioned.

What are you voting for / on?

iftenney commented 6 years ago

A dynamic in-memory cache of char encodings wouldn't be impossible, if we want to avoid precomputing and dumping to disk. Could use a PyTorch expert for help there :)
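
A minimal sketch of what such a cache could look like, assuming the char encoding of a token depends only on that token (`char_cnn` and `token_to_char_ids` are placeholders, not actual jiant modules):

```python
import torch

class CachedCharEncoder(torch.nn.Module):
    """Wraps a char CNN and memoizes per-token encodings in memory.

    Only safe if the char CNN is frozen, since cached vectors are
    computed under no_grad and never refreshed.
    """

    def __init__(self, char_cnn, token_to_char_ids):
        super().__init__()
        self.char_cnn = char_cnn                    # existing char CNN module
        self.token_to_char_ids = token_to_char_ids  # token str -> fixed-width char id tensor
        self._cache = {}                            # token str -> encoded vector

    def forward(self, tokens):
        missing = [t for t in tokens if t not in self._cache]
        if missing:
            # Assumes token_to_char_ids pads/truncates every token to the
            # same number of characters, so the tensors stack cleanly.
            char_ids = torch.stack([self.token_to_char_ids(t) for t in missing])
            with torch.no_grad():
                encoded = self.char_cnn(char_ids)
            for tok, vec in zip(missing, encoded):
                self._cache[tok] = vec
        return torch.stack([self._cache[t] for t in tokens])
```

If the char CNN is being fine-tuned, the cache would have to be invalidated on every update, which mostly defeats the purpose.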

sleepinyourhat commented 6 years ago

I'm a soft no on char caching—it sounds like the kind of thing that could be a biggish source of bugs in exchange for a fairly small speedup.

sleepinyourhat commented 6 years ago

Clarifying the above, I don't support moving to a one-layer LSTM.

sleepinyourhat commented 6 years ago

Switched to 1024D. Not sure there's that much else we should do.