nyu-mll / jiant

jiant is an NLP toolkit
https://jiant.info
MIT License

Worry about speed #64

Closed by sleepinyourhat 6 years ago

sleepinyourhat commented 6 years ago

Not top priority, but the largest model gets about 150 steps per minute, so a large training run (500k steps) could take two or three days. If anyone has spare bandwidth, do some CPU profiling and make sure we're not wasting time on anything. If you're very bored, try some GPU profiling too, though I doubt there's much to optimize there.
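
A minimal sketch of the kind of CPU profiling this would involve, assuming a PyTorch-style training loop (`train_step`, `model`, and `batches` are hypothetical placeholders, not jiant APIs):

```python
import cProfile
import io
import pstats

def train_step(model, batch):
    # Hypothetical stand-in for one training step: forward pass
    # producing a scalar loss, then backward.
    loss = model(batch)
    loss.backward()
    return loss.item()

def profile_steps(model, batches, n_steps=50):
    """Profile a handful of training steps and print the hottest calls."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _, batch in zip(range(n_steps), batches):
        train_step(model, batch)
    profiler.disable()

    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(25)  # top 25 calls by cumulative time
    print(stream.getvalue())
```

Sorting by cumulative time should make it obvious whether the time is going into the forward pass, the backward pass, or data preprocessing.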

W4ngatang commented 6 years ago

This will probably be a biggish issue... even the base RNNs are taking more than a day, largely because training just the classifiers on QQP, QNLI, and MNLI takes quite a while.

sleepinyourhat commented 6 years ago

While I wait on other things, I'll look into this. Worst case, drop down to 1000D?

W4ngatang commented 6 years ago

Yeah, our early results indicate a small difference between 1000D and 1500D.

Other knobs to fiddle with:

Will think of better speedups beyond shrinking the model...

sleepinyourhat commented 6 years ago

Ran cProfile. Nothing seems like low-hanging fruit for improvement.

sleepinyourhat commented 6 years ago

Smart batching + smart unrolling could be fairly big if we're not doing it. 2-3x?
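
Roughly, a sketch of what smart batching + smart unrolling would look like, in case we aren't doing it: bucket examples by length so each batch pads (and the RNN unrolls) only to that batch's longest sequence. Names below are illustrative, not existing jiant code.

```python
import random

def bucket_batches(examples, batch_size, get_len=len):
    """Group similar-length examples so per-batch padding (and RNN
    unrolling) stays close to the true max length in each batch."""
    ordered = sorted(examples, key=get_len)
    batches = [ordered[i:i + batch_size]
               for i in range(0, len(ordered), batch_size)]
    random.shuffle(batches)  # keep batch order random for training
    return batches

def pad_batch(batch, pad_id=0):
    """Pad only to the longest sequence in this batch, not a global max."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
```

The gain scales with how much padding we currently waste, which is where a 2-3x number could come from.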

sleepinyourhat commented 6 years ago

I can get a 1.75x speedup by moving to 1000D for the main RNN and 256D for the attn RNN, but I'm on the fence. Just shrinking the main RNN alone doesn't speed things up as much (1.4x), and I worry that making the attn RNN too tiny will start to change the behavior of the model. The big char CNN puts a pretty low upper bound on speed.

One layer will definitely change behavior. I vote no.

W4ngatang commented 6 years ago

How big is the char CNN? We can cache its outputs, like Ian mentioned.

What are you voting for / on?

iftenney commented 6 years ago

A dynamic in-memory cache of char encodings wouldn't be impossible, if we want to avoid precomputing and dumping to disk. Could use a PyTorch expert for help there :)
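
A minimal sketch of what such a cache could look like, assuming the char encoding of a token depends only on that token (`char_cnn` and `token_to_char_ids` are placeholders, not actual jiant modules):

```python
import torch

class CachedCharEncoder(torch.nn.Module):
    """Wraps a char CNN and memoizes per-token encodings in memory.

    Only safe if the char CNN is frozen, since cached vectors are
    computed under no_grad and never refreshed.
    """

    def __init__(self, char_cnn, token_to_char_ids):
        super().__init__()
        self.char_cnn = char_cnn                    # existing char CNN module
        self.token_to_char_ids = token_to_char_ids  # token str -> fixed-width char id tensor
        self._cache = {}                            # token str -> encoded vector

    def forward(self, tokens):
        missing = [t for t in tokens if t not in self._cache]
        if missing:
            # Assumes token_to_char_ids pads/truncates every token to the
            # same number of characters, so the tensors stack cleanly.
            char_ids = torch.stack([self.token_to_char_ids(t) for t in missing])
            with torch.no_grad():
                encoded = self.char_cnn(char_ids)
            for tok, vec in zip(missing, encoded):
                self._cache[tok] = vec
        return torch.stack([self._cache[t] for t in tokens])
```

If the char CNN is being fine-tuned, the cache would have to be invalidated on every update, which mostly defeats the purpose.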

sleepinyourhat commented 6 years ago

I'm a soft no on char caching—it sounds like the kind of thing that could be a biggish source of bugs in exchange for a fairly small speedup.

sleepinyourhat commented 6 years ago

Clarifying the above, I don't support moving to a one-layer LSTM.

sleepinyourhat commented 6 years ago

Switched to 1024D. Not sure there's that much else we should do.