rwth-i6 / returnn

The RWTH extensible training framework for universal recurrent neural networks
http://returnn.readthedocs.io/

Speedup eager execution #1402

Open albertz opened 1 year ago

albertz commented 1 year ago

Here I want to collect some things to be done to speed up eager-mode execution. Most of it did not really matter in graph-mode execution, where those extra things are only executed once. These are extra checks, or just slightly inefficient handling.

It is mostly about the RF with the PyTorch backend, but also about potentially other eager-mode backends (e.g. TF eager mode), and more generally about faster usage of our Tensor and Dim classes.


For profiling, we can use many tools, but it's not trivial (see the sketch below).

Tools:

Conceptually, we want to find multiple things:

To better measure just RF code, we also have some other options:
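
As a concrete starting point, here is a minimal profiling sketch using torch.profiler (just an illustration, not the actual setup used in the comment below; `run_some_train_steps` is a placeholder for whatever drives the RF/PT train loop):

```python
import torch
from torch.profiler import profile, ProfilerActivity

def run_some_train_steps(n: int = 10):
    """Placeholder: run n RETURNN train steps here."""
    ...

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    with_stack=True,  # keep Python stacks, to attribute time to RF code
) as prof:
    run_some_train_steps()

# Total time vs. self-time, as used in the stats below:
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))
```

Alternatively, a sampling profiler like Py-Spy can be attached to the running process from the outside, e.g. `py-spy record --native -o profile.svg --pid <PID>`; the profiling in the comment below uses Py-Spy with --native.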


Orthogonal:

PyTorch supports multiple ways to compile the code, so we could keep eager execution for debugging and/or the first few steps, and then compile a graph, just like graph mode in TF (see the sketch further below).

Options:

Also see this overview for PyTorch in general: https://github.com/rwth-i6/returnn/wiki/PyTorch-optimizations
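
To illustrate the eager-vs-compiled idea from above (a minimal sketch, assuming PyTorch >= 2.0; not the actual RETURNN integration):

```python
import torch

def train_step(model: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    # Placeholder for the real RF/PT train step.
    return model(x).sum()

model = torch.nn.Linear(8, 8)
x = torch.randn(4, 8)

# Eager execution: easy to debug, but with Python overhead per op.
loss_eager = train_step(model, x)

# Compiled execution (TorchDynamo + default Inductor backend): the Python code is
# traced into a graph on the first call, similar to graph mode in TF.
compiled_train_step = torch.compile(train_step)
loss_compiled = compiled_train_step(model, x)
```

In practice, one could run the first few steps eagerly (for debugging and checks) and only then switch to the compiled variant.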

albertz commented 1 year ago

After all the recent optimizations, now looking at the profiling (demo-rf-pt-benchmark.py, using Py-Spy with --native, on GPU), I don't see any obvious low-level bottleneck anymore.

Sorted by total time:

[screenshot: profile sorted by total time]

Sorted by self-time:

[screenshot: profile sorted by self-time]

The open points above are still possible optimizations we could do, but looking at these profiling stats, I think they would only give a very minor speedup.

Overall:

[screenshot: overall profile]

Some things to notice:

albertz commented 1 year ago

Now for Librispeech, one subepoch (1/20 epoch) takes 16:40 min on an Nvidia 1080 GPU with a Conformer (setup/config), which is about the same as we see with TensorFlow for exactly the same model, as reported by @mmz33. (Same batch size, but RF PT is still missing CTC.) The computation time is at 99% now, after #1383 was fixed and is used (num_workers=1). Still, the GPU utilization is only about 60% on average, with high fluctuations between 25% and 100%. We discussed that this is maybe not due to RF/PT anymore but just an inherent property of this model, maybe the AED part, maybe the LSTM-based decoder. E.g. with batch size 1, you probably would have no bottleneck due to RF/PT, and the GPU would be maximally used from the PT side, but you would still probably have only a small GPU utilization, as many cores would idle most of the time. A few ops (LSTM) also cannot be parallelized at all then.
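
For reference, a minimal sketch of how such GPU utilization numbers can be sampled during training (assumes pynvml is installed and GPU index 0; the sampling loop would run e.g. in a background thread):

```python
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumption: first GPU

samples = []
for _ in range(60):  # sample once per second for ~1 minute
    util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu  # percent, same as nvidia-smi
    samples.append(util)
    time.sleep(1.0)

print("GPU util: avg %.1f%%, min %d%%, max %d%%"
      % (sum(samples) / len(samples), min(samples), max(samples)))
```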

So maybe we already did most of the work here on the RF optimization side, and it can be considered done. Or even if we want to optimize this further, maybe scripting/tracing (#1436) makes more sense at this point.
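
For context on the scripting/tracing option (#1436), a minimal sketch of torch.jit.trace on a toy module (not the RETURNN integration):

```python
import torch

class ToyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 8)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.linear(x))

model = ToyModel()
example_input = torch.randn(4, 8)

# Tracing records the ops executed for the example input into a static graph;
# data-dependent Python control flow would require torch.jit.script instead.
traced = torch.jit.trace(model, example_input)
out = traced(torch.randn(2, 8))  # reuses the recorded graph
```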

There are a few other orthogonal optimizations, which I might try next, as listed here: https://github.com/rwth-i6/returnn/wiki/PyTorch-optimizations