albertz opened this issue 1 year ago
After all the recent optimizations, now looking at the profiling (`demo-rf-pt-benchmark.py`, using Py-Spy with `--native`, with GPU), I don't see any obvious low-level bottleneck anymore.
Sorted by total time: (profiler output not included here.)

Sorted by self-time: (profiler output not included here.)
The open points above are still possible optimizations we could do, but looking at these profiling stats, I think they would give only a very minor speedup.
Overall: (profiler output not included here.)
Some things to notice:

- `compareOrCombineViaCached` (all bin ops like add, mul, etc., also comparisons like eq, etc.): It's unexpected to me that this takes by far the most time, twice as much as `matmul`.
- `matmul`: Most of it is via Linear, but then also a bit for attention. I would have expected that this would take the most time.
- `reduce`: mostly for LayerNorm/BatchNorm. You also see that the masking takes quite some time.
- Fused ops for `linear`, `layer_norm`, etc. (also check batch norm, dropout): that would reduce the bin ops quite a bit.

Now for Librispeech, one subepoch (1/20 epoch) takes 16:40min on an Nvidia 1080 GPU with a Conformer (setup/config), which is about the same as we see with TensorFlow for exactly the same model, as reported by @mmz33. (Same batch size, but RF PT is still missing CTC.)
The computation time is 99% now, after #1383 was fixed and is used (`num_workers=1`).
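
For reference, a minimal config sketch for the data-loader setting mentioned above (assuming RETURNN's `torch_dataloader_opts` config option; treat this as an illustration, not an exact recipe):

```python
# Sketch of a RETURNN config fragment (assumes the torch_dataloader_opts option):
# one data-loader worker process, so data preparation overlaps with GPU compute.
torch_dataloader_opts = {"num_workers": 1}
```
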
Still, the GPU utilization is only about 60% on average (with high fluctuations, between 25% and 100%). We discussed that this is maybe not due to RF/PT anymore, but just an inherent property of this model, maybe the AED part, maybe the LSTM-based decoder. E.g. with batch size 1, you probably would have no bottleneck due to RF/PT, and the GPU would be maximally used from the PT side, but you would still probably have only a small GPU utilization, as many cores would idle most of the time. A few ops (LSTM) also cannot be parallelized at all then.
So maybe we already did most of the work here on the RF optimization side, and it can be considered done. Or even if we want to optimize this further, maybe scripting/tracing (#1436) makes more sense at this point.
There are a few other orthogonal optimizations, which I might try next, as listed here: https://github.com/rwth-i6/returnn/wiki/PyTorch-optimizations
Here I want to collect some things to be done to speed up eager-mode execution. Most of this did not really matter in graph-mode execution, where those extra things are only executed once. These are extra checks, or just slightly inefficient handling.
It is mostly about the RF with PyTorch backend, but also about potentially other eager-mode backends (e.g. TF eager-mode), or also just about faster usage of our `Tensor` and `Dim` classes.

- `__eq__`, `__ne__`, Tensor `get_axes_from_description`, `get_axis_from_description`
- `__eq__`
- `__eq__`, `__hash__`
- `raw_tensor` assign
- `copy` more efficient via better `get_kwargs`
- `copy`, directly assign `_raw_tensor`
- `dims_set`, prefer `dims`, if possible
- `__add__` etc.: avoid `_rf()`? more direct `rf.combine` call, or even directly backend `combine`? (#1403)
- `convert_to_tensor`, `bin_op_out_template`, `combine_raw`, etc. inline in C++ -> #1403
- `convert_to_tensor` also can have special code to allow for scalars and just keep them that way. (#1403)
- `bin_op_out_template` can use some simpler variant of `copy_compatible_to` which returns only the raw tensor, not a `Tensor`: `copy_compatible_to_raw` (#1403)
- `rf.combine`: avoid `import` (obsolete with #1403)
- `combine`: faster `bin_op_out_template`, opt case of scalars (#1403)
- `combine_raw` (obsolete with #1403)
- `_raw_backend` (#1403)
- `Tensor.copy_compatible_to`, `Tensor.copy_compatible_to_raw`, `Tensor.copy_compatible_to_dims`, mostly reusing the same logic as in `combine` and `compare` if possible, but then also falling back to the generic code (#1404)
- `copy_compatible_to` (actually obsolete, we would mostly use `copy_compatible_to_dims` instead, #1404)
- `copy_compatible_to_raw` as an alternative to `copy_compatible_to`, common use case, avoids creating broadcast `Dim` objects (obsolete via `copy_compatible_to_dims_raw`, #1404)
- `copy_compatible_to_dims` and `copy_compatible_to_dims_raw` as other simpler alternatives to `copy_compatible_to`, just accepting `dims` for the target (#1404)
- `Tensor.copy_compatible_to_dims` and `Tensor.copy_compatible_to_dims_raw` native (#1409)
- `get_dim_value_tensor`: we should cache the value, currently uses `reduce_max` every call. See Dim `reset_eager` logic. (#1414)
- `Tensor.copy_transpose` native (partly #1413)
- `Tensor.__init__` native
- `cache_seq_mask` not part of `_extra`?
- `_extra` used often because of broadcast dummy dims via `copy_compatible_to`. Can we avoid this? It's mostly because of `batch`. (Also `auto_generated`, `derived_from_op`, probably others, not so rare.) We might avoid it when checking for the kwarg `batch=None`. What is the behavior of `auto_generated` in RF? Is `auto_generated` actually correct for `copy_add_dim_by_tag`?
- `Dim.__init__` native
- `Dim` equality (`__eq__`, `__hash__`) without `derived_from_op` logic (#1418)
- `Dim.__eq__` native
- `Dim.__hash__` native
- `Dim.get_same_base` native
- `matmul`
- `reduce`
- `get_dtype_name_raw`: Just the Torch dtype `repr` call takes time by creating the string object. Inside our C++ code, we could speed up by having just a lookup table for common dtypes. (#1433)
- `Dim.copy_from`: We can share the caches `cache_dyn_size_ext_dev` and `cache_seq_mask`. For self-attention, when we create a new copy of the dim, we would otherwise recompute the mask again and again. Note that this is a bit tricky for the cache though, as the dim tags do not match. (#1417)
- Dim `cache_seq_mask` could maybe also have the `dim_order` in the key, because this might otherwise still require `copy_transpose` for every usage. (#1417)
- Dim math (`__add__`, `__mul__` etc.) is not really so rare, thus optimizations:
  - `_extra` to main object
  - Do not use dim math in the RF? It heavily complicates things, and what is the benefit? Slightly simpler equality checks in some rare cases? On the other side, we commonly need to calculate dyn seq sizes in one way or another in any case. What is actually complicated? The simplification logic maybe only? The equality check? Edit: The equality check is now removed (#1418), and otherwise we anyway need some similar functionality, so I think we can leave it like this for now.
- `Linear`: fused op matmul + bias. In Torch, that is `torch.nn.functional.linear` (see the sketch after this list).
- Torch C++ level (`include`): directly accessing the underlying `THPVariable` or `Tensor` might be an option, or maybe using the C interface. But I'm not sure if this is worth it.
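
To make the fused-linear point above concrete, here is a minimal plain-PyTorch sketch (made-up shapes, not RF code): `torch.nn.functional.linear` performs the matmul and the bias add in a single op, whereas the unfused variant issues two separate eager ops.

```python
import torch

# Minimal sketch of the fused-linear idea (plain PyTorch, made-up shapes, not RF code):
# torch.nn.functional.linear does matmul + bias add in one op, instead of a separate
# matmul followed by a broadcast add (one less eager op, one less temporary tensor).
x = torch.randn(8, 100, 512)   # (batch, time, in_features)
w = torch.randn(2048, 512)     # (out_features, in_features)
b = torch.randn(2048)

y_unfused = torch.matmul(x, w.T) + b           # two ops: matmul, then add
y_fused = torch.nn.functional.linear(x, w, b)  # one fused op
assert torch.allclose(y_unfused, y_fused, atol=1e-4)
```
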
For profiling, we can use many tools, but it's not trivial.
Tools:
Conceptually, we want to find multiple things:
To better measure just RF code, we also have some other options:

- Run the ops on the `meta` device: then we only see RF code in the profiler, and see how much time we spend per step due to that. This is a bit tricky though, because not all ops are supported, or some code might not expect this. Edit: Ok, it was easier than expected; there were almost no unsupported ops. However, the assumption was wrong that ops on the meta device would be for free. In the profiler, I see quite some percentage also in those. It seems it does some Python logic internally.
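
As a rough illustration of the meta-device idea (plain PyTorch, made-up shapes; not the actual RF integration):

```python
import torch

# Rough sketch of the meta-device idea (plain PyTorch, made-up shapes):
# meta tensors only track shape/dtype, so the raw ops do no real computation,
# and the idea is that a profile then mostly shows the RF/Python overhead around them
# (though, as noted above, the meta ops themselves are not completely free).
x = torch.empty(8, 100, 512, device="meta")
w = torch.empty(512, 512, device="meta")
y = torch.matmul(x, w) + 1.0   # shapes/dtypes are propagated, but no actual math happens
print(y.shape, y.device)       # torch.Size([8, 100, 512]) meta
# Anything that needs real values (e.g. y.item() or moving to CPU) is not supported here.
```
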
Orthogonal:

PyTorch supports multiple ways to compile the code. So we could keep eager mode for debugging and/or the first few steps, and then compile a graph, just like graph-mode in TF.
Options:
- Scripting/tracing (not using our `Tensor` class), except maybe for some very specific carefully selected functions. (#1436)
- `torch.compile` (see the sketch below)
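
A minimal sketch of the `torch.compile` option (plain PyTorch >= 2.0, with a hypothetical toy step function; not the actual RETURNN integration):

```python
import torch

# Minimal sketch of the torch.compile option (PyTorch >= 2.0; toy step function,
# not the actual RETURNN integration): stay eager for debugging / the first steps,
# then let TorchDynamo capture a graph and reuse the compiled version afterwards.
model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU())

def train_step(x: torch.Tensor) -> torch.Tensor:
    return model(x).sum()

compiled_step = torch.compile(train_step)  # graph capture + backend codegen
x = torch.randn(8, 512)
loss_eager = train_step(x)        # plain eager execution
loss_compiled = compiled_step(x)  # first call compiles; subsequent calls reuse the graph
```
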
Also see this overview for PyTorch in general: https://github.com/rwth-i6/returnn/wiki/PyTorch-optimizations