albertz opened this issue 1 year ago
After all the recent optimizations, now looking at the profiling (`demo-rf-pt-benchmark.py`, using Py-Spy with `--native`, with GPU), I don't see any obvious low-level bottleneck anymore.
Sorted by total time: (profiler output not included here.)

Sorted by self-time: (profiler output not included here.)
The open points above are still possible optimizations we could do, but looking at these profiling stats, I think they would give only a very minor speedup.
Overall: (profiler output not included here.)
Some things to notice:

- `compareOrCombineViaCached` (all bin ops like add, mul, etc., also comparisons like eq, etc.): It's unexpected to me that this takes by far the most time, twice as much as `matmul`.
- `matmul`: Most of it is via Linear, but then also a bit for attention. I would have expected that this would take the most time.
- `reduce`: mostly for LayerNorm/BatchNorm. You also see that the masking takes quite some time.
- Fused ops for `linear`, `layer_norm`, etc. (also check batch norm, dropout): that would reduce the bin ops quite a bit.

Now for Librispeech, one subepoch (1/20 epoch) takes 16:40min on an Nvidia 1080 GPU with a Conformer (setup/config), which is about the same as we see with TensorFlow for exactly the same model, as reported by @mmz33. (Same batch size, but RF PT is still missing CTC.)
The computation time is 99% now, after #1383 was fixed and is used (`num_workers=1`).
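
For reference, a minimal config sketch for the data-loader setting mentioned above (assuming RETURNN's `torch_dataloader_opts` config option; treat this as an illustration, not an exact recipe):

```python
# Sketch of a RETURNN config fragment (assumes the torch_dataloader_opts option):
# one data-loader worker process, so data preparation overlaps with GPU compute.
torch_dataloader_opts = {"num_workers": 1}
```
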
Still, the GPU utilization is only about 60% on average (with high fluctuations, between 25% and 100%). We discussed that this is maybe not due to RF/PT anymore, but just an inherent property of this model, maybe the AED part, maybe the LSTM-based decoder. E.g. with batch size 1, you probably would have no bottleneck due to RF/PT, and the GPU would be maximally used from the PT side, but you would still probably have only a small GPU utilization, as many cores would idle most of the time. A few ops (LSTM) also cannot be parallelized at all then.
So maybe we already did most of the work here on the RF optimization side, and it can be considered done. Or even if we want to optimize this further, maybe scripting/tracing (#1436) makes more sense at this point.
There are a few other orthogonal optimizations, which I might try next, as listed here: https://github.com/rwth-i6/returnn/wiki/PyTorch-optimizations
Here I want to collect some things to be done to speed up eager-mode execution. Most of this did not really matter in graph-mode execution, where those extra things are only executed once. These are extra checks, or just slightly inefficient handling.
It is mostly about the RF with PyTorch backend, but also about potentially other eager-mode backends (e.g. TF eager-mode), or also just about faster usage of our `Tensor` and `Dim` classes.

- `__eq__`, `__ne__`, Tensor `get_axes_from_description`, `get_axis_from_description`
- `__eq__`
- `__eq__`, `__hash__`
- `raw_tensor` assign
- `copy` more efficient via better `get_kwargs`
- `copy`, directly assign `_raw_tensor`
- `dims_set`, prefer `dims`, if possible
- `__add__` etc.: avoid `_rf()`? more direct `rf.combine` call, or even directly backend `combine`? (#1403)
- `convert_to_tensor`, `bin_op_out_template`, `combine_raw`, etc. inline in C++ -> #1403
- `convert_to_tensor` also can have special code to allow for scalars and just keep them that way. (#1403)
- `bin_op_out_template` can use some simpler variant of `copy_compatible_to` which returns only the raw tensor, not a `Tensor`: `copy_compatible_to_raw` (#1403)
- `rf.combine`: avoid `import` (obsolete with #1403)
- `combine`: faster `bin_op_out_template`, opt case of scalars (#1403)
- `combine_raw` (obsolete with #1403)
- `_raw_backend` (#1403)
- `Tensor.copy_compatible_to`, `Tensor.copy_compatible_to_raw`, `Tensor.copy_compatible_to_dims`, mostly reusing the same logic as in `combine` and `compare` if possible, but then also falling back to the generic code (#1404)
- `copy_compatible_to` (actually obsolete, we would mostly use `copy_compatible_to_dims` instead, #1404)
- `copy_compatible_to_raw` as an alternative to `copy_compatible_to`, common use case, avoids creating broadcast `Dim` objects (obsolete via `copy_compatible_to_dims_raw`, #1404)
- `copy_compatible_to_dims` and `copy_compatible_to_dims_raw` as other simpler alternatives to `copy_compatible_to`, just accepting `dims` for the target (#1404)
- `Tensor.copy_compatible_to_dims` and `Tensor.copy_compatible_to_dims_raw` native (#1409)
- `get_dim_value_tensor`: we should cache the value, currently uses `reduce_max` every call. See Dim `reset_eager` logic. (#1414)
- `Tensor.copy_transpose` native (partly #1413)
- `Tensor.__init__` native
- `cache_seq_mask` not part of `_extra`?
- `_extra` used often because of broadcast dummy dims via `copy_compatible_to`. Can we avoid this? It's mostly because of `batch`. (Also `auto_generated`, `derived_from_op`, probably others, not so rare.) We might avoid it when checking for the kwarg `batch=None`. What is the behavior of `auto_generated` in RF? Is `auto_generated` actually correct for `copy_add_dim_by_tag`?
- `Dim.__init__` native
- `Dim` equality (`__eq__`, `__hash__`) without `derived_from_op` logic (#1418)
- `Dim.__eq__` native
- `Dim.__hash__` native
- `Dim.get_same_base` native
- `matmul`
- `reduce`
- `get_dtype_name_raw`: Just the Torch dtype `repr` call takes time by creating the string object. Inside our C++ code, we could speed up by having just a lookup table for common dtypes. (#1433)
- `Dim.copy_from`: We can share the caches `cache_dyn_size_ext_dev` and `cache_seq_mask`. For self-attention, when we create a new copy of the dim, we would otherwise recompute the mask again and again. Note that this is a bit tricky for the cache though, as the dim tags do not match. (#1417)
- Dim `cache_seq_mask` could maybe also have the `dim_order` in the key, because this might otherwise still require `copy_transpose` for every usage. (#1417)
- Dim math (`__add__`, `__mul__` etc.) is not really so rare, thus optimizations:
  - `_extra` to main object
  - Do not use dim math in the RF? It heavily complicates things, and what is the benefit? Slightly simpler equality checks in some rare cases? On the other side, we commonly need to calculate dyn seq sizes in one way or another in any case. What is actually complicated? The simplification logic maybe only? The equality check? Edit: The equality check is now removed (#1418), and otherwise we anyway need some similar functionality, so I think we can leave it like this for now.
- `Linear`: fused op matmul + bias. In Torch, that is `torch.nn.functional.linear` (see the sketch after this list).
- Torch C++ level (`include`): directly accessing the underlying `THPVariable` or `Tensor` might be an option, or maybe using the C interface. But I'm not sure if this is worth it.
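
To make the fused-linear point above concrete, here is a minimal plain-PyTorch sketch (made-up shapes, not RF code): `torch.nn.functional.linear` performs the matmul and the bias add in a single op, whereas the unfused variant issues two separate eager ops.

```python
import torch

# Minimal sketch of the fused-linear idea (plain PyTorch, made-up shapes, not RF code):
# torch.nn.functional.linear does matmul + bias add in one op, instead of a separate
# matmul followed by a broadcast add (one less eager op, one less temporary tensor).
x = torch.randn(8, 100, 512)   # (batch, time, in_features)
w = torch.randn(2048, 512)     # (out_features, in_features)
b = torch.randn(2048)

y_unfused = torch.matmul(x, w.T) + b           # two ops: matmul, then add
y_fused = torch.nn.functional.linear(x, w, b)  # one fused op
assert torch.allclose(y_unfused, y_fused, atol=1e-4)
```
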
For profiling, we can use many tools, but it's not trivial.
Tools:
Conceptually, we want to find multiple things:
To better measure just RF code, we also have some other options:

- Run the ops on the `meta` device: then we only see RF code in the profiler, and see how much time we spend per step due to that. This is a bit tricky though, because not all ops are supported, or some code might not expect this. Edit: Ok, it was easier than expected; there were almost no unsupported ops. However, the assumption was wrong that ops on the meta device would be for free. In the profiler, I see quite some percentage also in those. It seems it does some Python logic internally.
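
As a rough illustration of the meta-device idea (plain PyTorch, made-up shapes; not the actual RF integration):

```python
import torch

# Rough sketch of the meta-device idea (plain PyTorch, made-up shapes):
# meta tensors only track shape/dtype, so the raw ops do no real computation,
# and the idea is that a profile then mostly shows the RF/Python overhead around them
# (though, as noted above, the meta ops themselves are not completely free).
x = torch.empty(8, 100, 512, device="meta")
w = torch.empty(512, 512, device="meta")
y = torch.matmul(x, w) + 1.0   # shapes/dtypes are propagated, but no actual math happens
print(y.shape, y.device)       # torch.Size([8, 100, 512]) meta
# Anything that needs real values (e.g. y.item() or moving to CPU) is not supported here.
```
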
Orthogonal:

PyTorch supports multiple ways to compile the code. So we could keep eager mode for debugging and/or the first few steps, and then compile a graph, just like graph-mode in TF.
Options:
- Scripting/tracing (not using our `Tensor` class), except maybe for some very specific carefully selected functions. (#1436)
- `torch.compile` (see the sketch below)
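
A minimal sketch of the `torch.compile` option (plain PyTorch >= 2.0, with a hypothetical toy step function; not the actual RETURNN integration):

```python
import torch

# Minimal sketch of the torch.compile option (PyTorch >= 2.0; toy step function,
# not the actual RETURNN integration): stay eager for debugging / the first steps,
# then let TorchDynamo capture a graph and reuse the compiled version afterwards.
model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU())

def train_step(x: torch.Tensor) -> torch.Tensor:
    return model(x).sum()

compiled_step = torch.compile(train_step)  # graph capture + backend codegen
x = torch.randn(8, 512)
loss_eager = train_step(x)        # plain eager execution
loss_compiled = compiled_step(x)  # first call compiles; subsequent calls reuse the graph
```
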
Also see this overview for PyTorch in general: https://github.com/rwth-i6/returnn/wiki/PyTorch-optimizations