lucidrains / minGRU-pytorch

Implementation of the proposed minGRU in PyTorch
MIT License

minGRU only half as fast as torch GRU in tests #13

Open Fritschek opened 1 week ago

Fritschek commented 1 week ago

minGRU (without the LM layers) is considerably slower than a standard `nn.GRU`. My test parameters were: `input_size = 10`, `hidden_size = 100`, `seq_len = 1000`, `batch_size = 64`.
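For reference, a minimal timing harness along these lines (parameter values match the ones above; `nn.GRU` is timed as the baseline, and you would swap in whatever minGRU module you are testing):

```python
import time
import torch
import torch.nn as nn

# Parameters from the benchmark described above
input_size, hidden_size, seq_len, batch_size = 10, 100, 1000, 64

def time_forward(model, x, n_warmup=3, n_iters=10):
    """Average wall-clock time of one forward pass, with warmup."""
    with torch.no_grad():
        for _ in range(n_warmup):
            model(x)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_iters):
            model(x)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_iters

x = torch.randn(batch_size, seq_len, input_size)
gru = nn.GRU(input_size, hidden_size, batch_first=True)
print(f"nn.GRU: {time_forward(gru, x) * 1e3:.1f} ms/forward")
# Swap in the minGRU module here for the comparison
```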

From my profiler run in Google Colab, the top time sinks looked like this:

Profiling Results for MinGRU Model:

| Name | Self CPU % | Self CPU | CPU total % | CPU total | CPU time avg | Self CUDA | Self CUDA % | CUDA total | CUDA time avg | # of Calls |
|---|---|---|---|---|---|---|---|---|---|---|
| aten::_logcumsumexp | 0.22% | 281.061us | 0.56% | 705.686us | 47.046us | 15.687ms | 13.07% | 15.822ms | 1.055ms | 15 |
| aten::neg | 0.50% | 632.931us | 0.73% | 930.204us | 16.913us | 12.112ms | 10.09% | 12.112ms | 220.218us | 55 |
| aten::flip | 0.45% | 575.464us | 1.10% | 1.387ms | 46.243us | 7.588ms | 6.32% | 7.858ms | 261.933us | 30 |
| aten::cumsum | 0.17% | 216.370us | 0.25% | 315.877us | 31.588us | 7.441ms | 6.20% | 7.461ms | 746.100us | 10 |
| aten::add | 0.25% | 316.109us | 0.34% | 428.045us | 17.122us | 7.272ms | 6.06% | 7.272ms | 290.880us | 25 |
| aten::where | 0.62% | 789.732us | 1.70% | 2.154ms | 61.545us | 6.919ms | 5.76% | 10.195ms | 291.286us | 35 |
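A table like the one above can be produced with `torch.profiler`, roughly like this (a sketch; the `nn.GRU` here is just a stand-in for whichever model you are profiling):

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

model = nn.GRU(10, 100, batch_first=True)  # stand-in for the minGRU model
x = torch.randn(64, 1000, 10)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    with torch.no_grad():
        model(x)

# Top operators by total CPU time, analogous to the table above
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=6))
```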

I don't see where the authors get the 175x speedup, especially since they themselves recommend `logcumsumexp`, `where`, and so on. In my testing it's half as fast. And yes, I know we can't compare against PyTorch's highly optimized native code, but this big of a difference? Any clue why this is the case?
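For context on where those `logcumsumexp` / `cumsum` calls come from: the minGRU recurrence h_t = (1 - z_t) * h_{t-1} + z_t * h̃_t can be evaluated for all timesteps at once with a log-space (Heinsen-style) scan. Below is a minimal sketch of that idea, not the repo's exact code, and it assumes positive candidate states and initial hidden state for simplicity (the paper's g function handles the general case):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def mingru_sequential(gate, hidden, h0):
    """Reference recurrence: h_t = (1 - z_t) * h_{t-1} + z_t * h~_t."""
    z = torch.sigmoid(gate)
    h, out = h0, []
    for t in range(gate.shape[1]):
        h = (1 - z[:, t]) * h + z[:, t] * hidden[:, t]
        out.append(h)
    return torch.stack(out, dim=1)

def mingru_parallel(gate, hidden, h0):
    """Same recurrence via a log-space scan: h_t = a_t * h_{t-1} + b_t
    with a_t = 1 - z_t and b_t = z_t * h~_t, all assumed positive."""
    z = torch.sigmoid(gate)
    log_a = torch.log(1 - z)
    log_b = torch.log(z * hidden)
    # prepend h0 as a step with coefficient 1 (log 1 = 0)
    log_b = torch.cat([torch.log(h0).unsqueeze(1), log_b], dim=1)
    log_a = F.pad(log_a, (0, 0, 1, 0))
    a_star = torch.cumsum(log_a, dim=1)            # log running product of a
    log_h = a_star + torch.logcumsumexp(log_b - a_star, dim=1)
    return torch.exp(log_h)[:, 1:]                 # drop the h0 slot

batch, seq, dim = 2, 16, 8
gate = torch.randn(batch, seq, dim)
hidden = F.softplus(torch.randn(batch, seq, dim))  # keep candidates positive
h0 = F.softplus(torch.randn(batch, dim))

h_seq = mingru_sequential(gate, hidden, h0)
h_par = mingru_parallel(gate, hidden, h0)
# the two should agree up to floating-point error
```

The parallel version trades the sequential loop for a handful of elementwise ops plus `cumsum`/`logcumsumexp`, which is exactly why those kernels dominate the profile; on CPU (or at small sizes) that overhead can easily outweigh the benefit.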

gabrielspadon commented 5 days ago

Hi @KTibow and @lucidrains, great work. I am working on similar implementations and noticed the same thing @Fritschek mentioned: the speed does not hold up, to the point where the traditional GRU learns both better and faster. Could this be tied to a hyperparameter that must be tuned, to the dataset you used (mine is trajectories, lat/lon), or to the specs of the hardware you used? Thanks!

Fritschek commented 5 days ago

I'm in touch with one of the authors. He mentioned that they benchmarked against their own implementation of the classical GRU, to get a fair comparison without any optimization. That would explain a lot, I think. CPU/GPU differences probably also play a role. Furthermore, I tested longer sequences (in the few thousands), and minGRU gets better there (or rather, torch's GRU gets considerably worse).