lucidrains / minGRU-pytorch

Implementation of the proposed minGRU in PyTorch
MIT License

minGRU only half as fast as torch GRU in tests #13

Open Fritschek opened 1 week ago

Fritschek commented 1 week ago

minGRU (without the LM layers) is considerably slower than a standard `nn.GRU`. My test parameters were: `input_size = 10`, `hidden_size = 100`, `seq_len = 1000`, `batch_size = 64`.
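For reference, a minimal timing harness along these lines (parameter values match the ones above; `nn.GRU` is timed as the baseline, and you would swap in whatever minGRU module you are testing):

```python
import time
import torch
import torch.nn as nn

# Parameters from the benchmark described above
input_size, hidden_size, seq_len, batch_size = 10, 100, 1000, 64

def time_forward(model, x, n_warmup=3, n_iters=10):
    """Average wall-clock time of one forward pass, with warmup."""
    with torch.no_grad():
        for _ in range(n_warmup):
            model(x)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_iters):
            model(x)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_iters

x = torch.randn(batch_size, seq_len, input_size)
gru = nn.GRU(input_size, hidden_size, batch_first=True)
print(f"nn.GRU: {time_forward(gru, x) * 1e3:.1f} ms/forward")
# Swap in the minGRU module here for the comparison
```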

From my profiler run in Google Colab, the top time sinks looked like this:

Profiling Results for MinGRU Model:

| Name | Self CPU % | Self CPU | CPU total % | CPU total | CPU time avg | Self CUDA | Self CUDA % | CUDA total | CUDA time avg | # of Calls |
|---|---|---|---|---|---|---|---|---|---|---|
| aten::_logcumsumexp | 0.22% | 281.061us | 0.56% | 705.686us | 47.046us | 15.687ms | 13.07% | 15.822ms | 1.055ms | 15 |
| aten::neg | 0.50% | 632.931us | 0.73% | 930.204us | 16.913us | 12.112ms | 10.09% | 12.112ms | 220.218us | 55 |
| aten::flip | 0.45% | 575.464us | 1.10% | 1.387ms | 46.243us | 7.588ms | 6.32% | 7.858ms | 261.933us | 30 |
| aten::cumsum | 0.17% | 216.370us | 0.25% | 315.877us | 31.588us | 7.441ms | 6.20% | 7.461ms | 746.100us | 10 |
| aten::add | 0.25% | 316.109us | 0.34% | 428.045us | 17.122us | 7.272ms | 6.06% | 7.272ms | 290.880us | 25 |
| aten::where | 0.62% | 789.732us | 1.70% | 2.154ms | 61.545us | 6.919ms | 5.76% | 10.195ms | 291.286us | 35 |
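A table like the one above can be produced with `torch.profiler`, roughly like this (a sketch; the `nn.GRU` here is just a stand-in for whichever model you are profiling):

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

model = nn.GRU(10, 100, batch_first=True)  # stand-in for the minGRU model
x = torch.randn(64, 1000, 10)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    with torch.no_grad():
        model(x)

# Top operators by total CPU time, analogous to the table above
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=6))
```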

I don't see where the authors get the 175x speedup, especially since they themselves recommend `logcumsumexp`, `where`, and so on. In my testing it's half as fast. And yes, I know we can't compare against PyTorch's highly optimized native code, but this big of a difference? Any clue why this is the case?
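For context on where those `logcumsumexp` / `cumsum` calls come from: the minGRU recurrence h_t = (1 - z_t) * h_{t-1} + z_t * h̃_t can be evaluated for all timesteps at once with a log-space (Heinsen-style) scan. Below is a minimal sketch of that idea, not the repo's exact code, and it assumes positive candidate states and initial hidden state for simplicity (the paper's g function handles the general case):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def mingru_sequential(gate, hidden, h0):
    """Reference recurrence: h_t = (1 - z_t) * h_{t-1} + z_t * h~_t."""
    z = torch.sigmoid(gate)
    h, out = h0, []
    for t in range(gate.shape[1]):
        h = (1 - z[:, t]) * h + z[:, t] * hidden[:, t]
        out.append(h)
    return torch.stack(out, dim=1)

def mingru_parallel(gate, hidden, h0):
    """Same recurrence via a log-space scan: h_t = a_t * h_{t-1} + b_t
    with a_t = 1 - z_t and b_t = z_t * h~_t, all assumed positive."""
    z = torch.sigmoid(gate)
    log_a = torch.log(1 - z)
    log_b = torch.log(z * hidden)
    # prepend h0 as a step with coefficient 1 (log 1 = 0)
    log_b = torch.cat([torch.log(h0).unsqueeze(1), log_b], dim=1)
    log_a = F.pad(log_a, (0, 0, 1, 0))
    a_star = torch.cumsum(log_a, dim=1)            # log running product of a
    log_h = a_star + torch.logcumsumexp(log_b - a_star, dim=1)
    return torch.exp(log_h)[:, 1:]                 # drop the h0 slot

batch, seq, dim = 2, 16, 8
gate = torch.randn(batch, seq, dim)
hidden = F.softplus(torch.randn(batch, seq, dim))  # keep candidates positive
h0 = F.softplus(torch.randn(batch, dim))

h_seq = mingru_sequential(gate, hidden, h0)
h_par = mingru_parallel(gate, hidden, h0)
# the two should agree up to floating-point error
```

The parallel version trades the sequential loop for a handful of elementwise ops plus `cumsum`/`logcumsumexp`, which is exactly why those kernels dominate the profile; on CPU (or at small sizes) that overhead can easily outweigh the benefit.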

gabrielspadon commented 5 days ago

Hi @KTibow and @lucidrains, great work. I am working on similar implementations and noticed the same thing @Fritschek mentioned: the speed does not hold up, to the point where the traditional GRU learns both better and faster. Could this be tied to a hyperparameter that must be tuned, to the dataset you used (mine is trajectories, lat/lon), or to the specs of the hardware you used? Thanks!

Fritschek commented 5 days ago

I'm in touch with one of the authors. He mentioned that they benchmarked against their own implementation of the classical GRU, to get a fair comparison without any optimization. That would explain a lot, I think. CPU/GPU differences probably also play a role. Furthermore, I tested longer sequences (in the few thousands), and minGRU gets better there (or rather, torch's GRU gets considerably worse).