Fritschek opened this issue 1 week ago (status: Open)
Hi @KTibow and @lucidrains, great work. I am working on similar implementations and noticed the same thing @Fritschek mentioned. The speed doesn't seem to hold up, to the point where the traditional GRU learns better and faster. Could this be tied to a hyperparameter that must be tuned, the dataset you used (mine is trajectories, lat/lon), or the specs of the hardware you ran on? Thanks!
I'm in touch with one of the authors; he mentioned that they benchmarked against their own implementation of the classical GRU, to have a fair comparison without any optimizations. That would explain a lot, I think. CPU/GPU differences probably also play a role. Furthermore, I tested it on longer sequences (a few thousand steps), and minGRU does better there (or rather, torch's GRU gets considerably worse).
MinGRU (without the LM layers) is considerably slower than standard nn.GRU. My test parameters were: input_size = 10, hidden_size = 100, seq_len = 1000, batch_size = 64.
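For reproducibility, a timing sketch along the lines of the test parameters above might look like this. The `MinGRUSeq` class here is a hypothetical minimal sequential minGRU written for illustration (layer names `to_z`/`to_h` are my own, not the repo's), compared against `nn.GRU`:

```python
import time
import torch
import torch.nn as nn

class MinGRUSeq(nn.Module):
    """Minimal sequential minGRU sketch: h_t = (1 - z_t) * h_{t-1} + z_t * h~_t."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.to_z = nn.Linear(input_size, hidden_size)  # gate projection
        self.to_h = nn.Linear(input_size, hidden_size)  # candidate projection

    def forward(self, x):
        # x: (batch, seq_len, input_size); minGRU's gate and candidate
        # depend only on x_t, not on h_{t-1}
        z = torch.sigmoid(self.to_z(x))
        h_tilde = self.to_h(x)
        h = torch.zeros(x.size(0), z.size(-1), device=x.device)
        outs = []
        for t in range(x.size(1)):  # the sequential loop is the slow part
            h = (1 - z[:, t]) * h + z[:, t] * h_tilde[:, t]
            outs.append(h)
        return torch.stack(outs, dim=1)

# Parameters from the comment above
input_size, hidden_size, seq_len, batch_size = 10, 100, 1000, 64
x = torch.randn(batch_size, seq_len, input_size)

gru = nn.GRU(input_size, hidden_size, batch_first=True)
mingru = MinGRUSeq(input_size, hidden_size)

with torch.no_grad():
    for name, fn in [("nn.GRU", lambda: gru(x)[0]), ("minGRU (loop)", lambda: mingru(x))]:
        t0 = time.perf_counter()
        out = fn()
        print(f"{name}: {time.perf_counter() - t0:.3f}s, output shape {tuple(out.shape)}")
```

On CPU the fused `nn.GRU` kernel typically wins by a wide margin over a Python-level loop like this, which is consistent with the numbers reported here.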
From my profiler, run in Google Colab, the top time sinks looked like this:
Profiling Results for MinGRU Model:
I don't see where the authors get the 175x speedup, especially since they also recommend `logcumsumexp`, `where`, and similar functions. In my testing, it's half as fast. And yes, I know we can't compare against torch's highly optimized code, but a difference this large? Any clue why this is the case?
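For reference, the `logcumsumexp` trick mentioned above replaces the sequential recurrence with a log-space cumulative scan (Heinsen-style). A simplified sketch, assuming positive coefficients (the actual implementation also has to handle negative candidate values), checked against the naive loop:

```python
import torch

def parallel_scan_log(log_a, log_b):
    # Computes h_t = a_t * h_{t-1} + b_t (h_0 = 0) for positive a_t, b_t
    # without a Python loop, working in log space for stability:
    #   log h_t = cumsum(log a)_t + logcumsumexp(log b - cumsum(log a))_t
    a_star = torch.cumsum(log_a, dim=-1)  # log prod_{k<=t} a_k
    log_h = a_star + torch.logcumsumexp(log_b - a_star, dim=-1)
    return torch.exp(log_h)

torch.manual_seed(0)
T = 1000
# a_t plays the role of (1 - z_t), kept in (0, 1); b_t the role of z_t * h~_t,
# assumed positive here for the simplified log-space form
a = torch.rand(T, dtype=torch.float64) * 0.9 + 0.05
b = torch.rand(T, dtype=torch.float64) + 1e-3

# Reference: the naive sequential recurrence
h_seq = torch.zeros(T, dtype=torch.float64)
h = torch.tensor(0.0, dtype=torch.float64)
for t in range(T):
    h = a[t] * h + b[t]
    h_seq[t] = h

h_par = parallel_scan_log(torch.log(a), torch.log(b))
print(torch.allclose(h_seq, h_par))  # True
```

Note that even though this avoids the sequential loop, `logcumsumexp` itself is not especially cheap on CPU, which may partly explain why the expected speedup only shows up on GPU with long sequences.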