ridgerchu / matmulfreellm

Implementation for MatMul-free LM.
Apache License 2.0

MLGRU #26

Open AACengineer opened 1 week ago

AACengineer commented 1 week ago

From the computational formula of MLGRU, it appears that parallelism across tokens is lost during the prefill phase, whereas Transformer++ maintains token-level parallelism (see the sketch after the questions below). I have two questions:

  1. Does "latency" in Figure 4(d) mean first-token latency?
  2. In Figure 4(d), does Transformer++ make use of token parallelism?
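For reference, here is a simplified sketch of the recurrence I am referring to, written with ordinary dense matmuls standing in for the paper's ternary (MatMul-free) weights; the gate layout, names, and shapes are illustrative assumptions, not the repository's actual module:

```python
import torch
import torch.nn.functional as F

def mlgru_prefill_naive(x, W_f, W_c, W_g, W_o):
    # x: (T, d_in) prefill tokens. Each h_t depends on h_{t-1}, so this loop
    # runs token by token; softmax attention has no such dependence and can
    # process all T prefill tokens in parallel.
    T = x.shape[0]
    h = torch.zeros(W_c.shape[1])
    outputs = []
    for t in range(T):
        f_t = torch.sigmoid(x[t] @ W_f)      # forget gate
        c_t = F.silu(x[t] @ W_c)             # candidate state
        h = f_t * h + (1 - f_t) * c_t        # sequential dependence on h_{t-1}
        g_t = torch.sigmoid(x[t] @ W_g)      # output gate
        outputs.append((g_t * h) @ W_o)
    return torch.stack(outputs)

if __name__ == "__main__":
    T, d_in, d_h = 8, 16, 32
    x = torch.randn(T, d_in)
    W_f, W_c, W_g = (torch.randn(d_in, d_h) for _ in range(3))
    W_o = torch.randn(d_h, d_in)
    print(mlgru_prefill_naive(x, W_f, W_c, W_g, W_o).shape)  # torch.Size([8, 16])
```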
yzhangcs commented 1 week ago

@AACengineer Hi, Transformer++ also decodes autoregressively. During training, Transformer++ can be fully parallelized across tokens, but we can likewise use a parallel scan to recover token parallelism for the recurrence. Because the linear-time GRU requires far fewer FLOPs than self-attention, our training efficiency can be much better. Also, the GRU needs no KV cache, so the decoding space complexity is O(1); a small sketch of both points follows below.
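To make the last two points concrete, here is a minimal sketch (not the repository's implementation; names and shapes are illustrative) assuming the simplified recurrence h_t = f_t * h_{t-1} + (1 - f_t) * c_t: the prefill is a scan over an associative affine combine, so it parallelizes across tokens, while decoding carries only a fixed-size hidden state and needs no KV cache.

```python
import torch

def combine(left, right):
    # Associative operator for composing affine maps h -> a*h + b, which is
    # what makes the recurrence amenable to a parallel scan.
    a1, b1 = left
    a2, b2 = right
    return a2 * a1, a2 * b1 + b2

def prefill_scan(f, c):
    # f, c: (T, d) forget gates and candidate states for a T-token prefill.
    # Each step is the affine map h -> f_t * h + (1 - f_t) * c_t; composing
    # these maps is associative, so the whole prefill can be computed by a
    # parallel scan in O(log T) depth on parallel hardware.
    T = f.shape[0]
    elems = [(f[t], (1 - f[t]) * c[t]) for t in range(T)]
    acc = elems[0]
    out = [acc[1]]            # h_0, taking h_{-1} = 0
    for t in range(1, T):     # sequential reference; combine() is scan-ready
        acc = combine(acc, elems[t])
        out.append(acc[1])
    return torch.stack(out)

def decode_step(h_prev, f_t, c_t):
    # One autoregressive step: only the fixed-size hidden state is carried,
    # so decoding memory stays O(1) in sequence length (no KV cache).
    return f_t * h_prev + (1 - f_t) * c_t

if __name__ == "__main__":
    T, d = 8, 4
    f = torch.sigmoid(torch.randn(T, d))
    c = torch.randn(T, d)
    h_scan = prefill_scan(f, c)
    # Check that the scan matches plain step-by-step decoding.
    h = torch.zeros(d)
    for t in range(T):
        h = decode_step(h, f[t], c[t])
    assert torch.allclose(h, h_scan[-1], atol=1e-6)
```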