Closed lkikinn closed 10 months ago
This is the result of the precision-speed trade-off.
I understand that some people say that the act part can be seen as attention weight, and the whole can be seen as a process of a simple attention mechanism, like a * w + b.I don't know if that's right。
Yes, the implementation of inject is a variant of attention, and of course the specific operators used to implement inject can be changed according to the actual situation, without being limited to the attention structure
Why is the Inject Module divided into two parts?