This paper proposes replacing the transformer's softmax self-attention with random feature attention (RFA), a kernel-approximation method. The calculation is shown below:
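Roughly, the computation (as I reconstruct it from the paper; $D$ is the number of random features and the $\mathbf{w}_i$ are sampled from $\mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$, so the notation may differ slightly from the original equations) is:

$$\phi(\mathbf{x}) = \sqrt{\tfrac{1}{D}}\,\bigl[\sin(\mathbf{w}_1\cdot\mathbf{x}),\ldots,\sin(\mathbf{w}_D\cdot\mathbf{x}),\;\cos(\mathbf{w}_1\cdot\mathbf{x}),\ldots,\cos(\mathbf{w}_D\cdot\mathbf{x})\bigr]^\top$$

$$\mathrm{RFA}(\mathbf{q}_t,\{\mathbf{k}_i\},\{\mathbf{v}_i\}) = \sum_i \frac{\phi(\mathbf{q}_t)^\top\phi(\mathbf{k}_i)}{\sum_j \phi(\mathbf{q}_t)^\top\phi(\mathbf{k}_j)}\,\mathbf{v}_i^\top = \frac{\phi(\mathbf{q}_t)^\top\sum_i \phi(\mathbf{k}_i)\otimes\mathbf{v}_i}{\phi(\mathbf{q}_t)\cdot\sum_j \phi(\mathbf{k}_j)}$$

Pulling $\phi(\mathbf{q}_t)$ out of the sums is what makes attention linear in sequence length and, in the causal/recurrent form, constant-memory per decoding step.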
where equation 4 is supported by the following theorem:
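If I'm reading it right, this is the classical random Fourier feature result (Rahimi & Recht, 2007): for $\mathbf{w}_i\sim\mathcal{N}(\mathbf{0},\sigma^2\mathbf{I})$,

$$\mathbb{E}_{\mathbf{w}}\bigl[\phi(\mathbf{x})^\top\phi(\mathbf{y})\bigr] = \exp\!\Bigl(-\frac{\lVert\mathbf{x}-\mathbf{y}\rVert^2}{2\sigma^2}\Bigr),$$

and since $\exp(\mathbf{x}\cdot\mathbf{y}) = \exp(\tfrac{\lVert\mathbf{x}\rVert^2}{2})\exp(\tfrac{\lVert\mathbf{y}\rVert^2}{2})\exp(-\tfrac{\lVert\mathbf{x}-\mathbf{y}\rVert^2}{2})$, the softmax numerator $\exp(\mathbf{q}_t\cdot\mathbf{k}_i)$ can be estimated by a dot product of features. A quick NumPy sanity check of the approximation (my own sketch, not the authors' code; `d`, `D`, and `phi` are just illustrative names):

```python
# Sanity check: random Fourier features approximate the Gaussian kernel
# exp(-||x - y||^2 / 2), which underlies the softmax approximation above.
import numpy as np

rng = np.random.default_rng(0)
d, D = 64, 4096                      # input dim, number of random features
x = rng.normal(size=d) / np.sqrt(d)
y = rng.normal(size=d) / np.sqrt(d)

W = rng.normal(size=(D, d))          # rows w_i ~ N(0, I), i.e. sigma = 1

def phi(v):
    # 2D-dimensional feature map [sin(Wv); cos(Wv)] / sqrt(D)
    return np.concatenate([np.sin(W @ v), np.cos(W @ v)]) / np.sqrt(D)

exact = np.exp(-np.sum((x - y) ** 2) / 2)   # Gaussian kernel value
approx = phi(x) @ phi(y)                    # random feature estimate
print(f"exact={exact:.4f}  approx={approx:.4f}")  # the two should be close
```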
Here are the results: