ridgerchu / SpikeGPT

Implementation of "SpikeGPT: Generative Pre-trained Language Model with Spiking Neural Networks"
BSD 2-Clause "Simplified" License

Linking paper and code #11

Closed adirajagopal closed 9 months ago

adirajagopal commented 10 months ago

Hi, I had a couple of questions about the paper and how it links to the code here.

1. Do you have any materials on how you derived Eq.10 in the paper from Eq.4?
2. I'm also a little unclear on how the CUDA function `kernel_forward` in `wkv_cuda.cu` implements Eq.10. Could you provide some pointers on that, please?

Thanks!

ridgerchu commented 9 months ago

Hi!

For the first question, please refer to Fig.2. In this structure, 'W' can be viewed as a convolution kernel of the same size as 'K/V', which turns the serial recurrence into a convolution that can be computed in parallel. We have an animation that illustrates this concept; you can find it in this talk recording between 25:00 and 28:00, where I discuss this particular problem in detail.
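To make the convolution view concrete, here is a minimal sketch in plain Python. It assumes an RWKV-style WKV with a scalar decay `w` (negative) and a current-token bonus `u` (both values below are made up for illustration, not taken from the paper): the past-token sums in the serial recurrence are exactly a 1-D convolution of `e^k * v` (and `e^k`) with the decay kernel `[e^{0w}, e^{1w}, ...]`, which is the sense in which 'W' acts as a kernel the same length as K/V.

```python
import math

def conv(x, kern):
    # Plain 1-D convolution: out[t] = sum_{i<=t} x[i] * kern[t-i].
    return [sum(x[i] * kern[t - i] for i in range(t + 1)) for t in range(len(x))]

T = 6
w, u = -0.5, 0.3                      # hypothetical decay / bonus values
k = [0.2, -0.1, 0.5, 0.0, -0.3, 0.4]  # toy key sequence
v = [1.0, 2.0, -1.0, 0.5, 0.0, 3.0]   # toy value sequence

# 'W' viewed as a kernel the same length as K/V: [e^{0w}, e^{1w}, ...]
decay = [math.exp(w * n) for n in range(T)]
ekv = [math.exp(ki) * vi for ki, vi in zip(k, v)]
ek = [math.exp(ki) for ki in k]

# Parallel form: past-token numerator/denominator via one convolution,
# shifted by one step so token t only sees tokens i < t.
num_past = [0.0] + conv(ekv, decay)[:T - 1]
den_past = [0.0] + conv(ek, decay)[:T - 1]
y_par = [(n_p + math.exp(u + kt) * vt) / (d_p + math.exp(u + kt))
         for n_p, d_p, kt, vt in zip(num_past, den_past, k, v)]

# Serial form for comparison: the explicit double loop over past tokens.
y_ser = []
for t in range(T):
    n = math.exp(u + k[t]) * v[t]
    d = math.exp(u + k[t])
    for i in range(t):
        c = math.exp((t - 1 - i) * w + k[i])
        n += c * v[i]
        d += c
    y_ser.append(n / d)

assert all(abs(a - b) < 1e-9 for a, b in zip(y_ser, y_par))
```

The convolution can of course be done with an FFT or a batched matrix product in practice; the point is only that the serial sum and the kernel view give the same output.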

Regarding the second point, language models often encounter numerical overflow during training because of the exponentials involved. To address this, we introduced an extra variable 'pp' that tracks the running maximum exponent, so all exponentials are evaluated relative to it. The overall functionality still matches Eq.10. For a more comprehensive explanation, please consult our RWKV EMNLP paper, particularly Eq.23-28, where this is discussed thoroughly.
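Here is a small Python sketch of that trick, mirroring the structure of `kernel_forward` (the variable names `aa`, `bb`, `pp` follow the kernel; the decay/bonus values are made up). `aa`/`bb` hold the running numerator/denominator, stored relative to the running maximum exponent `pp`, so no individual `exp` can overflow:

```python
import math

def wkv_naive(w, u, k, v):
    # Direct evaluation of the Eq.10-style WKV; overflows for large k.
    y = []
    for t in range(len(k)):
        n = math.exp(u + k[t]) * v[t]
        d = math.exp(u + k[t])
        for i in range(t):
            c = math.exp((t - 1 - i) * w + k[i])
            n += c * v[i]
            d += c
        y.append(n / d)
    return y

def wkv_stable(w, u, k, v):
    # Same quantity, computed serially with the 'pp' overflow guard:
    # every exponent is shifted by the running maximum before exp().
    aa, bb, pp = 0.0, 0.0, float('-inf')
    y = []
    for t in range(len(k)):
        ww = u + k[t]
        p = max(pp, ww)
        e1, e2 = math.exp(pp - p), math.exp(ww - p)
        y.append((e1 * aa + e2 * v[t]) / (e1 * bb + e2))
        ww = pp + w                      # decay the stored state by w
        p = max(ww, k[t])
        e1, e2 = math.exp(ww - p), math.exp(k[t] - p)
        aa = e1 * aa + e2 * v[t]
        bb = e1 * bb + e2
        pp = p
    return y

w, u = -0.4, 0.2                 # hypothetical decay / bonus values
k = [0.1, 1.5, -0.7, 2.0]
v = [1.0, -2.0, 0.5, 3.0]

# Both forms agree on moderate inputs...
assert all(abs(a - b) < 1e-9
           for a, b in zip(wkv_naive(w, u, k, v), wkv_stable(w, u, k, v)))

# ...but only the 'pp' version survives large keys (exp(800) overflows float64).
big_k = [ki + 800 for ki in k]
assert all(math.isfinite(x) for x in wkv_stable(w, u, big_k, v))
```

Note that shifting every `k` by a constant leaves the WKV ratio unchanged, which is exactly why referencing exponents to the running maximum `pp` is safe.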