Closed: MeWannaSleep closed this issue 11 months ago
Hi,
The notation in line 5 is shorthand for "we are sampling all the tokens in parallel". Technically it is not a single probability distribution: for each i-th token you have a distribution p_i over the vocabulary, conditioned on all the past tokens. Also, the conditioning y_{1:m} is shifted right. As explained at the end of Appendix A, it is pseudocode and some details are omitted.
If you look at the implementation, it basically consists of sampling from the model https://github.com/teelinsan/parallel-decoding/blob/19769bf5aa2a41f02d6b0344aa1ee88e3d59cfa4/src/ipi/decoders/gs_jacobi.py#L68-L75 and then taking the argmax over dimension -1 https://github.com/teelinsan/parallel-decoding/blob/19769bf5aa2a41f02d6b0344aa1ee88e3d59cfa4/src/ipi/decoders/gs_jacobi.py#L82
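To make that concrete, here is a minimal sketch of one parallel greedy step, not the repo's exact code: it assumes a HuggingFace-style encoder-decoder `model` whose output exposes a `.logits` tensor of shape `[batch, seq_len, vocab_size]`, and the names `parallel_greedy_step`, `input_ids`, and `decoder_input_ids` are illustrative.

```python
import torch

@torch.no_grad()
def parallel_greedy_step(model, input_ids, decoder_input_ids):
    # One forward pass scores every target position in parallel:
    # logits[:, i, :] is the per-position distribution p_i over the
    # vocabulary, conditioned on the right-shifted tokens before position i.
    logits = model(input_ids=input_ids,
                   decoder_input_ids=decoder_input_ids).logits
    # "Sampling" is done greedily: argmax over the vocabulary dimension (-1),
    # independently for each position, mirroring the gs_jacobi.py lines above.
    return logits.argmax(dim=-1)
```

So no single distribution over whole sequences is ever materialized; each position just gets its own softmax over the vocabulary and an independent argmax.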
Hope this helps.
Andrea
In Algorithm 1, line 5, is that probability supposed to sum to exactly 1? Am I missing or mistaking something here?