Closed: MeWannaSleep closed this issue 11 months ago
Hi,
The notation in line 5 is shorthand for "we are sampling all the tokens in parallel". Technically it is not a single probability distribution: for each i-th token you have a distribution p_i over the vocabulary, conditioned on all the past tokens. Also, the conditioning y_{1:m} is shifted right. As explained at the end of Appendix A, it is pseudocode and some details are omitted.
If you look at the implementation, it basically consists of sampling from the model https://github.com/teelinsan/parallel-decoding/blob/19769bf5aa2a41f02d6b0344aa1ee88e3d59cfa4/src/ipi/decoders/gs_jacobi.py#L68-L75 and then taking the argmax over dimension -1 https://github.com/teelinsan/parallel-decoding/blob/19769bf5aa2a41f02d6b0344aa1ee88e3d59cfa4/src/ipi/decoders/gs_jacobi.py#L82
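To make that concrete, here is a minimal sketch of one parallel greedy step, not the repo's exact code: it assumes a HuggingFace-style encoder-decoder `model` whose output exposes a `.logits` tensor of shape `[batch, seq_len, vocab_size]`, and the names `parallel_greedy_step`, `input_ids`, and `decoder_input_ids` are illustrative.

```python
import torch

@torch.no_grad()
def parallel_greedy_step(model, input_ids, decoder_input_ids):
    # One forward pass scores every target position in parallel:
    # logits[:, i, :] is the per-position distribution p_i over the
    # vocabulary, conditioned on the right-shifted tokens before position i.
    logits = model(input_ids=input_ids,
                   decoder_input_ids=decoder_input_ids).logits
    # "Sampling" is done greedily: argmax over the vocabulary dimension (-1),
    # independently for each position, mirroring the gs_jacobi.py lines above.
    return logits.argmax(dim=-1)
```

So no single distribution over whole sequences is ever materialized; each position just gets its own softmax over the vocabulary and an independent argmax.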
Hope this helps.
Andrea
In Algorithm 1, line 5, is that probability supposed to sum to exactly 1? Am I missing or mistaking something here?