Closed ggluo closed 4 years ago
I have the same question — I'm unclear on where the masking is implemented.
For the masking in the attention, it occurs here: https://github.com/sahajgarg/image_transformer/blob/d33b8d007299b434c62e068e1dad35b8a2688212/image_transformer.py#L303 This applies an upper-triangular mask to the attention logits, preventing information from future pixels from reaching the current pixel. As long as this masking is applied, the training code can evaluate the conditional probability of each pixel given all previous pixels simultaneously.
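To illustrate the idea (this is a minimal NumPy sketch, not the repo's exact code), masking works by adding a large negative value to the logits at all positions strictly above the diagonal, so that after the softmax each pixel places effectively zero weight on future pixels:

```python
import numpy as np

def causal_attention_weights(logits):
    """logits: (seq_len, seq_len) array where logits[i, j] is the attention
    score of query pixel i on key pixel j. Returns masked attention weights."""
    seq_len = logits.shape[0]
    # Boolean mask of "future" positions: True strictly above the diagonal.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    # Replace future logits with a large negative number so softmax ~ 0 there.
    masked = np.where(future, -1e9, logits)
    # Row-wise softmax (numerically stabilized).
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

Because each row of the resulting weight matrix is zero for all later positions, applying these weights to the value vectors lets every pixel's conditional distribution be computed in a single parallel pass during training, while still respecting the autoregressive ordering.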
Hi, Sahaj. Maybe a dumb question: can I ask how the masked attention weight is implemented in your script?