Closed zhshi0816 closed 3 years ago
The timing_signal is a type of positional embedding used with the image transformer; we use the default from the original implementation. See https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/image_transformer.py#L206, https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py#L504, and https://github.com/tensorflow/tensor2tensor/blob/5623deb79cfcd28f8f8c5463b58b5bd76a81fd0d/tensor2tensor/layers/common_attention.py#L408
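For intuition, here is a minimal NumPy sketch of a sinusoidal timing signal in the style of tensor2tensor's `get_timing_signal_1d`. This is a hypothetical re-implementation for illustration, not the original code; the function name, parameter defaults, and the add-rather-than-concatenate usage mirror my reading of the linked source.

```python
import numpy as np

def timing_signal_1d(length, channels, min_timescale=1.0, max_timescale=1.0e4):
    """Sinusoidal positional signal of shape (length, channels).

    Each channel pair uses a different geometric timescale, so every
    position gets a unique, smoothly varying fingerprint that the model
    can use to recover pixel order. Sketch only; assumes `channels` is even.
    """
    position = np.arange(length, dtype=np.float64)
    num_timescales = channels // 2
    log_timescale_increment = (
        np.log(max_timescale / min_timescale) / max(num_timescales - 1, 1))
    inv_timescales = min_timescale * np.exp(
        -np.arange(num_timescales, dtype=np.float64) * log_timescale_increment)
    scaled_time = position[:, None] * inv_timescales[None, :]
    # First half of the channels holds sines, second half cosines.
    return np.concatenate([np.sin(scaled_time), np.cos(scaled_time)], axis=1)

# The signal is *added* to the input representations, not concatenated:
x = np.zeros((16, 64))               # dummy pixel embeddings
x = x + timing_signal_1d(16, 64)
```

Because the signal is added element-wise, no extra channels are needed and the rest of the network is unchanged.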
For the masking in the attention, the masking occurs here: https://github.com/sahajgarg/image_transformer/blob/d33b8d007299b434c62e068e1dad35b8a2688212/image_transformer.py#L303 This generates an upper triangular mask on the logits of the attention, preventing any information from future pixels from reaching the current pixel. Because this masking is applied, the training code can evaluate the conditional probability of each pixel given all the previous pixels simultaneously.
Hi,
Thanks for sharing your code, but it seems there are two problems with it. Please ignore this if I am wrong.
It seems you forgot to add a position embedding to the input representations. Maybe your add_timing_signal does this job, but I am not sure. By the way, what is the add_timing_signal function for?
In your training code, you forgot to generate a mask to mask off the pixels that have not yet been generated. This means an earlier pixel can get information from later pixels.