> Or are we just predicting the last z_index given the features and the previous z_indices?
Yes, we are. This is how the transformer, or any other network (e.g. an RNN) that autoregressively generates a sequence of tokens, is trained.
During training (and inference), the model predicts the next token given the current one and all previous ones.
Schematically:
```
c - condition token
z - data token
cond_size = 4

input:                cccczzzz
input[:-1]:           cccczzz
Transformer:          |||||||
logits:                ccczzzz
logits[cond_size-1:]:     zzzz
loss:                     (compare)
targets:                  zzzz
```
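
Translated into a minimal, self-contained PyTorch sketch (all names here, such as `TinyCausalLM`, are made up for illustration and are not this repo's code), the same training step looks like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embed_dim, cond_size, seq_len = 1024, 64, 4, 8

class TinyCausalLM(nn.Module):
    """Stand-in for the actual transformer: token/positional embeddings,
    one causally masked layer, and a projection back to the vocabulary."""
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, embed_dim)
        self.pos_emb = nn.Parameter(torch.zeros(1, seq_len, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.head = nn.Linear(embed_dim, vocab_size)

    def forward(self, idx):
        T = idx.shape[1]
        # causal mask: position i may only attend to positions <= i
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.encoder(self.tok_emb(idx) + self.pos_emb[:, :T], mask=mask)
        return self.head(h)  # one next-token logit vector per input position

model = TinyCausalLM()
c_indices = torch.randint(vocab_size, (1, cond_size))            # cccc
z_indices = torch.randint(vocab_size, (1, seq_len - cond_size))  # zzzz
cz = torch.cat([c_indices, z_indices], dim=1)                    # cccczzzz

logits = model(cz[:, :-1])           # feed cccczzz -> 7 predictions
logits = logits[:, cond_size - 1:]   # keep the 4 that should predict the z tokens
loss = F.cross_entropy(logits.reshape(-1, vocab_size), z_indices.reshape(-1))
```

The key point is the off-by-one slicing: feeding `input[:-1]` and keeping `logits[cond_size-1:]` lines each remaining logit up with the z token it is supposed to predict, so the loss covers all z tokens at once, not just the last one.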
Hello,
I am trying to understand these lines. Could you elaborate further on how the transformer is trained here?
```python
# target includes all sequence elements (no need to handle first one
# differently because we are conditioning)
```
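
For context, the lines that follow this comment in the repo do roughly the following (my paraphrase; exact names and details may differ):

```python
# rough paraphrase of the surrounding training code, not a verbatim copy
target = z_indices                                # every z token is a target
logits, _ = self.transformer(cz_indices[:, :-1])  # feed all but the last token
logits = logits[:, c_indices.shape[1] - 1:]       # drop the conditioning outputs
```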
Using the features and all of the indices, what exactly are we trying to predict? Isn't the target all the z_indices that we are already giving to the transformer? Or are we just predicting the last z_index given the features and the previous z_indices?