I have a question: in the current way of making batches, aren't we throwing away information? E.g. we take an input and reshape it into a B * T matrix. Now, for each row, the first token is blind to the tokens that came before it, since we never put that preceding sequence into the training loop as its context. Wouldn't a better way to build the dataloader be something like a moving window?
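To make the comparison concrete, here's a minimal sketch in plain Python of the two layouts I mean. The function names (`chunked_batch`, `sliding_windows`) and the `stride` parameter are my own, and I'm assuming the current loader takes B * T + 1 consecutive tokens and reshapes them into rows, so treat this as an illustration rather than the actual dataloader code.

```python
def chunked_batch(tokens, B, T, pos=0):
    """Non-overlapping layout (assumed): B*T+1 tokens cut into B rows of length T."""
    buf = tokens[pos : pos + B * T + 1]
    x = [buf[i * T : (i + 1) * T] for i in range(B)]          # inputs
    y = [buf[i * T + 1 : (i + 1) * T + 1] for i in range(B)]  # targets, shifted by one
    return x, y


def sliding_windows(tokens, T, stride=1):
    """Moving window: each start offset (stepping by `stride`) gives one (x, y) row."""
    last_start = len(tokens) - T - 1  # last offset that still has T inputs + 1 target
    return [
        (tokens[s : s + T], tokens[s + 1 : s + T + 1])
        for s in range(0, last_start + 1, stride)
    ]


tokens = list(range(10))  # stand-in for token ids

x, y = chunked_batch(tokens, B=2, T=4)
# Row 1 starts at token 4, so target 5 is only ever predicted from the context [4].
print(x)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(y)  # [[1, 2, 3, 4], [5, 6, 7, 8]]

windows = sliding_windows(tokens, T=4)
# With the moving window, target 5 also appears at the end of a row: context [1, 2, 3, 4].
print(windows[1])    # ([1, 2, 3, 4], [2, 3, 4, 5])
print(len(windows))  # 6
```

With `stride=1` the moving window gives roughly T times as many rows as the chunked layout, and they overlap heavily, so that's the cost side of the trade-off I'm asking about.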