karpathy / build-nanogpt

Video+code lecture on building nanoGPT from scratch

Is dataloader making optimal batches? #31

paraschopra closed this issue 5 months ago

paraschopra commented 5 months ago

I have a question: in the current form of making batches, aren't we throwing away information? For example, we take an input and reshape it into a B × T matrix. For each row, the first token is blind to the previous tokens, because we never put the sequence that spans the row boundary into the training loop. Wouldn't a better way to build the dataloader be something like a moving window?
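To make the two schemes in the question concrete, here is a minimal sketch in plain Python. It assumes the loader slices B*T + 1 consecutive tokens into B non-overlapping rows, as the question describes; the function names `chunked_batch` and `sliding_windows` and the toy token ids are hypothetical and are not taken from the repository.

```python
def chunked_batch(tokens, B, T, pos=0):
    """Non-overlapping batch: B rows of T inputs, targets shifted by one token."""
    buf = tokens[pos : pos + B * T + 1]
    x = [buf[r * T : (r + 1) * T] for r in range(B)]
    y = [buf[r * T + 1 : (r + 1) * T + 1] for r in range(B)]
    return x, y


def sliding_windows(tokens, T, stride=1):
    """Overlapping (input, target) pairs, one pair per start offset."""
    pairs = []
    for start in range(0, len(tokens) - T, stride):
        inputs = tokens[start : start + T]
        targets = tokens[start + 1 : start + T + 1]
        pairs.append((inputs, targets))
    return pairs


tokens = list(range(9))  # toy token ids 0..8 (illustrative data only)

x, y = chunked_batch(tokens, B=2, T=4)
print(x)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(y)  # [[1, 2, 3, 4], [5, 6, 7, 8]]

windows = sliding_windows(tokens, T=4)
print(len(windows))  # 5
print(windows[1])    # ([1, 2, 3, 4], [2, 3, 4, 5])
```

In the chunked batch, token 4 opens the second row, so the prediction of token 5 there is conditioned on token 4 alone. In the moving-window version, the pair starting at offset 1 predicts token 5 from the context 1, 2, 3, 4. The trade-off is cost: with a stride of 1, the window scheme produces roughly T times as many training positions per pass over the same data.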

paraschopra commented 5 months ago

I got the answer here: https://www.youtube.com/watch?v=l8pRSuU81PU&lc=UgxBEJSMh2LngUmeJiR4AaABAg