fix the double-jump for out-of-bound check in DataLoaderLite

karpathy / build-nanogpt

Video+code lecture on building nanoGPT from scratch

3.44k stars 473 forks source link

fix the double-jump for out-of-bound check in DataLoaderLite #23

Closed fatemi closed 2 months ago

fatemi commented 3 months ago

self.current_position has already been moved at L246 and is ready for the next batch. The condition at L248 looks at yet another move, which is not correct [this way, it basically dismisses an additional block_size at the end of file]. It should simply look at currect_position + 1 to account only for the additional target token.

lukasugar commented 3 months ago

I think we should check not for +1, but for B * T +1. Because we want to make sure that the next batch can be fetched. We moved the self.current_position to the starting position of the new batch, so we need to check if that new batch can be fetched.

I've made a PR for that change: https://github.com/karpathy/build-nanogpt/pull/42

fatemi commented 3 months ago

Due to line 246, all the processes can fetch their next batch, no need for another B * T.

lukasugar commented 3 months ago

I think that's not the case. Imagine this scenario:

there's 94 tokens in the shard, so len(self.tokens) equals 94
num_processes = 2
B=1
T=5

The first process will read tokens 0:6, 10:16... After 8th step, line 246 will be self.current_position += B * T * self.num_processes which is 80 + 1*5*2 = 90. In line 248, you'd check that 90 + 1 > 94 => False, so you'd try to fetch tokens[90:96] which is out of bounds.

Do you agree?