karpathy / build-nanogpt

Video+code lecture on building nanoGPT from scratch

current position should be 0 at the start of a shard #26

Closed eliebak closed 3 months ago

eliebak commented 3 months ago

self.current_position should be 0 at the start, and not self.current_position = self.B * self.T * self.process_rank.

AlieNiT commented 3 months ago

I guess this line is supposed to give each of the processes running in parallel a different offset. For example, if we have 3 processes, B=2 (batch size), and T=5 (context length of the model):

- Process 1 (process_rank=0) iterates over positions [0:10], [30:40], [60:70], ...
- Process 2 (process_rank=1) iterates over positions [10:20], [40:50], [70:80], ...
- Process 3 (process_rank=2) iterates over positions [20:30], [50:60], [80:90], ...

If we're not using distributed training, self.process_rank will be 0 and everything works as before.

karpathy commented 3 months ago

Yes, this line is important to leave alone, so that all processes read at different places in the input.
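
For reference, here is a minimal sketch of the interleaved-reading scheme described above. The attribute names self.B, self.T, self.process_rank, and self.current_position come from the thread; the tokens tensor, the num_processes attribute, and the next_batch method are illustrative assumptions, not quotes of the repo's code.

```python
import torch

class DataLoaderLite:
    # Minimal sketch; `tokens` and `num_processes` are assumptions
    # for illustration, not names taken from the thread.
    def __init__(self, tokens, B, T, process_rank, num_processes):
        self.tokens = tokens              # 1D tensor of token ids
        self.B = B                        # micro-batch size
        self.T = T                        # sequence length
        self.process_rank = process_rank
        self.num_processes = num_processes
        # each rank starts at its own offset, as explained above
        self.current_position = self.B * self.T * self.process_rank

    def next_batch(self):
        B, T = self.B, self.T
        buf = self.tokens[self.current_position : self.current_position + B * T + 1]
        x = buf[:-1].view(B, T)  # inputs
        y = buf[1:].view(B, T)   # targets, shifted by one token
        # all ranks advance together, so the stride is B*T*num_processes
        self.current_position += B * T * self.num_processes
        # wrap around if the next batch would run past the end of the data
        if self.current_position + B * T * self.num_processes + 1 > len(self.tokens):
            self.current_position = self.B * self.T * self.process_rank
        return x, y
```

With a single process (process_rank=0, num_processes=1), the starting offset is 0 and the stride is B*T, which recovers the non-distributed behavior eliebak expected.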