tGhattas opened this issue 4 days ago
Hey! I'm trying to reproduce your experiments. The paper says the batch size is 2^{15}, 2^{16}, .... Is it safe to assume this is not the actual per-device batch size but that it's coupled with gradient accumulation, e.g. something like a micro-batch size of 16 times 2048 gradient-accumulation steps = 2^15? Can you confirm? And doesn't that mean very few optimizer steps?
Thanks

Yes, the batch size we mention is the total number of tokens seen per optimizer update. For our experiments we usually had a couple of thousand optimizer steps per stage, since the batch size and number of tokens processed vary from stage to stage.
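
For anyone else reproducing this, here is a minimal sketch of how gradient accumulation reaches an effective batch of 2^15 tokens per optimizer update. The micro-batch of 16 and the 2048 accumulation steps are the hypothetical numbers from the question above, not confirmed values, and the model/optimizer are stand-ins:

```python
import torch

# Hypothetical split from the question above (not confirmed values):
# an effective batch of 2**15 tokens per optimizer update, reached via
# gradient accumulation rather than one huge forward pass.
tokens_per_micro_batch = 16
grad_accum_steps = 2048
assert tokens_per_micro_batch * grad_accum_steps == 2 ** 15

model = torch.nn.Linear(8, 8)                    # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

optimizer.zero_grad()
for step in range(grad_accum_steps * 2):         # 2 optimizer updates total
    x = torch.randn(tokens_per_micro_batch, 8)   # dummy micro-batch
    loss = model(x).pow(2).mean()
    # Scale the loss so the accumulated gradient matches the gradient of
    # a single large batch of 2**15 tokens.
    (loss / grad_accum_steps).backward()
    if (step + 1) % grad_accum_steps == 0:
        optimizer.step()                          # one update per 2**15 tokens
        optimizer.zero_grad()
```

Note the loss is divided by `grad_accum_steps` before `backward()`, so summing gradients over the accumulation window is equivalent to averaging over one large batch.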