goombalab / phi-mamba

Official implementation of Phi-Mamba, a MOHAWK-distilled model (Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models).
https://arxiv.org/abs/2408.10189

Batch size #2

Open tGhattas opened 4 days ago

tGhattas commented 4 days ago

Hey! I'm trying to reproduce your experiments. The paper says the batch size is 2^{15, 16, ...}. Is it safe to assume that this is not the actual per-step batch size but rather the effective size obtained with gradient accumulation, e.g. 16 (BS) * 2048 (grad accum) = 2^15? Can you confirm? And doesn't that mean very few optimizer steps?

Thanks

kevinli573 commented 4 days ago

Yes, the batch size we mention is the total number of tokens seen per optimizer update. For our experiments, we usually had a couple thousand optimizer steps per stage, since the batch size and number of tokens processed vary from stage to stage.
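
For concreteness, here is a minimal sketch of that bookkeeping; the variable names and numbers below are illustrative only, not the actual Phi-Mamba training config:

```python
# Illustrative bookkeeping only -- these values are NOT the actual training config.
micro_batch_size = 16   # sequences per forward/backward pass
grad_accum_steps = 4    # gradient accumulation steps per optimizer update
seq_len = 2048          # tokens per sequence

# "Batch size" in the paper's sense: total tokens seen per optimizer update.
tokens_per_update = micro_batch_size * grad_accum_steps * seq_len  # 2^17 here

# Optimizer steps for a stage that processes a given token budget (hypothetical).
stage_token_budget = 300_000_000
optimizer_steps = stage_token_budget // tokens_per_update

print(f"tokens/update = {tokens_per_update:,}, optimizer steps = {optimizer_steps:,}")
```

With numbers in this ballpark you end up with a few thousand optimizer steps per stage, which matches the "couple thousand" figure above.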