Did you use a single GPU for training? How is the group_size set in remix?
@starrytong Yes, I used a single GPU. For remix I used group_size: 6.
So the training data can differ between these two configurations, since remix reorganizes the data samples within each batch.
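For context, here is a minimal sketch of what a remix-style augmentation could look like; the function, tensor shapes, and grouping logic are assumptions for illustration, not the actual SCNet code. The idea is that stems are shuffled across samples within groups of `group_size`, so the grouping, and therefore the generated mixtures, changes when the micro-batch size changes.

```python
import torch

def remix(stems: torch.Tensor, group_size: int) -> torch.Tensor:
    """Shuffle each stem independently across samples within groups of
    `group_size`, producing new source combinations for the mixtures.
    stems: (batch, num_stems, channels, time). Sketch only, not SCNet's code.
    """
    batch, num_stems = stems.shape[:2]
    out = stems.clone()
    for start in range(0, batch, group_size):
        end = min(start + group_size, batch)
        for s in range(num_stems):
            # Permute only the samples belonging to this group.
            perm = torch.randperm(end - start) + start
            out[start:end, s] = stems[perm, s]
    return out

# With batch_size=6 and group_size=6 the whole micro-batch is one group;
# with batch_size=10 the samples fall into groups of 6 and 4, so the
# remixed training examples differ even though the effective batch is 30.
```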
Hi,
I am training an SCNet model and seeing strange behavior when using gradient accumulation. With a batch size of 6 and 5 gradient accumulation steps, the model eventually stops learning. I then switched to a bigger GPU and resumed from the same checkpoint with a batch size of 10 and 3 gradient accumulation steps, and the loss started going down again. This is strange, since the effective batch size is 30 in both cases.
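Here is a simplified version of the accumulation pattern I mean (placeholder model, loss, and data just to show the structure, not the real SCNet training loop). For a purely per-sample loss, micro-batches of 6 x 5 and 10 x 3 should average to the same effective gradient over 30 samples:

```python
import torch
from torch import nn

# Placeholder model and data for illustration only.
model = nn.Linear(16, 16)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loader = [(torch.randn(6, 16), torch.randn(6, 16)) for _ in range(30)]

accumulation_steps = 5                       # 6-sample micro-batches -> effective batch of 30
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = criterion(model(x), y)
    (loss / accumulation_steps).backward()   # scale so accumulated grads average over all 30 samples
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```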
Does your model or training code do any batch-specific computation? Something similar to how batch normalization computes its statistics over the mini-batch it actually sees, not over the accumulated batch.
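As an example of that kind of dependence (generic PyTorch, not something I found in SCNet): in training mode, batch norm statistics are computed per forward pass, so splitting the same 30 samples into micro-batches of 6 versus 10 normalizes them differently.

```python
import torch
from torch import nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)
data = torch.randn(30, 4)                 # the same 30 "accumulated" samples

bn.train()
out_6 = torch.cat([bn(chunk) for chunk in data.split(6)])    # 5 micro-batches of 6
out_10 = torch.cat([bn(chunk) for chunk in data.split(10)])  # 3 micro-batches of 10
print(torch.allclose(out_6, out_10))      # False: per-micro-batch statistics differ
```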
Thank you for any advice.