Did you use a single GPU for training? How is the group_size set in remix?
@starrytong Yes, I used a single GPU. For remix I used group_size: 6.
So the training data can differ between these two configurations, since remix reorganizes the data samples within each batch.
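For context, here is a minimal sketch of what a remix-style augmentation could look like; the function, tensor shapes, and grouping logic are assumptions for illustration, not the actual SCNet code. The idea is that stems are shuffled across samples within groups of `group_size`, so the grouping, and therefore the generated mixtures, changes when the micro-batch size changes.

```python
import torch

def remix(stems: torch.Tensor, group_size: int) -> torch.Tensor:
    """Shuffle each stem independently across samples within groups of
    `group_size`, producing new source combinations for the mixtures.
    stems: (batch, num_stems, channels, time). Sketch only, not SCNet's code.
    """
    batch, num_stems = stems.shape[:2]
    out = stems.clone()
    for start in range(0, batch, group_size):
        end = min(start + group_size, batch)
        for s in range(num_stems):
            # Permute only the samples belonging to this group.
            perm = torch.randperm(end - start) + start
            out[start:end, s] = stems[perm, s]
    return out

# With batch_size=6 and group_size=6 the whole micro-batch is one group;
# with batch_size=10 the samples fall into groups of 6 and 4, so the
# remixed training examples differ even though the effective batch is 30.
```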
Hi,
I am training an SCNet model and seeing strange behavior when using gradient accumulation. With a batch size of 6 and 5 gradient accumulation steps, the model eventually stops learning. I then switched to a bigger GPU and resumed from the same checkpoint with a batch size of 10 and 3 gradient accumulation steps, and the loss started going down again. This is strange, since the effective batch size is 30 in both cases.
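Here is a simplified version of the accumulation pattern I mean (placeholder model, loss, and data just to show the structure, not the real SCNet training loop). For a purely per-sample loss, micro-batches of 6 x 5 and 10 x 3 should average to the same effective gradient over 30 samples:

```python
import torch
from torch import nn

# Placeholder model and data for illustration only.
model = nn.Linear(16, 16)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loader = [(torch.randn(6, 16), torch.randn(6, 16)) for _ in range(30)]

accumulation_steps = 5                       # 6-sample micro-batches -> effective batch of 30
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = criterion(model(x), y)
    (loss / accumulation_steps).backward()   # scale so accumulated grads average over all 30 samples
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```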
Does your model or training code do any batch-specific computation? Something similar to how batch normalization computes its statistics over the mini-batch it actually sees, not over the accumulated batch.
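As an example of that kind of dependence (generic PyTorch, not something I found in SCNet): in training mode, batch norm statistics are computed per forward pass, so splitting the same 30 samples into micro-batches of 6 versus 10 normalizes them differently.

```python
import torch
from torch import nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)
data = torch.randn(30, 4)                 # the same 30 "accumulated" samples

bn.train()
out_6 = torch.cat([bn(chunk) for chunk in data.split(6)])    # 5 micro-batches of 6
out_10 = torch.cat([bn(chunk) for chunk in data.split(10)])  # 3 micro-batches of 10
print(torch.allclose(out_6, out_10))      # False: per-micro-batch statistics differ
```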
Thank you for any advice.