yonghanyu opened this issue 9 months ago
In the function train_one_epoch in the file src/training/train.py, lines 156 to 162 read as shown below:

```python
losses = loss(**inputs, **inputs_no_accum, output_dict=True)
del inputs
del inputs_no_accum
total_loss = sum(losses.values())
losses["loss"] = total_loss
backward(total_loss, scaler)
```
Shouldn't we take the average of the loss across the accumulation steps before calling backward()?
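Something like the following, for instance. This is just a sketch, assuming a variable holding the number of accumulation steps (here called accum_freq, which is an assumption; the actual name in the code may differ) is in scope at that point:

```python
total_loss = sum(losses.values())
losses["loss"] = total_loss

# Scale by the number of accumulation steps so the accumulated gradient
# approximates the gradient of a single large-batch step rather than its
# sum. `accum_freq` is assumed to be the accumulation step count; the
# actual variable name in train.py may differ.
backward(total_loss / accum_freq, scaler)
```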
Potentially, but I'm not totally sure. I think a test would be useful here, i.e., comparing runs with and without the scaling against the non-accum baseline.
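For a per-sample-decomposable loss the expected behaviour is easy to check in isolation. Here is a minimal sketch of such a test; the toy model and all names are hypothetical, and note that a contrastive loss over the full accumulated batch is not simply decomposable per micro-batch, so this only illustrates the scaling principle:

```python
import torch

# Toy check: with a decomposable loss (MSE here), accumulating micro-batch
# gradients with a 1/accum_freq scaling should reproduce the gradient of a
# single backward over the full batch.
torch.manual_seed(0)
model = torch.nn.Linear(8, 1)
loss_fn = torch.nn.MSELoss()
x, y = torch.randn(16, 8), torch.randn(16, 1)

# Non-accum baseline: one backward over the full batch.
model.zero_grad()
loss_fn(model(x), y).backward()
baseline_grad = model.weight.grad.clone()

# Accumulated version: 4 micro-batches, each loss scaled by 1/accum_freq.
accum_freq = 4
model.zero_grad()
for xb, yb in zip(x.chunk(accum_freq), y.chunk(accum_freq)):
    (loss_fn(model(xb), yb) / accum_freq).backward()

# Should print True: scaled accumulation matches the full-batch baseline.
print(torch.allclose(model.weight.grad, baseline_grad, atol=1e-6))
```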