yonghanyu opened this issue 9 months ago
In the function train_one_epoch in the file src/training/train.py, lines 156 to 162 read as shown below:

```python
losses = loss(**inputs, **inputs_no_accum, output_dict=True)
del inputs
del inputs_no_accum
total_loss = sum(losses.values())
losses["loss"] = total_loss
backward(total_loss, scaler)
```
Shouldn't we take the average of the loss across the accumulation steps before calling backward()?
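Something like the following, for instance. This is just a sketch, assuming a variable holding the number of accumulation steps (here called accum_freq, which is an assumption; the actual name in the code may differ) is in scope at that point:

```python
total_loss = sum(losses.values())
losses["loss"] = total_loss

# Scale by the number of accumulation steps so the accumulated gradient
# approximates the gradient of a single large-batch step rather than its
# sum. `accum_freq` is assumed to be the accumulation step count; the
# actual variable name in train.py may differ.
backward(total_loss / accum_freq, scaler)
```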
Potentially, but I'm not totally sure. I think a test would be useful here, i.e., comparing runs with and without the scaling against the non-accum baseline.
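For a per-sample-decomposable loss the expected behaviour is easy to check in isolation. Here is a minimal sketch of such a test; the toy model and all names are hypothetical, and note that a contrastive loss over the full accumulated batch is not simply decomposable per micro-batch, so this only illustrates the scaling principle:

```python
import torch

# Toy check: with a decomposable loss (MSE here), accumulating micro-batch
# gradients with a 1/accum_freq scaling should reproduce the gradient of a
# single backward over the full batch.
torch.manual_seed(0)
model = torch.nn.Linear(8, 1)
loss_fn = torch.nn.MSELoss()
x, y = torch.randn(16, 8), torch.randn(16, 1)

# Non-accum baseline: one backward over the full batch.
model.zero_grad()
loss_fn(model(x), y).backward()
baseline_grad = model.weight.grad.clone()

# Accumulated version: 4 micro-batches, each loss scaled by 1/accum_freq.
accum_freq = 4
model.zero_grad()
for xb, yb in zip(x.chunk(accum_freq), y.chunk(accum_freq)):
    (loss_fn(model(xb), yb) / accum_freq).backward()

# Should print True: scaled accumulation matches the full-batch baseline.
print(torch.allclose(model.weight.grad, baseline_grad, atol=1e-6))
```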