NNTrainer currently supports mixed precision training, but gradient clipping that takes the loss scale into account has not been implemented yet.
PyTorch handles this as shown in the example below (unscale the gradients first, then clip, then step through the scaler), and NNTrainer needs an equivalent mechanism.
```python
import torch
from torch import autocast
from torch.cuda.amp import GradScaler

scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        with autocast(device_type='cuda', dtype=torch.float16):
            output = model(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()

        # Unscales the gradients of optimizer's assigned params in-place
        scaler.unscale_(optimizer)

        # Since the gradients of optimizer's assigned params are unscaled, clips as usual:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

        # optimizer's gradients are already unscaled, so scaler.step does not unscale them,
        # although it still skips optimizer.step() if the gradients contain infs or NaNs.
        scaler.step(optimizer)

        # Updates the scale for next iteration.
        scaler.update()
```
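For NNTrainer, the essential requirement is the same ordering: divide the gradients by the current loss scale first, check them for inf/NaN, and only then apply the usual max-norm clipping and the weight update. The sketch below illustrates that logic with plain tensor operations; it is only a minimal sketch, and the helper name `unscale_clip_and_check` and its parameters (`grads`, `loss_scale`, `max_norm`) are hypothetical, not NNTrainer or PyTorch API.

```python
# Minimal sketch (hypothetical helper, not NNTrainer API) of the
# unscale-then-clip logic that gradient clipping with loss scaling needs.
import math
import torch

def unscale_clip_and_check(grads, loss_scale, max_norm):
    """Unscale gradients in place, clip by global L2 norm, and report whether they are finite."""
    inv_scale = 1.0 / loss_scale
    for g in grads:
        g.mul_(inv_scale)  # remove the loss-scale factor before any norm is computed

    # Global L2 norm over all unscaled gradients (same rule clip_grad_norm_ uses).
    total_norm = torch.sqrt(sum(g.float().pow(2).sum() for g in grads)).item()
    if not math.isfinite(total_norm):
        return False  # caller should skip the weight update and lower the loss scale

    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1.0:
        for g in grads:
            g.mul_(clip_coef)
    return True
```

In the PyTorch flow above, `scaler.unscale_()` plus `clip_grad_norm_()` plays this role, and `scaler.step()` / `scaler.update()` handle skipping the step and adjusting the scale when infs or NaNs appear; NNTrainer would need the same check wired into its optimizer step.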