NNTrainer currently supports mixed precision training, but gradient clipping that takes the loss scale into account has not been implemented yet.
PyTorch handles this as shown in the example below (unscale the gradients first, then clip, then step through the scaler), and NNTrainer needs an equivalent mechanism.
```python
import torch
from torch import autocast
from torch.cuda.amp import GradScaler

scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        with autocast(device_type='cuda', dtype=torch.float16):
            output = model(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()

        # Unscales the gradients of optimizer's assigned params in-place
        scaler.unscale_(optimizer)

        # Since the gradients of optimizer's assigned params are unscaled, clips as usual:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

        # optimizer's gradients are already unscaled, so scaler.step does not unscale them,
        # although it still skips optimizer.step() if the gradients contain infs or NaNs.
        scaler.step(optimizer)

        # Updates the scale for next iteration.
        scaler.update()
```
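For NNTrainer, the essential requirement is the same ordering: divide the gradients by the current loss scale first, check them for inf/NaN, and only then apply the usual max-norm clipping and the weight update. The sketch below illustrates that logic with plain tensor operations; it is only a minimal sketch, and the helper name `unscale_clip_and_check` and its parameters (`grads`, `loss_scale`, `max_norm`) are hypothetical, not NNTrainer or PyTorch API.

```python
# Minimal sketch (hypothetical helper, not NNTrainer API) of the
# unscale-then-clip logic that gradient clipping with loss scaling needs.
import math
import torch

def unscale_clip_and_check(grads, loss_scale, max_norm):
    """Unscale gradients in place, clip by global L2 norm, and report whether they are finite."""
    inv_scale = 1.0 / loss_scale
    for g in grads:
        g.mul_(inv_scale)  # remove the loss-scale factor before any norm is computed

    # Global L2 norm over all unscaled gradients (same rule clip_grad_norm_ uses).
    total_norm = torch.sqrt(sum(g.float().pow(2).sum() for g in grads)).item()
    if not math.isfinite(total_norm):
        return False  # caller should skip the weight update and lower the loss scale

    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1.0:
        for g in grads:
            g.mul_(clip_coef)
    return True
```

In the PyTorch flow above, `scaler.unscale_()` plus `clip_grad_norm_()` plays this role, and `scaler.step()` / `scaler.update()` handle skipping the step and adjusting the scale when infs or NaNs appear; NNTrainer would need the same check wired into its optimizer step.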