k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Prevent large values in conv module in wav2vec2_module.py in SSL recipe #1593

Open. danpovey opened this issue 1 month ago.

teowenshen commented 3 weeks ago

I'm moving the conversation here since the previous PR was closed.

I ran @yfyeung's training command using the merged k2ssl code, with the batch size and world size adjusted to fit my environment. However, training crashed at epoch 31.

I then implemented @danpovey's recommended changes and reran training from the checkpoint at epoch 26. This time, training crashed at epoch 30.

The inf_check output for epoch 30 is attached below. Infinity occurred in the forward pass of the feature_extractor. inf_check_ep30.txt

I also have the diagnostics for the start of epoch 30, but the file is too big to attach on GitHub. I can send it by email if you'd like to take a look. From what I can see, the convolutional modules in the feature_extractor do indeed have larger outputs.

Also, I'm wondering whether I should rerun training from scratch with the recommended change of penalizing large abs_value, or change the feature_extractor to an Fbank-based Conv2dSubsampling. The current implementation consumes the waveform directly, matching the original HuBERT. What do you think?

yfyeung commented 3 weeks ago


Hi, batch size is crucial for SSL. When the batch size decreases, gradient noise becomes very large, which has a bad impact on half-precision training and convergence. Here is the tensorboard of the pre-training: [tensorboard screenshot]

teowenshen commented 3 weeks ago

When the batch size decreases, gradient noise becomes very large, which has a bad impact on half-precision training and convergence.

I see. This is my training command:

python zipformer/pretrain.py \
  --world-size 4 \
  --num-epochs 100 \
  --start-epoch 30 \
  --use-fp16 1 \
  --exp-dir zipformer/exp3/pretrain \
  --manifest-dir data/raw \
  --full-libri 1 \
  --max-duration 300 \
  --accum-grad 4 \
  --do-normalize 0 \
  --mask-prob 0.8 \
  --dropout-input 0.1 \
  --dropout-features 0.1 \
  --feature-grad-mult 0.1 \
  --untie-final-proj 1 \
  --num-encoder-layers "2,2,3,4,3,2" \
  --feedforward-dim "512,768,1024,1536,1024,768" \
  --encoder-dim "192,256,448,768,448,192" \
  --encoder-unmasked-dim "192,192,256,256,256,192" \
  --base-lr 0.045

A little explanation of how I decided my batch size: from your command, I think you have a total of 4800 s of audio per batch (i.e., per parameter update). Based on my available GPUs (world-size=4), I chose accum-grad=4 and max-duration=300 so that the same 4800 s of data is accumulated before the model is updated. Do you think this approximation of your setup could be wrong?
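The quick arithmetic behind that choice (variable names are just for illustration):

world_size = 4       # GPUs
accum_grad = 4       # gradient accumulation steps
max_duration = 300   # seconds of audio per GPU per forward/backward pass

# seconds of audio accumulated before each parameter update
effective_batch_seconds = world_size * accum_grad * max_duration
print(effective_batch_seconds)  # 4800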

Here is the tensorboard of my pre-training experiments.

Thank you so much! I will have a look!

yfyeung commented 3 weeks ago
  --max-duration 300 \
  --accum-grad 4 \

The current gradient accumulation mechanism simulates a multi-GPU setup. You could simulate my setup using 4 GPUs by setting accum-grad to 2 and max-duration to 600, but it cannot simulate both the multi-GPU setup and a large batch size at the same time.
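For illustration, a toy sketch of what gradient accumulation does (this is not the actual pretrain.py loop, just the general mechanism):

import torch
import torch.nn as nn

# Toy model and data, only to show the mechanics of accumulation.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accum_grad = 4

batches = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(8)]

for step, (x, y) in enumerate(batches):
    # Scale the loss so the accumulated gradient is an average over the
    # accum_grad sub-batches rather than a sum.
    loss = nn.functional.mse_loss(model(x), y) / accum_grad
    loss.backward()  # gradients accumulate in .grad across sub-batches
    if (step + 1) % accum_grad == 0:
        optimizer.step()       # one parameter update per accum_grad sub-batches
        optimizer.zero_grad()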

IMO, using 8 GPUs and the same max-duration should make it easy to reproduce our experimental results.

danpovey commented 3 weeks ago

OK, this error is different from the error you got before. It's the grads that are infinite, not the activations:

2024-04-22 11:07:35,431 WARNING [hooks.py:69] (2/4) The sum of module.feature_extractor.conv_layers.0.0.grad[0] is not finite
2024-04-22 11:07:35,438 WARNING [hooks.py:69] (2/4) The sum of module.feature_extractor.conv_layers.0.1.grad[0] is not finite
2024-04-22 11:07:35,438 WARNING [hooks.py:78] (2/4) The sum of module.feature_extractor.conv_layers.0.0.weight.param_grad is not finite

Note that these are the grads after aggregating over the batch. Too-large grads on the first input convolution layer when training with fp16 are a problem I have noticed before.

Probably the most reliable fix would be to insert a ScaleGrad module to scale the grad down at that point during backprop, e.g. insert it after conv_layers.0, into the nn.Sequential's list. See how I have used it in the zipformer recipe to solve a similar issue. This would change the module numbering, though, so it would affect loading parameters from existing checkpoints. Alternatively, you might be able to add a ScaleGrad module at the end of whatever layer type you use in conv_layers. Use a moderate scale, like 0.5, if it will appear multiple times. It must never be inserted in a "side branch" of the computation graph that will re-join, or it will cause wrong update directions (just don't add it anywhere except where computation is purely sequential). Also be careful: a too-small scale will lead to too much roundoff before the gradients are aggregated.

You can make a pull request on top of my pull request. I believe this will make the training recipe more robust; it's good that you found this issue, and we should make sure to commit a fix to icefall.
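For reference, the idea is a module that is the identity in the forward pass and scales the gradient in the backward pass. Below is a minimal sketch; the actual module in the zipformer recipe's scaling.py may differ in details:

import torch
import torch.nn as nn

class ScaleGradFunction(torch.autograd.Function):
    """Identity in forward; multiplies the incoming gradient by `scale` in backward."""

    @staticmethod
    def forward(ctx, x: torch.Tensor, scale: float) -> torch.Tensor:
        ctx.scale = scale
        # Return a view so autograd records this Function as the producer of the output.
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor):
        # Gradient w.r.t. x is scaled; the scale itself gets no gradient.
        return grad_output * ctx.scale, None

class ScaleGrad(nn.Module):
    def __init__(self, scale: float):
        super().__init__()
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return ScaleGradFunction.apply(x, self.scale)

Since it has no parameters, it adds nothing to the state_dict itself, but inserting it into an existing nn.Sequential shifts the indices of the modules that come after it, which is the checkpoint-loading caveat above.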

BTW, my approach with the zipformer recipe has been to fix instability or crashes one by one like this, as they appear, in the hope that after we address all the failure modes the recipe should be quite robust.

teowenshen commented 3 weeks ago

Thank you for your guidance! By comparing subsampling.py with wav2vec2_module.py, I think I understand what you mean. This is what I am going to test once I have GPUs available again:

  if is_layer_norm:
      return nn.Sequential(
          make_conv(),
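          # New insertion: ScaleGrad passes activations through unchanged in the
          # forward pass and scales gradients by 0.5 during backprop.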
          ScaleGrad(0.5),
          nn.Dropout(p=dropout),
          nn.Sequential(
              TransposeLast(),
              Fp32LayerNorm(dim, elementwise_affine=True),
              TransposeLast(),
          ),
          nn.GELU(),
      )
  elif is_group_norm:
      return nn.Sequential(
          make_conv(),
          ScaleGrad(0.5),
          nn.Dropout(p=dropout),
          Fp32GroupNorm(dim, dim, affine=True),
          nn.GELU(),
      )
  else:
      return nn.Sequential(
          make_conv(),
          ScaleGrad(0.5),
          nn.Dropout(p=dropout),
          nn.GELU(),
      )

Hopefully the anonymity period ends soon. Meanwhile, I will continue debugging on my end and will report back if I find something that works for my setup.