k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Zipformer does not converge when trained on multi hosts #1041

Open zhangyike opened 1 year ago

zhangyike commented 1 year ago

Hi, I integrated the Zipformer code as well as the ScaledAdam/Eden optimizer into my project and reached satisfactory results when training the model with 8 GPUs on a single host. However, the Zipformer model converges much worse when I train it with 16 GPUs on two hosts. I have adjusted different hyper-parameters such as the learning rate, lr_batches, and lr_epochs, but the losses on the CV sets no longer decrease after epoch 1. Can anyone give some advice? Thanks.

zhangyike commented 1 year ago

The code and training data are the same for both the single-host and multi-host training setups.

danpovey commented 1 year ago

Perhaps you could share some plots? I want to see whether it just fails to find the alignment completely (e.g. loss > 0.3 or so). How much data do you have?

BTW, something I would be concerned about in this scenario is whether the model parameters start off identical, and remain identical, across the 2 machines. DDP, I believe, never syncs model parameters; it relies on the fact that the gradients are synced across workers and that the initial parameters and parameter updates are identical. This in turn relies on the code path being identical in the update code and on the kernels invoked in the parameter update all being deterministic, which probably requires identical library versions across the machines involved. If the parameters become different on different hosts, this is not good. To test this you could do something like: occasionally compute the sum of all the parameters in the model, print it out, and check that the printed-out value is identical across all the workers.
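
A minimal sketch of that parameter-sum check, assuming a standard PyTorch DDP setup (the helper name and logging interval are illustrative only, not icefall code):

import torch
import torch.distributed as dist

def log_param_sum(model: torch.nn.Module, batch_idx: int) -> None:
    # Sum of all parameters on this worker; the printed value should be
    # identical across ranks if DDP is keeping the replicas in sync.
    with torch.no_grad():
        total = sum(p.detach().double().sum().item() for p in model.parameters())
    print(f"rank {dist.get_rank()} batch {batch_idx}: param sum = {total:.12f}", flush=True)

# e.g. inside the training loop:
#     if batch_idx % 1000 == 0:
#         log_param_sum(model, batch_idx)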

zhangyike commented 1 year ago

Here are some details. Training data is about 2000 hours. Hyper-parameters are: lr: 0.01, lr_batches: 5000, lr_epochs: 3.5.

1) On a single host with 8 GPUs, losses are:
epoch 1 cv_loss: 8.771843910217285
epoch 2 cv_loss: 6.188486576080322
epoch 3 cv_loss: 5.273587703704834
epoch 4 cv_loss: 4.710790634155273
epoch 5 cv_loss: 4.387606620788574
epoch 6 cv_loss: 4.074428558349609
epoch 7 cv_loss: 3.8851776123046875
epoch 8 cv_loss: 3.7500312328338623
epoch 9 cv_loss: 3.635695457458496
epoch 10 cv_loss: 3.585954189300537

2) On two hosts, each with 8 GPUs, losses are:
epoch 1 cv_loss: 13.900306701660156
epoch 2 cv_loss: 8.846643447875977
epoch 3 cv_loss: 7.499380588531494
epoch 4 cv_loss: 6.767868995666504
epoch 5 cv_loss: 6.994665145874023

I have tried different hyper-parameter setups in the two-host case, but they only affect the cv loss of the first two or three epochs.

I have also tried Zipformer on a large dataset of about 40,000 hours of audio. Hyper-parameters are: lr: 0.01, lr_batches: 5000, lr_epochs: 3.5. The cv losses on two hosts are:
epoch 1 cv_loss: 5.434179306030273
epoch 2 cv_loss: 5.555027484893799
epoch 3 cv_loss: 5.392739772796631

The cv losses on a single host are:
epoch 1 cv_loss: 3.1601388454437256
epoch 2 cv_loss: 2.655914068222046
epoch 3 cv_loss: 2.4898722171783447
epoch 4 cv_loss: 2.388228178024292

zhangyike commented 1 year ago

losses on two hosts are:

I use a hybrid CTC and AED architecture; the CTC weight is 0.2.

danpovey commented 1 year ago

You must be normalizing the loss differently from us? Because normally our losses are around 0.1 or less. If it is failing to discover the alignment, it could be that it's because the large number of workers is making it 'warm up' too fast. In get_adjusted_batch_count(), it multiplies by world_size. You could try adjusting that formula e.g. removing the world_size factor or making that factor max out at 4 or so. (also: the more logging output you show, the more I'd be able to say.)
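
For illustration, the kind of adjustment suggested here might look like the sketch below. get_adjusted_batch_count() had not been committed at the time of this thread, so the signature and formula are assumptions about its general shape, not the actual icefall code:

def get_adjusted_batch_count(batch_idx_train: int, world_size: int) -> float:
    # Scale the batch count used by the warmup schedules with the number of
    # workers, but cap the factor at 4 so a 16-GPU run does not "warm up"
    # 16x faster than a single-GPU run.
    return batch_idx_train * min(float(world_size), 4.0)

# the adjusted count would then be propagated to the model via whatever
# mechanism the recipe uses to set the model's batch_count.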

Also, the base_lr of 0.01 is very low. That wouldn't cause this kind of problem, but I think a higher value would give better results in later epochs. Even small changes in lr can make a big difference in this setup.

danpovey commented 1 year ago

We normally print out lots of things, and this info would tell me a lot. E.g. you mention it has an attention encoder-decoder. If you are using a conventional Transformer, this can interact badly with fp16 training; if you are using fp16, it would be interesting to know the grad_scale values.
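
For reference, if fp16 training were enabled via PyTorch AMP, the grad_scale can be read from the standard GradScaler API; this is generic torch.cuda.amp usage shown only as an illustration, not a snippet from the recipe:

import torch

scaler = torch.cuda.amp.GradScaler()

# inside the training loop, after scaler.step(optimizer) and scaler.update():
#     print(f"grad_scale: {scaler.get_scale()}")
# a grad_scale that keeps shrinking is a typical symptom of fp16 trouble.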

zhangyike commented 1 year ago

Thanks for your advice. I will first check whether the parameters on different hosts are the same, and I will also adjust the warm-up settings. In addition, I do not use fp16 training.

zhangyike commented 1 year ago

I trained a model with the same setup as mentioned above using 2 hosts, each with 4 GPUs. The cv losses are similar to those of the model trained on a single host with 8 GPUs. Also, I printed the model parameters on each GPU, and they are exactly the same. So I guess some hyper-parameters should be adjusted when training on more GPUs. I cannot find the function 'get_adjusted_batch_count()', but I guess what you mean is the factor in the following code.

class Eden(LRScheduler):
    """
    Eden scheduler.
    The basic formula (before warmup) is:
      lr = base_lr * (((batch**2 + lr_batches**2) / lr_batches**2) ** -0.25 *
                     (((epoch**2 + lr_epochs**2) / lr_epochs**2) ** -0.25)) * warmup
    where `warmup` increases from linearly 0.5 to 1 over `warmup_batches` batches
    and then stays constant at 1.

     E.g. suggest base_lr = 0.04 (passed to optimizer) if used with ScaledAdam

    Args:
        optimizer: the optimizer to change the learning rates on
        lr_batches: the number of batches after which we start significantly
              decreasing the learning rate, suggest 5000.
        lr_epochs: the number of epochs after which we start significantly
              decreasing the learning rate, suggest 6 if you plan to do e.g.
              20 to 40 epochs, but may need smaller number if dataset is huge
              and you will do few epochs.
    """

    def __init__(
        self,
        optimizer: Optimizer,
        lr_batches: Union[int, float],
        lr_epochs: Union[int, float],
        warmup_batches: Union[int, float] = 500.0,
        verbose: bool = False,
    ):
        super(Eden, self).__init__(optimizer, verbose)
        self.lr_batches = lr_batches
        self.lr_epochs = lr_epochs
        self.warmup_batches = warmup_batches

    def get_lr(self):
        factor = (
            (self.batch**2 + self.lr_batches**2) / self.lr_batches**2
        ) ** -0.25 * (
            ((self.epoch**2 + self.lr_epochs**2) / self.lr_epochs**2) ** -0.25
        )
        warmup_factor = (
            1.0
            if self.batch >= self.warmup_batches
            else 0.5 + 0.5 * (self.batch / self.warmup_batches)
        )

        return [x * factor * warmup_factor for x in self.base_lrs]
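
As a quick numerical illustration of the formula above, using the hyper-parameters mentioned earlier in this thread (base_lr 0.01, lr_batches 5000, lr_epochs 3.5, warmup_batches 500); this is a standalone sketch, not part of the recipe:

base_lr, lr_batches, lr_epochs, warmup_batches = 0.01, 5000.0, 3.5, 500.0

for batch, epoch in [(250, 0), (2000, 1), (10000, 3), (40000, 10)]:
    factor = ((batch**2 + lr_batches**2) / lr_batches**2) ** -0.25 * (
        (epoch**2 + lr_epochs**2) / lr_epochs**2
    ) ** -0.25
    warmup = 1.0 if batch >= warmup_batches else 0.5 + 0.5 * batch / warmup_batches
    print(f"batch {batch:6d} epoch {epoch:2d}: lr = {base_lr * factor * warmup:.6f}")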

I do not think the model warms up too fast, since 'self.batch' is accumulated separately on each GPU when I launch the training task in the following way:

        for ((i = 0; i < $num_gpus; ++i)); do
        {
            rank=`expr $node_rank \* $num_gpus + $i`
            python wenet/bin/train_deprecated_debug.py --gpu $i \
                --config $train_config \
                --train_data $feat_dir/${train_set}/${train_data} \
                --cv_data $feat_dir/${dev_set}/format.data.gz \
                ${checkpoint:+--checkpoint $checkpoint} \
                --model_dir $dir \
                --ddp.init_method $init_method \
                --ddp.world_size $world_size \
                --ddp.rank $rank \
                --ddp.dist_backend $dist_backend \
                --num_workers 3 \
                $cmvn_opts >> $dir/log_${rank} 2>&1
        } &
        done

danpovey commented 1 year ago

ah, get_adjusted_batch_count() is in a version that we haven't committed yet. The absolute value of your loss is quite large but perhaps you are normalizing per sequence instead of per symbol. You could perhaps try increasing warmup_batches from 500 to 1000. In general having more GPUs should make convergence better, not worse, since you are averaging the update over a larger number of workers, so the noise aspect of the update gets reduced.

You may need to make sure that the data loader actually loads different data across the different workers, but I doubt that will be a problem as it just depends on rank and world-size. Re "I print the model parameters on each GPU, these are exactly the same": make sure this is also the case after training the model for a bit.
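
As a generic illustration of the rank/world-size dependence mentioned above, a plain PyTorch setup would shard the data with a DistributedSampler roughly like this (the actual icefall/wenet loaders do their own sharding, so treat this purely as a sketch; the dataset here is a placeholder):

import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# placeholders for illustration; in a real run these come from the recipe / launcher
train_dataset = TensorDataset(torch.arange(1000).float())
world_size, rank = 16, 0

sampler = DistributedSampler(train_dataset, num_replicas=world_size, rank=rank, shuffle=True)
train_loader = DataLoader(train_dataset, batch_size=16, sampler=sampler)

# call sampler.set_epoch(epoch) at the start of each epoch so each rank's
# shard is reshuffled differently from epoch to epoch.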

Transformer-type model convergence tends to be tricky. Our more recent version of the zipformer that we are just about to commit (really soon, this time, maybe a week) is maybe slightly better convergence-wise. I am surprised it is not converging well with the low learning rate of 0.01. With such a low learning rate it may possibly be necessary to increase the warmup periods a bit (there may be 2 warmup periods specified, for the loss function and the model).

zhangyike commented 1 year ago

Thank you for your suggestions. I found the reason why the Zipformer does not converge when trained with 16 GPUs on my dataset: it is due to the model warmup setting.


def get_layers_to_drop(self, rnd_seed: int):
    ans = set()
    if not self.training:
        return ans

    batch_count = self.batch_count
    num_layers = len(self.layers)

    def get_layerdrop_prob(layer: int) -> float:
        layer_warmup_begin = self.layers[layer].warmup_begin
        layer_warmup_end = self.layers[layer].warmup_end

        initial_layerdrop_prob = 0.5
        final_layerdrop_prob = 0.05

        if batch_count == 0:
            # As a special case, if batch_count == 0, return 0 (drop no
            # layers).  This is rather ugly, I'm afraid; it is intended to
            # enable our scan_pessimistic_batches_for_oom() code to work correctly
            # so if we are going to get OOM it will happen early.
            # also search for 'batch_count' with quotes in this file to see
            # how we initialize the warmup count to a random number between
            # 0 and 10.
            return 0.0
        elif batch_count < layer_warmup_begin:
            return initial_layerdrop_prob
        elif batch_count > layer_warmup_end:
            return final_layerdrop_prob
        else:
            # linearly interpolate
            t = (batch_count - layer_warmup_begin) / layer_warmup_end
            assert 0.0 <= t < 1.001, t
            return initial_layerdrop_prob + t * (
                final_layerdrop_prob - initial_layerdrop_prob
            )

The drop rate is related to the batch count: it decreases from 0.5 to 0.05 as the batch count increases from 0 to layer_warmup_end. When I use 16 GPUs, there are about 2000 batches in one epoch, but the default value for layer_warmup_end is 4000 in the k2 scripts. Hence, the drop rate is rather large during the whole training process. After I set the warmup_batches in the model warmup to 1600, the Zipformer converged well on 16 GPUs. Although the cv losses are slightly worse than those of the model trained on 8 GPUs, I think the gap would disappear if I adjusted some other hyper-parameters.
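
A quick numeric check of the interpolation in the snippet above, taking warmup_begin as 0 for simplicity (in the real model each layer has its own warmup_begin/warmup_end):

def layerdrop_prob(batch_count, warmup_begin, warmup_end, initial=0.5, final=0.05):
    # same piecewise schedule as get_layerdrop_prob() above
    if batch_count < warmup_begin:
        return initial
    if batch_count > warmup_end:
        return final
    t = (batch_count - warmup_begin) / warmup_end
    return initial + t * (final - initial)

print(layerdrop_prob(2000, 0.0, 4000.0))  # ~0.275: still high with the default warmup_end=4000
print(layerdrop_prob(2000, 0.0, 1600.0))  # 0.05: already past warmup once warmup_end is lowered to 1600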

When training on a large-scale dataset, the default model warmup setting would not make a difference. I mentioned earlier that the Zipformer cannot converge on a large-scale dataset; this may be because I mistakenly set base_lr to 0.05 rather than 0.01. I will verify it later.

danpovey commented 1 year ago

You should be able to get it to converge with a base_lr of 0.04 or 0.035. Small changes in base-LR will make a big difference, much more than if you were using Adam and a more typical network with layer-norms.