huggingface / pytorch-image-models

The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more
https://huggingface.co/docs/timm
Apache License 2.0
31.95k stars 4.73k forks

[BUG] DDP training error: "Reducer buckets have been rebuilt in this iteration", loss becomes NaN, and training exits #1722

Closed · tms2003 closed this 1 year ago

tms2003 commented 1 year ago

Describe the bug
I'm trying to train on my dataset with this command:


./distributed_train.sh 4 /data/test/ --cutmix 0.5 --mixup-prob 0.7 \
  --model coatnet_2_rw_224.sw_in12k_ft_in1k --amp --epochs 30 --batch-size 32 --pretrained --sync-bn --model-ema --model-ema-decay 0.99

and I got this error:

Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Added key: store_based_barrier_key:1 to store for rank: 0
Added key: store_based_barrier_key:1 to store for rank: 1
Added key: store_based_barrier_key:1 to store for rank: 2
Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
Added key: store_based_barrier_key:1 to store for rank: 3
Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
Training in distributed mode with multiple processes, 1 device per process. Process 2, total 4, device cuda:2.
Training in distributed mode with multiple processes, 1 device per process. Process 3, total 4, device cuda:3.
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
Training in distributed mode with multiple processes, 1 device per process. Process 0, total 4, device cuda:0.
Training in distributed mode with multiple processes, 1 device per process. Process 1, total 4, device cuda:1.
/home/incar/miniconda3/envs/py3.8/lib/python3.8/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/TensorShape.cpp:2894.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
(the same warning is printed once per rank)
Loading pretrained weights from Hugging Face hub (timm/coatnet_2_rw_224.sw_in12k_ft_in1k)
filename:pytorch_model.bin
Loading pretrained weights from Hugging Face hub (timm/coatnet_2_rw_224.sw_in12k_ft_in1k)
filename:pytorch_model.bin
Loading pretrained weights from Hugging Face hub (timm/coatnet_2_rw_224.sw_in12k_ft_in1k)
filename:pytorch_model.bin
Loading pretrained weights from Hugging Face hub (timm/coatnet_2_rw_224.sw_in12k_ft_in1k)
filename:pytorch_model.bin
Model coatnet_2_rw_224_sw_in12k_ft_in1k created, param count:73868400
Data processing configuration for current model + dataset:
        input_size: (3, 224, 224)
        interpolation: bicubic
        mean: (0.5, 0.5, 0.5)
        std: (0.5, 0.5, 0.5)
        crop_pct: 0.95
        crop_mode: center
Converted model to use Synchronized BatchNorm. WARNING: You may have issues if using zero initialized BN layers (enabled by default for ResNets) while sync-bn enabled.
Learning rate (0.05) calculated from base learning rate (0.1) and global batch size (128) with linear scaling.
Using native Torch AMP. Training in mixed precision.
/home/incar/miniconda3/envs/py3.8/lib/python3.8/site-packages/torch/autograd/__init__.py:173: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed.  This is not an error, but may impair performance.
grad.sizes() = [1024, 4096, 1, 1], strides() = [4096, 1, 4096, 4096]
bucket_view.sizes() = [1024, 4096, 1, 1], strides() = [4096, 1, 1, 1] (Triggered internally at /opt/conda/conda-bld/pytorch_1659484810403/work/torch/csrc/distributed/c10d/reducer.cpp:312.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
(the same warning is printed once per rank)

Train: 0 [   0/2816 (  0%)]  Loss: 9.223 (9.22)  Time: 8.691s,   14.73/s  (8.691s,   14.73/s)  LR: 1.000e-05  Data: 0.493 (0.493)
Reducer buckets have been rebuilt in this iteration.
Reducer buckets have been rebuilt in this iteration.
Reducer buckets have been rebuilt in this iteration.
Reducer buckets have been rebuilt in this iteration.
Train: 0 [  50/2816 (  2%)]  Loss: 6.922 (8.07)  Time: 0.297s,  431.38/s  (0.459s,  278.67/s)  LR: 1.000e-05  Data: 0.004 (0.014)
Train: 0 [ 100/2816 (  4%)]  Loss: 6.903 (7.68)  Time: 0.278s,  460.46/s  (0.377s,  339.90/s)  LR: 1.000e-05  Data: 0.004 (0.009)
Train: 0 [ 150/2816 (  5%)]  Loss: 6.834 (7.47)  Time: 0.297s,  431.31/s  (0.347s,  369.32/s)  LR: 1.000e-05  Data: 0.004 (0.007)
Train: 0 [ 200/2816 (  7%)]  Loss: 6.836 (7.34)  Time: 0.299s,  428.48/s  (0.333s,  384.48/s)  LR: 1.000e-05  Data: 0.004 (0.006)
Train: 0 [ 250/2816 (  9%)]  Loss: 6.842 (7.26)  Time: 0.292s,  437.72/s  (0.325s,  393.64/s)  LR: 1.000e-05  Data: 0.003 (0.006)
Train: 0 [ 300/2816 ( 11%)]  Loss: 6.789 (7.19)  Time: 0.274s,  467.43/s  (0.319s,  400.71/s)  LR: 1.000e-05  Data: 0.004 (0.005)

To Reproduce
Steps to reproduce the behavior:

  1. Run the command above.

Expected behavior
I want training to continue, rather than the loss becoming NaN and the run exiting.

Screenshots
(screenshot attached)


rwightman commented 1 year ago

@tms2003 As the message says, those reducer / shape warnings are not errors. It's a bit annoying that they haven't been removed, since they show up in a LOT of workloads... ignore them.

The NaN is expected with that command line; you should use gradient clipping, AdamW, and a much lower learning rate than the defaults.
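
A minimal sketch of such a command, assuming the same 4-GPU setup as above (--opt, --lr, --warmup-epochs, --clip-grad, and --clip-mode are existing timm train.py arguments; the specific values here are illustrative starting points, not taken from this thread):

./distributed_train.sh 4 /data/test/ --cutmix 0.5 --mixup-prob 0.7 \
  --model coatnet_2_rw_224.sw_in12k_ft_in1k --pretrained --amp --epochs 30 --batch-size 32 \
  --sync-bn --model-ema --model-ema-decay 0.99 \
  --opt adamw --lr 1e-4 --warmup-epochs 3 --clip-grad 1.0 --clip-mode norm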