huggingface / pytorch-image-models

The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more
https://huggingface.co/docs/timm
Apache License 2.0
31.95k stars 4.73k forks

[BUG] DDP training error: "Reducer buckets have been rebuilt in this iteration", loss becomes NaN, and training exits #1722

Closed · tms2003 closed this 1 year ago

tms2003 commented 1 year ago

Describe the bug
I'm trying to train on my dataset with this command:


./distributed_train.sh 4 /data/test/ --cutmix 0.5 --mixup-prob 0.7 \
  --model coatnet_2_rw_224.sw_in12k_ft_in1k --amp --epochs 30 --batch-size 32 --pretrained --sync-bn --model-ema --model-ema-decay 0.99

and I got this error:

Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Added key: store_based_barrier_key:1 to store for rank: 0
Added key: store_based_barrier_key:1 to store for rank: 1
Added key: store_based_barrier_key:1 to store for rank: 2
Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
Added key: store_based_barrier_key:1 to store for rank: 3
Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
Training in distributed mode with multiple processes, 1 device per process. Process 2, total 4, device cuda:2.
Training in distributed mode with multiple processes, 1 device per process. Process 3, total 4, device cuda:3.
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
Training in distributed mode with multiple processes, 1 device per process. Process 0, total 4, device cuda:0.
Training in distributed mode with multiple processes, 1 device per process. Process 1, total 4, device cuda:1.
/home/incar/miniconda3/envs/py3.8/lib/python3.8/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/TensorShape.cpp:2894.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
(the same warning is printed once per rank)
Loading pretrained weights from Hugging Face hub (timm/coatnet_2_rw_224.sw_in12k_ft_in1k)
filename:pytorch_model.bin
Loading pretrained weights from Hugging Face hub (timm/coatnet_2_rw_224.sw_in12k_ft_in1k)
filename:pytorch_model.bin
Loading pretrained weights from Hugging Face hub (timm/coatnet_2_rw_224.sw_in12k_ft_in1k)
filename:pytorch_model.bin
Loading pretrained weights from Hugging Face hub (timm/coatnet_2_rw_224.sw_in12k_ft_in1k)
filename:pytorch_model.bin
Model coatnet_2_rw_224_sw_in12k_ft_in1k created, param count:73868400
Data processing configuration for current model + dataset:
        input_size: (3, 224, 224)
        interpolation: bicubic
        mean: (0.5, 0.5, 0.5)
        std: (0.5, 0.5, 0.5)
        crop_pct: 0.95
        crop_mode: center
Converted model to use Synchronized BatchNorm. WARNING: You may have issues if using zero initialized BN layers (enabled by default for ResNets) while sync-bn enabled.
Learning rate (0.05) calculated from base learning rate (0.1) and global batch size (128) with linear scaling.
Using native Torch AMP. Training in mixed precision.
/home/incar/miniconda3/envs/py3.8/lib/python3.8/site-packages/torch/autograd/__init__.py:173: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed.  This is not an error, but may impair performance.
grad.sizes() = [1024, 4096, 1, 1], strides() = [4096, 1, 4096, 4096]
bucket_view.sizes() = [1024, 4096, 1, 1], strides() = [4096, 1, 1, 1] (Triggered internally at /opt/conda/conda-bld/pytorch_1659484810403/work/torch/csrc/distributed/c10d/reducer.cpp:312.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
(the same warning is printed once per rank)

Train: 0 [   0/2816 (  0%)]  Loss: 9.223 (9.22)  Time: 8.691s,   14.73/s  (8.691s,   14.73/s)  LR: 1.000e-05  Data: 0.493 (0.493)
Reducer buckets have been rebuilt in this iteration.
Reducer buckets have been rebuilt in this iteration.
Reducer buckets have been rebuilt in this iteration.
Reducer buckets have been rebuilt in this iteration.
Train: 0 [  50/2816 (  2%)]  Loss: 6.922 (8.07)  Time: 0.297s,  431.38/s  (0.459s,  278.67/s)  LR: 1.000e-05  Data: 0.004 (0.014)
Train: 0 [ 100/2816 (  4%)]  Loss: 6.903 (7.68)  Time: 0.278s,  460.46/s  (0.377s,  339.90/s)  LR: 1.000e-05  Data: 0.004 (0.009)
Train: 0 [ 150/2816 (  5%)]  Loss: 6.834 (7.47)  Time: 0.297s,  431.31/s  (0.347s,  369.32/s)  LR: 1.000e-05  Data: 0.004 (0.007)
Train: 0 [ 200/2816 (  7%)]  Loss: 6.836 (7.34)  Time: 0.299s,  428.48/s  (0.333s,  384.48/s)  LR: 1.000e-05  Data: 0.004 (0.006)
Train: 0 [ 250/2816 (  9%)]  Loss: 6.842 (7.26)  Time: 0.292s,  437.72/s  (0.325s,  393.64/s)  LR: 1.000e-05  Data: 0.003 (0.006)
Train: 0 [ 300/2816 ( 11%)]  Loss: 6.789 (7.19)  Time: 0.274s,  467.43/s  (0.319s,  400.71/s)  LR: 1.000e-05  Data: 0.004 (0.005)

To Reproduce
Steps to reproduce the behavior:

  1. Run the command above.

Expected behavior
I want training to continue, rather than the loss becoming NaN and the run exiting.

Screenshots
(screenshot attached)


rwightman commented 1 year ago

@tms2003 As the message says, those reducer / shape warnings are not errors. It's a bit annoying that they haven't been removed, since they show up in a LOT of workloads... ignore them.

The NaN is expected with that command line; you should use gradient clipping, AdamW, and a much lower learning rate than the defaults.
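
A minimal sketch of such a command, assuming the same 4-GPU setup as above (--opt, --lr, --warmup-epochs, --clip-grad, and --clip-mode are existing timm train.py arguments; the specific values here are illustrative starting points, not taken from this thread):

./distributed_train.sh 4 /data/test/ --cutmix 0.5 --mixup-prob 0.7 \
  --model coatnet_2_rw_224.sw_in12k_ft_in1k --pretrained --amp --epochs 30 --batch-size 32 \
  --sync-bn --model-ema --model-ema-decay 0.99 \
  --opt adamw --lr 1e-4 --warmup-epochs 3 --clip-grad 1.0 --clip-mode norm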