open-mmlab / mmengine

OpenMMLab Foundational Library for Training Deep Learning Models
https://mmengine.readthedocs.io/
Apache License 2.0

[Bug] When using FSDP, an error occurs while building the optimizer #1032

Open jsrdcht opened 1 year ago

jsrdcht commented 1 year ago


Environment

OrderedDict([('sys.platform', 'linux'), ('Python', '3.9.16 (main, Mar 8 2023, 14:00:05) [GCC 11.2.0]'), ('CUDA available', True), ('numpy_random_seed', 2147483648), ('GPU 0,1', 'NVIDIA GeForce RTX 2080 Ti'), ('CUDA_HOME', None), ('GCC', 'gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0'), ('PyTorch', '1.13.0+cu117'), ('PyTorch compiling details', 'PyTorch built with:\n - GCC 9.3\n - C++ Version: 201402\n - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications\n - Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)\n - OpenMP 201511 (a.k.a. OpenMP 4.5)\n - LAPACK is enabled (usually provided by MKL)\n - NNPACK is enabled\n - CPU capability usage: AVX2\n - CUDA Runtime 11.7\n - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86\n - CuDNN 8.5\n - Magma 2.6.1\n - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.7, CUDNN_VERSION=8.5.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.13.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, \n'), ('TorchVision', '0.14.0+cu117'), ('OpenCV', '4.7.0'), ('MMEngine', '0.7.0')])

Reproduces the problem - code sample

_base_ = [
    '../_base_/datasets/iBioHash.py',
    '../_base_/schedules/imagenet_bs2048_AdamW.py',
    '../_base_/default_runtime.py'
]

model = dict(
    type='ImageClassifier',
    backbone=dict(
        type='VisionTransformer',
        arch='b',
        img_size=224,
        patch_size=16,
        drop_rate=0.1,
        init_cfg=dict(
            type='Pretrained',
            checkpoint="https://download.openmmlab.com/mmclassification/v0/vit/pretrain/vit-base-p16_3rdparty_pt-64xb64_in1k-224_20210928-02284250.pth",
            prefix='backbone')
        ),
    neck=None,
    head=dict(
        type='GreedyHashHead',
        bit=48,
        num_classes=1000,
        alpha=0.01,
        in_channels=768,
        loss=dict(type='CrossEntropyLoss', loss_weight=1.0)
    ))

# NOTE: adding the following line causes the error
model_wrapper_cfg = dict(type='MMFullyShardedDataParallel', cpu_offload=False)

# dataset settings
train_loader = dict(
    batch_size=32,
    num_workers=4,
)

# schedule settings
optim_wrapper = dict(
    optimizer=dict(type='AdamW', lr=5e-4, weight_decay=0.3),
    clip_grad=dict(max_norm=1.0),
    # specific to vit pretrain
    paramwise_cfg=dict(custom_keys={
        '.cls_token': dict(decay_mult=0.0),
        '.pos_embed': dict(decay_mult=0.0)
    }),
)

# learning policy
warmup_epochs = 15  # about 10000 iterations for ImageNet-1k
param_scheduler = [
    # warm up learning rate scheduler
    dict(
        type='LinearLR',
        start_factor=1e-3,
        by_epoch=True,
        end=warmup_epochs,
        # update by iter
        convert_to_iter_based=True),
    # main learning rate scheduler
    dict(
        type='CosineAnnealingLR',
        eta_min=1e-5,
        by_epoch=True,
        begin=warmup_epochs)
]

# train, val, test setting
train_cfg = dict(by_epoch=True, max_epochs=200)
val_cfg = None
test_cfg = None

# NOTE: `auto_scale_lr` is for automatically scaling LR,
# based on the actual training batch size.
auto_scale_lr = dict(base_batch_size=64)

Reproduces the problem - command or script

bash ./tools/dist_train.sh /home/ct/code/fgvc/mmclassification/configs/fgvc/greedyhash_vit-base-p16_pt-64xb64_iBioHash1k-224.py 2 \
    --work-dir './results/baseline' --no-validate --amp --cfg-options seed=42+deterministic

Reproduces the problem - error message

Traceback (most recent call last):
  File "/home/ct/anaconda3/envs/pytorch_mm2/lib/python3.9/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/home/ct/anaconda3/envs/pytorch_mm2/lib/python3.9/site-packages/torch/optim/adamw.py", line 92, in __init__
    super(AdamW, self).__init__(params, defaults)
  File "/home/ct/anaconda3/envs/pytorch_mm2/lib/python3.9/site-packages/torch/optim/optimizer.py", line 61, in __init__
    raise ValueError("optimizer got an empty parameter list")
ValueError: optimizer got an empty parameter list

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ct/code/fgvc/mmclassification/./tools/train.py", line 162, in <module>
    main()
  File "/home/ct/code/fgvc/mmclassification/./tools/train.py", line 158, in main
    runner.train()
  File "/home/ct/anaconda3/envs/pytorch_mm2/lib/python3.9/site-packages/mmengine/runner/runner.py", line 1672, in train
    self.optim_wrapper = self.build_optim_wrapper(self.optim_wrapper)
  File "/home/ct/anaconda3/envs/pytorch_mm2/lib/python3.9/site-packages/mmengine/runner/runner.py", line 1085, in build_optim_wrapper
    return build_optim_wrapper(self.model, optim_wrapper)
  File "/home/ct/anaconda3/envs/pytorch_mm2/lib/python3.9/site-packages/mmengine/optim/optimizer/builder.py", line 113, in build_optim_wrapper
    optim_wrapper = optim_wrapper_constructor(model)
  File "/home/ct/anaconda3/envs/pytorch_mm2/lib/python3.9/site-packages/mmengine/optim/optimizer/default_constructor.py", line 305, in __call__
    optimizer = OPTIMIZERS.build(optimizer_cfg)
  File "/home/ct/anaconda3/envs/pytorch_mm2/lib/python3.9/site-packages/mmengine/registry/registry.py", line 548, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/home/ct/anaconda3/envs/pytorch_mm2/lib/python3.9/site-packages/mmengine/registry/build_functions.py", line 135, in build_from_cfg
    raise type(e)(
ValueError: class `AdamW` in torch/optim/adamw.py: optimizer got an empty parameter list
03/30 16:47:44 - mmengine - WARNING - The "data sampler" registry in mmcls did not set import location. Fallback to call `mmcls.utils.register_all_modules` instead.
03/30 16:47:44 - mmengine - WARNING - The "optimizer wrapper constructor" registry in mmcls did not set import location. Fallback to call `mmcls.utils.register_all_modules` instead.
03/30 16:47:44 - mmengine - WARNING - The "optimizer" registry in mmcls did not set import location. Fallback to call `mmcls.utils.register_all_modules` instead.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 45740) of binary: /home/ct/anaconda3/envs/pytorch_mm2/bin/python
Traceback (most recent call last):
  File "/home/ct/anaconda3/envs/pytorch_mm2/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ct/anaconda3/envs/pytorch_mm2/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ct/anaconda3/envs/pytorch_mm2/lib/python3.9/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/home/ct/anaconda3/envs/pytorch_mm2/lib/python3.9/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/home/ct/anaconda3/envs/pytorch_mm2/lib/python3.9/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/home/ct/anaconda3/envs/pytorch_mm2/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/ct/anaconda3/envs/pytorch_mm2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ct/anaconda3/envs/pytorch_mm2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
./tools/train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-03-30_16:47:47
  host      : ct-ESC4000A-E10
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 45741)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-30_16:47:47
  host      : ct-ESC4000A-E10
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 45740)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Additional information

I would like to be able to launch training with the same command after adding the FSDP configuration line to the config, and to have training proceed in FSDP mode.
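
For reference, the failure looks consistent with the optimizer being built from the inner module of the FSDP wrapper after its original parameters have already been flattened away. The following is only a minimal sketch of that suspicion, written outside mmengine; it assumes a single CUDA GPU and sets up a one-process NCCL group purely for illustration, so it mirrors the symptom rather than the exact mmengine code path.

import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Single-process setup purely for illustration (assumes one CUDA GPU).
os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
os.environ.setdefault('MASTER_PORT', '29500')
dist.init_process_group('nccl', rank=0, world_size=1)
torch.cuda.set_device(0)

model = nn.Linear(16, 16).cuda()
fsdp_model = FSDP(model)

# FSDP flattens the original parameters into sharded flat parameters, so the
# unwrapped inner module may expose no parameters outside a forward pass.
inner = fsdp_model.module
print(sum(1 for _ in inner.parameters()))       # can be 0 after flattening
print(sum(1 for _ in fsdp_model.parameters()))  # the flat parameter(s)

# Building an optimizer from the inner module then fails the same way:
# torch.optim.AdamW(inner.parameters(), lr=5e-4)  # ValueError: empty parameter list

dist.destroy_process_group()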

HAOCHENYE commented 1 year ago

Sorry for the poor experience with FSDP. We are currently doing some refactoring to better support large-scale training strategies such as FSDP, DeepSpeed, etc. We will create a development branch this month to experimentally support different types of large-scale training strategies, and we believe you will be able to try them soon :laughing:.
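
The rough shape we have in mind is a pluggable strategy entry handled by a new runner, so that FSDP wrapping and optimizer construction happen in the right order. A very rough sketch of the intended usage follows; the class and argument names are tentative and may change before release, so please check the documentation once the development branch lands.

from mmengine.runner import FlexibleRunner

# Tentative sketch only: 'FSDPStrategy' and the argument names below are
# placeholders and may differ in the released API.
strategy = dict(
    type='FSDPStrategy',
    model_wrapper=dict(cpu_offload=False),
)

runner = FlexibleRunner(
    model=model,                        # the model config dict from above
    work_dir='./results/baseline',
    strategy=strategy,
    optim_wrapper=optim_wrapper,        # the optim_wrapper config from above
    train_dataloader=train_dataloader,  # placeholder for the dataset config
    train_cfg=dict(by_epoch=True, max_epochs=200),
)
runner.train()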

flstahl commented 1 year ago

> Sorry for the poor experience with FSDP. We are currently doing some refactoring to better support large-scale training strategies such as FSDP, DeepSpeed, etc. We will create a development branch this month to experimentally support different types of large-scale training strategies, and we believe you will be able to try them soon 😆.

Hi, and thanks for your support on these bugs! Following up on this: I am experiencing the same issue when trying to wrap my BEVFusion model with FSDP. Is there any news on this?