microsoft / Cream

This is a collection of our NAS and Vision Transformer work.
MIT License

AttributeError: 'MMDistributedDataParallel' object has no attribute '_use_replicated_tensor_module' #179

Closed xuch98 closed 9 months ago

xuch98 commented 1 year ago

Something goes wrong when I run the command below to train the model on my own dataset:

    bash ./dist_train.sh configs/mask_rcnn_efficientvit_m4_fpn_1x_coco.py 4 --cfg-options model.backbone.pretrained=./runs/efficientvit_m4.pth

All I have done is convert my dataset to COCO format and download the pretrained checkpoint. Here is the detailed output:

2023-06-12 20:57:08,304 - mmdet - INFO - workflow: [('train', 1)], max: 12 epochs
2023-06-12 20:57:08,304 - mmdet - INFO - Checkpoints will be saved to /home/xc/transform/Cream/EfficientViT/downstream/work_dirs/mask_rcnn_efficientvit_m4_fpn_1x_coco by HardDiskBackend.
2023-06-12 20:57:10,480 - mmdet - INFO - Saving checkpoint at 1 epochs
[ ] 0/81, elapsed: 0s, ETA:
Traceback (most recent call last):
  File "/home/xc/transform/Cream/EfficientViT/downstream/./train.py", line 245, in <module>
    main()
  File "/home/xc/transform/Cream/EfficientViT/downstream/./train.py", line 234, in main
    train_detector(
  File "/home/xc/transform/Cream/EfficientViT/downstream/mmdet_custom/apis/train.py", line 184, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/xc/anaconda3/envs/seg/lib/python3.10/site-packages/mmcv/runner/epoch_based_runner.py", line 136, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/xc/anaconda3/envs/seg/lib/python3.10/site-packages/mmcv/runner/epoch_based_runner.py", line 58, in train
    self.call_hook('after_train_epoch')
  File "/home/xc/anaconda3/envs/seg/lib/python3.10/site-packages/mmcv/runner/base_runner.py", line 317, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/xc/anaconda3/envs/seg/lib/python3.10/site-packages/mmcv/runner/hooks/evaluation.py", line 271, in after_train_epoch
    self._do_evaluate(runner)
  File "/home/xc/anaconda3/envs/seg/lib/python3.10/site-packages/mmdet/core/evaluation/eval_hooks.py", line 126, in _do_evaluate
    results = multi_gpu_test(
  File "/home/xc/anaconda3/envs/seg/lib/python3.10/site-packages/mmdet/apis/test.py", line 109, in multi_gpu_test
    result = model(return_loss=False, rescale=True, **data)
  File "/home/xc/anaconda3/envs/seg/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/xc/anaconda3/envs/seg/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1535, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/home/xc/anaconda3/envs/seg/lib/python3.10/site-packages/mmcv/parallel/distributed.py", line 160, in _run_ddp_forward
    self._use_replicated_tensor_module else self.module
  File "/home/xc/anaconda3/envs/seg/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1614, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'MMDistributedDataParallel' object has no attribute '_use_replicated_tensor_module'

xinyuliu-jeffrey commented 1 year ago

Hi @xuch98 ,

This seems to be related to the mmcv or torch version. Could you share your package versions from pip list? Thanks.
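For anyone hitting the same error, a short script can collect just the relevant versions instead of the full pip list. This is a minimal sketch using only the standard library; the list of package names is an assumption based on the packages discussed in this thread, and `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata

# Distribution names as they would appear in `pip list`.
# This selection is an assumption; adjust it to your environment.
PACKAGES = ("torch", "torchvision", "mmcv-full", "mmcv", "mmdet", "mmengine")


def installed_versions(names=PACKAGES):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name:12s} {version or 'not installed'}")
```

Reading the installed distribution metadata avoids importing torch or mmcv, so the script still works when one of them fails to import.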

xuch98 commented 1 year ago

@xinyuliu-jeffrey Thank you for your reply. The version of CUDA is 11.7.

$ pip list
Package                 Version                  Editable project location
absl-py                 1.4.0
addict                  2.4.0
apex                    0.1
beautifulsoup4          4.12.2
bs4                     0.0.1
cachetools              5.3.0
certifi                 2022.12.7
charset-normalizer      2.1.1
click                   8.1.3
colorama                0.4.6
coloredlogs             15.0.1
contourpy               1.0.7
cycler                  0.11.0
filelock                3.9.0
flatbuffers             23.3.3
fonttools               4.39.3
fsspec                  2023.5.0
gitdb                   4.0.10
GitPython               3.1.31
google-auth             2.17.3
google-auth-oauthlib    1.0.0
grpcio                  1.53.0
huggingface-hub         0.14.1
humanfriendly           10.0
idna                    3.4
imageio                 2.28.1
Jinja2                  3.1.2
kiwisolver              1.4.4
lazy_loader             0.2
Markdown                3.4.3
markdown-it-py          2.2.0
MarkupSafe              2.1.2
matplotlib              3.7.1
mdurl                   0.1.2
mmcv-full               1.7.1
mmdet                   2.27.0
mmengine                0.7.3
model-index             0.1.11
mpmath                  1.2.1
networkx                3.0rc1
numpy                   1.24.1
oauthlib                3.2.2
onnx                    1.13.1
onnxruntime             1.14.1
opencv-python           4.7.0.72
openmim                 0.3.7
ordered-set             4.1.0
packaging               23.1
pandas                  2.0.0
Pillow                  9.3.0
pip                     23.1.2
pip-search              0.0.12
protobuf                3.20.3
psutil                  5.9.4
pyasn1                  0.4.8
pyasn1-modules          0.2.8
pycocotools             2.0.6
Pygments                2.15.1
pyparsing               3.0.9
python-dateutil         2.8.2
pytorch-triton          2.1.0+46672772b4
pytz                    2023.3
PyWavelets              1.4.1
PyYAML                  6.0
requests                2.28.1
requests-oauthlib       1.3.1
rich                    13.3.5
rsa                     4.9
safetensors             0.3.1
scikit-image            0.20.0
scipy                   1.10.1
seaborn                 0.12.2
segment-anything        1.0                      /home/xc/transform/segment-anything
setuptools              65.6.3
shapely                 2.0.1
six                     1.16.0
smmap                   5.0.0
soupsieve               2.4.1
sympy                   1.11.1
tabulate                0.9.0
tensorboard             2.12.2
tensorboard-data-server 0.7.0
tensorboard-plugin-wit  1.8.1
termcolor               2.3.0
terminaltables          3.1.10
thop                    0.1.1.post2209072238
tifffile                2023.4.12
timm                    0.9.2
tomli                   2.0.1
torch                   2.1.0.dev20230416+cu117
torchaudio              2.1.0.dev20230416+cu117
torchvision             0.16.0.dev20230416+cu117
tqdm                    4.65.0
typing_extensions       4.4.0
tzdata                  2023.3
urllib3                 1.26.13
Werkzeug                2.2.3
wheel                   0.38.4
yapf                    0.33.0

xinyuliu-jeffrey commented 1 year ago

mmcv does not seem to be compatible with torch>2.0 (as shown here). Please try downgrading torch first (e.g., we used torch==1.11.0, or follow the installation steps for classification here), then install mmcv 1.7 or an earlier version.
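The version condition above can be checked up front with a small helper. This is only a sketch: it reads "torch>2.0" as "release 2.1 or later", which is an assumption made here and not something stated by the maintainers, and `release_tuple` and `newer_than_torch_2_0` are hypothetical names.

```python
import re


def release_tuple(version: str):
    """Parse the leading numeric release, e.g. '2.1.0.dev20230416+cu117' -> (2, 1, 0)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def newer_than_torch_2_0(torch_version: str) -> bool:
    """True for releases 2.1 and later (assumed meaning of 'torch>2.0')."""
    return release_tuple(torch_version)[:2] > (2, 0)


if __name__ == "__main__":
    # Versions taken from this thread.
    for v in ("2.1.0.dev20230416+cu117", "1.11.0"):
        print(v, "->", "likely incompatible" if newer_than_torch_2_0(v) else "ok")
```

Only the numeric release prefix is compared, so nightly builds such as 2.1.0.dev20230416 are treated the same as the 2.1.0 release.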

shiyuan7 commented 5 months ago

Try replacing the following code in "/home/xc/anaconda3/envs/seg/lib/python3.10/site-packages/mmcv/parallel/distributed.py", around line 160, in _run_ddp_forward:

    module_to_run = self._replicated_tensor_module if \
        self._use_replicated_tensor_module else self.module

with:

    module_to_run = self.module
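A slightly more defensive variant of this edit would keep the original branch but fall back to self.module when the attribute is missing. The sketch below illustrates the idea on a stand-in class; `FakeDDP` and `double` are hypothetical and are not part of mmcv or torch, and this has not been tested against the real MMDistributedDataParallel.

```python
class FakeDDP:
    """Hypothetical stand-in for a DDP wrapper; not the real mmcv/torch class."""

    def __init__(self, module):
        self.module = module
        # Deliberately no `_use_replicated_tensor_module` attribute here,
        # mimicking the situation reported in the traceback.

    def _run_ddp_forward(self, *inputs, **kwargs):
        # getattr with a default avoids the AttributeError when the
        # attribute does not exist and falls back to the wrapped module.
        if getattr(self, "_use_replicated_tensor_module", False):
            module_to_run = self._replicated_tensor_module
        else:
            module_to_run = self.module
        return module_to_run(*inputs, **kwargs)


def double(x):
    """Toy 'model' used only for this illustration."""
    return 2 * x


wrapper = FakeDDP(double)
print(wrapper._run_ddp_forward(21))  # prints 42
```

Either form only works around the symptom; matching the torch and mmcv versions as suggested earlier in the thread avoids editing files in site-packages.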