open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

Problems of AMP Training about Co-DETR Reimplement #11416

Open ysysys666 opened 8 months ago

ysysys666 commented 8 months ago

Notice

There are several common situations in the reimplementation issues as below

  1. Reimplement a model in the model zoo using the provided configs

Checklist

  1. I have searched related issues but cannot get the expected help.

Describe the issue

Excuse me, does Co-DETR support AMP training? When I reimplement Co-DETR with AMP, I hit the error "RuntimeError: Index put requires the source and destination dtypes match, got Half for the destination and Float for the source". After adding a type conversion, I hit a second error: "matched_row_inds, matched_col_inds = linear_sum_assignment(cost) ValueError: matrix contains invalid numeric entries".
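For readers hitting the second error: SciPy's linear_sum_assignment rejects a cost matrix that contains NaN or -inf. One plausible cause under AMP, not confirmed in this thread, is that half-precision arithmetic overflows while the matching cost is computed. The sketch below is only a workaround under that assumption: it casts the cost to float32 and replaces non-finite entries before matching. The helper name safe_assignment and the value of large are hypothetical and are not part of MMDetection.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def safe_assignment(cost, large=1e5):
    """Hungarian matching on a cost matrix that may hold NaN/inf.

    Hypothetical helper (not an MMDetection API). `large` is an
    assumed penalty that makes invalid pairs very unattractive.
    """
    # Work in float32 so the replacement values are representable.
    cost = np.asarray(cost, dtype=np.float32)
    # Replace NaN and +inf with a large penalty, -inf with its negative.
    cost = np.nan_to_num(cost, nan=large, posinf=large, neginf=-large)
    return linear_sum_assignment(cost)


# Toy cost matrix (2 predictions x 3 ground truths) with invalid entries,
# stored in float16 to mimic what AMP might produce.
cost = np.array([[0.2, np.nan, 0.9],
                 [np.inf, 0.1, 0.4]], dtype=np.float16)
rows, cols = safe_assignment(cost)
print(rows, cols)
```

Replacing the invalid entries only hides the symptom. If the NaN values really come from half-precision overflow, computing the matching cost in float32 in the first place is likely the cleaner fix.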

Reproduction

  1. What command or script did you run?
bash ./tools/dist_train.sh 'mmdetection/projects/CO-DETR/configs/codino/co_dino_5scale_r50_lsj_8xb2_1x_coco.py' 4 --work-dir 'mmdetection/outputs/codetr_5scale_r50_4xb4_12e_coco_results' --amp --auto-scale-lr --launcher 'pytorch'
  2. What config did you run?
mmdetection/projects/CO-DETR/configs/codino/co_dino_5scale_r50_lsj_8xb2_1x_coco.py
  3. Did you make any modifications on the code or config? Did you understand what you have modified?

No

  4. What dataset did you use?

COCO

Environment

  1. Please run python mmdet/utils/collect_env.py to collect necessary environment information and paste it here.

sys.platform: linux
Python: 3.8.18 (default, Sep 11 2023, 13:40:15) [GCC 11.2.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0,1,2,3: NVIDIA GeForce RTX 3090
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.4, V11.4.48
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.12.1
PyTorch compiling details: PyTorch built with:
    • GCC 9.3
    • C++ Version: 201402
    • Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications
    • Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
    • OpenMP 201511 (a.k.a. OpenMP 4.5)
    • LAPACK is enabled (usually provided by MKL)
    • NNPACK is enabled
    • CPU capability usage: AVX2
    • CUDA Runtime 11.3
    • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
    • CuDNN 8.3.2 (built against CUDA 11.5)
    • Magma 2.5.2
    • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

TorchVision: 0.13.1
OpenCV: 4.8.1
MMEngine: 0.10.1
MMDetection: 3.2.0+fe3f809

  2. You may add additional information that may be helpful for locating the problem, such as
    1. How you installed PyTorch [e.g., pip, conda, source]: conda
    2. Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)

Results

If applicable, paste the related results here, e.g., what you expect and what you get.

A placeholder for results comparison

Issue fix

If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

Kobamiyannnn commented 7 months ago

I'm also running into the same issue.

makecent commented 7 months ago

A similar problem occurs when using AmpOptimWrapper with DETR:

  File "/home/louis/miniconda3/envs/mmengine/lib/python3.8/site-packages/mmdet/models/dense_heads/detr_head.py", line 437, in _get_targets_single
    bbox_targets[pos_inds] = pos_gt_bboxes_targets
RuntimeError: Index put requires the source and destination dtypes match, got Half for the destination and Float for the source.
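
The traceback above shows a float32 source being written into a float16 buffer through indexed assignment. A minimal sketch of one possible workaround is to cast the source to the destination dtype before assigning. The helper assign_targets is hypothetical; the variable names only mirror the traceback, and this is not a confirmed upstream fix.

```python
import torch


def assign_targets(bbox_targets, pos_inds, pos_gt_bboxes_targets):
    """Write positive targets into the buffer using the buffer's dtype.

    Hypothetical helper mirroring the line in the traceback; it is
    not the actual MMDetection implementation.
    """
    # Cast the source so that index_put sees matching dtypes.
    bbox_targets[pos_inds] = pos_gt_bboxes_targets.to(bbox_targets.dtype)
    return bbox_targets


# Under AMP the buffer may end up in half precision ...
bbox_targets = torch.zeros(4, 4, dtype=torch.float16)
pos_inds = torch.tensor([0, 2])
# ... while the ground-truth targets stay in float32.
pos_gt = torch.tensor([[0.1, 0.2, 0.3, 0.4],
                       [0.5, 0.6, 0.7, 0.8]], dtype=torch.float32)

out = assign_targets(bbox_targets, pos_inds, pos_gt)
print(out.dtype)
```

An alternative with the same effect is to allocate the target buffer in float32, which keeps the regression targets at full precision.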

Cosmo1210 commented 6 months ago

I got the same issue. Have you solved it?

black-prince222 commented 5 months ago

Got the same issue.

JackeyGuo commented 3 months ago

I got the same issue. Have you solved it?

Helen-Cheung commented 3 months ago

A similar problem occurs when using AmpOptimWrapper with DETR.

caiduoduo12138 commented 1 month ago

Has anyone solved this problem?