open-mmlab / mmcv

OpenMMLab Computer Vision Foundation
https://mmcv.readthedocs.io/en/latest/
Apache License 2.0

Error in AMD GPU 6800xt(gfx1030) Rocm5.2.1 using mmcv 2.0.0rc1 #2312

Open zRzRzRzRzRzRzR opened 2 years ago

zRzRzRzRzRzRzR commented 2 years ago

Checklist

  1. I have searched related issues #1394 but cannot get the expected help.
  2. I have read the FAQ documentation but cannot get the expected help.
  3. The unexpected results still exist in the latest version: mmcv 2.0.0.rc1

Describe the Issue

I created a fresh environment to set up MMYOLO and configured it according to the documentation. When I run the demo program with the device set to cuda, the following error is reported.

1 What command, code, or script did you run?

python demo/image_demo.py demo/demo.jpg \
                          yolov5_s-v61_syncbn_fast_8xb16-300e_coco.py \
                          yolov5_s-v61_syncbn_fast_8xb16-300e_coco_20220918_084700-86e02187.pth \
                          --device cuda \
                          --out-file result.jpg
  2. Did you make any modifications to the code? Did you understand what you have modified? I only changed "cpu" to "cuda".

Environment

  1. my environment is below
    
    sys.platform: linux
    Python: 3.10.4 (main, Jun 29 2022, 12:14:53) [GCC 11.2.0]
    CUDA available: True
    numpy_random_seed: 2147483648
    GPU 0: AMD Radeon RX 6800 XT
    CUDA_HOME: /opt/rocm-5.2.1
    NVCC: Not Available
    GCC: x86_64-linux-gnu-gcc (Ubuntu 11.2.0-19ubuntu1) 11.2.0
    PyTorch: 1.12.1+rocm5.1.1
    PyTorch compiling details: PyTorch built with:
    - GCC 7.3
    - C++ Version: 201402
    - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
    - Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
    - OpenMP 201511 (a.k.a. OpenMP 4.5)
    - LAPACK is enabled (usually provided by MKL)
    - NNPACK is enabled
    - CPU capability usage: AVX2
    - HIP Runtime 5.1.20531
    - MIOpen 2.16.0
    - Magma 2.6.1
    - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.1, USE_CUDA=OFF, USE_CUDNN=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=ON, 

    TorchVision: 0.13.1+rocm5.1.1
    OpenCV: 4.6.0
    MMEngine: 0.1.0
    MMCV: 2.0.0rc1
    MMDetection: 3.0.0rc1
    MMYOLO: 0.1.1+

  2. You may add additional information that may be helpful for locating the problem.

I installed PyTorch using pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/rocm5.1.1, and I tested it with the yolov5 and yolov7 source code; both run normally, including training and detection.

**Error traceback**

Traceback (most recent call last):
  File "/media/zr/Data/MMLAB_2.0/mmyolo/demo/image_demo.py", line 61, in <module>
    main(args)
  File "/media/zr/Data/MMLAB_2.0/mmyolo/demo/image_demo.py", line 43, in main
    result = inference_detector(model, args.img)
  File "/media/zr/Data/MMLAB_2.0/venv/lib/python3.10/site-packages/mmdet/apis/inference.py", line 152, in inference_detector
    results = model.test_step(data)[0]
  File "/media/zr/Data/MMLAB_2.0/venv/lib/python3.10/site-packages/mmengine/model/base_model/base_model.py", line 145, in test_step
    return self._run_forward(data, mode='predict')  # type: ignore
  File "/media/zr/Data/MMLAB_2.0/venv/lib/python3.10/site-packages/mmengine/model/base_model/base_model.py", line 298, in _run_forward
    results = self(**data, mode=mode)
  File "/media/zr/Data/MMLAB_2.0/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/media/zr/Data/MMLAB_2.0/venv/lib/python3.10/site-packages/mmdet/models/detectors/base.py", line 94, in forward
    return self.predict(inputs, data_samples)
  File "/media/zr/Data/MMLAB_2.0/venv/lib/python3.10/site-packages/mmdet/models/detectors/single_stage.py", line 110, in predict
    results_list = self.bbox_head.predict(
  File "/media/zr/Data/MMLAB_2.0/venv/lib/python3.10/site-packages/mmdet/models/dense_heads/base_dense_head.py", line 196, in predict
    predictions = self.predict_by_feat(
  File "/media/zr/Data/MMLAB_2.0/mmyolo/mmyolo/models/dense_heads/yolov5_head.py", line 406, in predict_by_feat
    results = self._bbox_post_process(
  File "/media/zr/Data/MMLAB_2.0/venv/lib/python3.10/site-packages/mmdet/models/dense_heads/base_dense_head.py", line 478, in _bbox_post_process
    det_bboxes, keep_idxs = batched_nms(bboxes, results.scores,
  File "/media/zr/Data/MMLAB_2.0/venv/lib/python3.10/site-packages/mmcv/ops/nms.py", line 334, in batched_nms
    dets, keep = nms_op(boxes_for_nms, scores, **nms_cfg_)
  File "/media/zr/Data/MMLAB_2.0/venv/lib/python3.10/site-packages/mmengine/utils/misc.py", line 351, in new_func
    output = old_func(*args, **kwargs)
  File "/media/zr/Data/MMLAB_2.0/venv/lib/python3.10/site-packages/mmcv/ops/nms.py", line 159, in nms
    inds = NMSop.apply(boxes, scores, iou_threshold, offset, score_threshold,
  File "/media/zr/Data/MMLAB_2.0/venv/lib/python3.10/site-packages/mmcv/ops/nms.py", line 27, in forward
    inds = ext_module.nms(
RuntimeError: nms_impl: implementation for device cuda:0 not found.

Exception raised from Dispatch at /tmp/mmcv/mmcv/ops/csrc/common/pytorch_device_registry.hpp:122 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fbae3043ab2 in /media/zr/Data/MMLAB_2.0/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7fbae304014b in /media/zr/Data/MMLAB_2.0/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: nms_impl(at::Tensor, at::Tensor, float, int) + 0xa97 (0x7fb9b64e2847 in /media/zr/Data/MMLAB_2.0/venv/lib/python3.10/site-packages/mmcv/_ext.cpython-310-x86_64-linux-gnu.so)
frame #3: nms(at::Tensor, at::Tensor, float, int) + 0x4f (0x7fb9b64e2fcf in /media/zr/Data/MMLAB_2.0/venv/lib/python3.10/site-packages/mmcv/_ext.cpython-310-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x12515b (0x7fb9b652515b in /media/zr/Data/MMLAB_2.0/venv/lib/python3.10/site-packages/mmcv/_ext.cpython-310-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x11224f (0x7fb9b651224f in /media/zr/Data/MMLAB_2.0/venv/lib/python3.10/site-packages/mmcv/_ext.cpython-310-x86_64-linux-gnu.so)
<omitting python frames>
frame #10: THPFunction_apply(_object*, _object*) + 0xb57 (0x7fbb5ef24937 in /media/zr/Data/MMLAB_2.0/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #57: <unknown function> + 0x29d90 (0x7fbb82629d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #58: __libc_start_main + 0x80 (0x7fbb82629e40 in /lib/x86_64-linux-gnu/libc.so.6)

For now I can only train and validate on the CPU. What should I do?

HAOCHENYE commented 2 years ago

Hi! How did you install MMCV 2.0.0rc1? We do not provide a pre-built package for ROCm, so you need to compile MMCV 2.0 from source.
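One way to see what kind of mmcv build is currently installed is to ask the package itself. The sketch below is only an illustration: `mmcv_ops_status` is a hypothetical helper name, and it relies on `mmcv.ops.get_compiler_version()` and `mmcv.ops.get_compiling_cuda_version()`. What the second call reports on a ROCm build is not confirmed anywhere in this thread, so treat the output as a hint rather than proof.

```python
import importlib
import importlib.util


def mmcv_ops_status():
    """Describe how the mmcv visible to this interpreter was built."""
    if importlib.util.find_spec("mmcv") is None:
        return "mmcv is not installed for this interpreter"
    try:
        # mmcv.ops loads the compiled `_ext` module, so this import fails
        # when mmcv was installed without compiled ops.
        ops = importlib.import_module("mmcv.ops")
    except Exception as exc:
        return f"mmcv is installed but its compiled ops failed to import: {exc}"
    return (
        f"compiler: {ops.get_compiler_version()}, "
        f"GPU toolkit seen at build time: {ops.get_compiling_cuda_version()}"
    )


if __name__ == "__main__":
    print(mmcv_ops_status())
```

If the ops import succeeds but no GPU toolkit is reported, the package was probably built with CPU kernels only, which would match the "implementation for device cuda:0 not found" error.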

zRzRzRzRzRzRzR commented 2 years ago

Thank you for your reply. I found the mmcv 2.0.0rc1 source code on the "Releases" page, released on Aug 31, 2022, and tried to build it using the method from issue #1394:

(venv) /media/zr/Data/MMLAB_2.0/mmcv-2.0.0rc1 MMCV_WITH_OPS=1 pip install -e .
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Obtaining file:///media/zr/Data/MMLAB_2.0/mmcv-2.0.0rc1
  Preparing metadata (setup.py) ... done
Requirement already satisfied: addict in /media/zr/Data/MMLAB_2.0/venv/lib/python3.10/site-packages (from mmcv==2.0.0rc1) (2.4.0)
Requirement already satisfied: mmengine in /media/zr/Data/MMLAB_2.0/venv/lib/python3.10/site-packages (from mmcv==2.0.0rc1) (0.1.0)
Requirement already satisfied: numpy in /media/zr/Data/MMLAB_2.0/venv/lib/python3.10/site-packages (from mmcv==2.0.0rc1) (1.23.3)
Requirement already satisfied: packaging in /media/zr/Data/MMLAB_2.0/venv/lib/python3.10/site-packages (from mmcv==2.0.0rc1) (21.3)
Requirement already satisfied: Pillow in /media/zr/Data/MMLAB_2.0/venv/lib/python3.10/site-packages (from mmcv==2.0.0rc1) (9.2.0)
Requirement already satisfied: pyyaml in /media/zr/Data/MMLAB_2.0/venv/lib/python3.10/site-packages (from mmcv==2.0.0rc1) (6.0)
Requirement already satisfied: yapf in /media/zr/Data/MMLAB_2.0/venv/lib/python3.10/site-packages (from mmcv==2.0.0rc1) (0.32.0)
Requirement already satisfied: matplotlib in /media/zr/Data/MMLAB_2.0/venv/lib/python3.10/site-packages (from mmengine->mmcv==2.0.0rc1) (3.6.0)
Requirement already satisfied: opencv-python>=3 in /media/zr/Data/MMLAB_2.0/venv/lib/python3.10/site-packages (from mmengine->mmcv==2.0.0rc1) (4.6.0.66)
Requirement already satisfied: termcolor in /media/zr/Data/MMLAB_2.0/venv/lib/python3.10/site-packages (from mmengine->mmcv==2.0.0rc1) (2.0.1)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /media/zr/Data/MMLAB_2.0/venv/lib/python3.10/site-packages (from packaging->mmcv==2.0.0rc1) (3.0.9)
Requirement already satisfied: fonttools>=4.22.0 in /media/zr/Data/MMLAB_2.0/venv/lib/python3.10/site-packages (from matplotlib->mmengine->mmcv==2.0.0rc1) (4.37.4)
Requirement already satisfied: kiwisolver>=1.0.1 in /media/zr/Data/MMLAB_2.0/venv/lib/python3.10/site-packages (from matplotlib->mmengine->mmcv==2.0.0rc1) (1.4.4)
Requirement already satisfied: python-dateutil>=2.7 in /media/zr/Data/MMLAB_2.0/venv/lib/python3.10/site-packages (from matplotlib->mmengine->mmcv==2.0.0rc1) (2.8.2)
Requirement already satisfied: cycler>=0.10 in /media/zr/Data/MMLAB_2.0/venv/lib/python3.10/site-packages (from matplotlib->mmengine->mmcv==2.0.0rc1) (0.11.0)
Requirement already satisfied: contourpy>=1.0.1 in /media/zr/Data/MMLAB_2.0/venv/lib/python3.10/site-packages (from matplotlib->mmengine->mmcv==2.0.0rc1) (1.0.5)
Requirement already satisfied: six>=1.5 in /media/zr/Data/MMLAB_2.0/venv/lib/python3.10/site-packages (from python-dateutil>=2.7->matplotlib->mmengine->mmcv==2.0.0rc1) (1.16.0)
Installing collected packages: mmcv
  Running setup.py develop for mmcv
    error: subprocess-exited-with-error

    × python setup.py develop did not run successfully.
    │ exit code: 1
    ╰─> [1064 lines of output]

I cannot fall back to an older version of mmcv, because mmyolo requires mmcv 2.0.0rc1. Release 1.6.2 (the latest 1.x release) has the same problem, and compiling from source does not solve it. The error is below.

 /media/zr/Data/MMLAB_2.0/venv/lib/python3.10/site-packages/torch/include/c10/util/complex.h:8:10: fatal error: 'thrust/complex.h' file not found
    #include <thrust/complex.h>
             ^~~~~~~~~~~~~~~~~~
    26 warnings and 1 error generated when compiling for gfx1030.
    error: command '/opt/rocm-5.2.1/bin/hipcc' failed with exit code 1

What should I do?
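The compile error above says the compiler could not find `thrust/complex.h` on its include path. A quick check is to search the ROCm installation for that header. The sketch below assumes the ROCm root is `/opt/rocm-5.2.1` (taken from the environment report) unless `ROCM_HOME` is set; `find_header` is a hypothetical helper, and the remark about rocThrust is an assumption about typical ROCm installs, not something confirmed in this thread.

```python
import os
from pathlib import Path


def find_header(root, relative="thrust/complex.h"):
    """Return every file under `root` whose path ends with `relative`."""
    root_path = Path(root)
    if not root_path.is_dir():
        return []
    wanted = Path(relative).parts
    return sorted(
        p for p in root_path.rglob(wanted[-1])
        if p.is_file() and p.parts[-len(wanted):] == wanted
    )


if __name__ == "__main__":
    # Assumption: fall back to the ROCm path shown in the environment report.
    rocm_home = os.environ.get("ROCM_HOME", "/opt/rocm-5.2.1")
    hits = find_header(rocm_home)
    if hits:
        for hit in hits:
            print("found:", hit)
    else:
        # On ROCm the Thrust headers usually come from the rocThrust package.
        print(f"thrust/complex.h not found under {rocm_home}")
```

If nothing is found, the header is missing from the installation. If it is found, the directory that contains `thrust/` is probably not being passed to hipcc as an include path.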

HAOCHENYE commented 2 years ago

Hi, sorry for my late reply. You need to compile mmcv like this:

MMCV_WITH_OPS=1 ROCM_HOME=/opt/rocm-4.0.0 python3 setup.py install

where ROCM_HOME is the local path to your ROCm environment.

zRzRzRzRzRzRzR commented 2 years ago

Describe the Issue

Following the method you provided does not solve the problem. The implementation for cuda:0 cannot be found in either the virtual environment or the system (non-virtual) environment. My guess is that the problem lies in how the CUDA operator is called in the _ext module. The error traceback is the same as before.

Error traceback

RuntimeError: nms_impl: implementation for device cuda:0 not found.

Exception raised from Dispatch at /tmp/mmcv/mmcv/ops/csrc/common/pytorch_device_registry.hpp:122 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f2684e43ab2 in /home/zr/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f2684e4014b in /home/zr/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: nms_impl(at::Tensor, at::Tensor, float, int) + 0xa97 (0x7f25534e2847 in /home/zr/.local/lib/python3.10/site-packages/mmcv/_ext.cpython-310-x86_64-linux-gnu.so)
frame #3: nms(at::Tensor, at::Tensor, float, int) + 0x4f (0x7f25534e2fcf in /home/zr/.local/lib/python3.10/site-packages/mmcv/_ext.cpython-310-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x12515b (0x7f255352515b in /home/zr/.local/lib/python3.10/site-packages/mmcv/_ext.cpython-310-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x11224f (0x7f255351224f in /home/zr/.local/lib/python3.10/site-packages/mmcv/_ext.cpython-310-x86_64-linux-gnu.so)
<omitting python frames>
frame #10: THPFunction_apply(_object*, _object*) + 0xb57 (0x7f2700d24937 in /home/zr/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #57: <unknown function> + 0x29d90 (0x7f2723c29d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #58: __libc_start_main + 0x80 (0x7f2723c29e40 in /lib/x86_64-linux-gnu/libc.so.6)

I can't downgrade ROCm to a lower version. The version I'm using is ROCm 5.2.1, so I can't tell whether this is a version problem.

zRzRzRzRzRzRzR commented 2 years ago

The compilation process is shown in this message, and I think it was done successfully.

sudo MMCV_WITH_OPS=1 ROCM_HOME=/opt/rocm-5.2.1 python3 setup.py install 
[sudo] password for zr: 
Skip building ext ops due to the absence of torch.
running install
/usr/lib/python3/dist-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
/usr/lib/python3/dist-packages/setuptools/command/easy_install.py:158: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 1.16.0-unknown is an invalid version and will not be supported in a future release
  warnings.warn(
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
  warnings.warn(
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 0.1.43ubuntu1 is an invalid version and will not be supported in a future release
  warnings.warn(
running bdist_egg
running egg_info
writing mmcv.egg-info/PKG-INFO
writing dependency_links to mmcv.egg-info/dependency_links.txt
writing requirements to mmcv.egg-info/requires.txt
writing top-level names to mmcv.egg-info/top_level.txt
reading manifest file 'mmcv.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
adding license file 'LICENSE'

...

Using /usr/local/lib/python3.10/dist-packages/cycler-0.11.0-py3.10.egg
Searching for contourpy==1.0.5
Best match: contourpy 1.0.5
Processing contourpy-1.0.5-py3.10-linux-x86_64.egg
contourpy 1.0.5 is already the active version in easy-install.pth

Using /usr/local/lib/python3.10/dist-packages/contourpy-1.0.5-py3.10-linux-x86_64.egg
Finished processing dependencies for mmcv==2.0.0rc1

The mmcv python package can also be found successfully in the local environment.

pip list | grep mmcv
mmcv                           2.0.0rc1         /home/zr/.local/lib/python3.10/site-packages

HAOCHENYE commented 2 years ago

According to the line "Skip building ext ops due to the absence of torch." in your log, the ops were not built because torch could not be found.
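A possible explanation, offered as a guess rather than a diagnosis: the build was started with `sudo python3`, and the log shows packages being read from `/usr/lib/python3/dist-packages`, while torch lives under `/home/zr/.local/lib/python3.10/site-packages`. An interpreter running as root would not normally see that user-level directory. The sketch below (with the hypothetical helper `build_environment_report`) prints which interpreter is running and whether it can locate torch, without importing it.

```python
import importlib.util
import os
import sys


def build_environment_report(module="torch"):
    """Report the running interpreter and where it would load `module` from."""
    spec = importlib.util.find_spec(module)
    return {
        "interpreter": sys.executable,
        # geteuid is unavailable on Windows, hence the guard.
        "effective_uid": os.geteuid() if hasattr(os, "geteuid") else None,
        "module": module,
        # None means this interpreter cannot import the module at all.
        "module_location": spec.origin if spec is not None else None,
    }


if __name__ == "__main__":
    for key, value in build_environment_report().items():
        print(f"{key}: {value}")
```

Saving this under a name such as `check_env.py` (hypothetical) and running it once with `python3` and once with `sudo python3` should show whether the two invocations resolve torch differently. If they do, building inside the virtual environment without sudo is the likely remedy.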

zRzRzRzRzRzRzR commented 2 years ago

Maybe it's the Python version or something similar. PyTorch is installed in my local environment and works fine, so this still confuses me.

~ pip list | grep torch            
torch                          1.12.1+rocm5.1.1
torchaudio                     0.12.1+rocm5.1.1
torchvision                    0.13.1+rocm5.1.1
~ python 
Python 3.10.6 (main, Aug 10 2022, 11:40:04) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.get_device_properties(torch.device('cuda:0'))
_CudaDeviceProperties(name='AMD Radeon RX 6800 XT', major=10, minor=3, total_memory=16368MB, multi_processor_count=36)
>>> torch.cuda.is_available()
True
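`torch.cuda.is_available()` returning True only shows that PyTorch itself can reach the GPU. It does not show what an extension build would detect. As a rough sketch, the values below are the ones PyTorch exposes about its own ROCm support; `rocm_build_info` is a hypothetical helper, and the exact condition mmcv's setup.py checks is not shown in this thread, so this is only an approximation of what a build script might look at.

```python
import importlib.util


def rocm_build_info():
    """Collect what PyTorch reports about its ROCm/HIP support."""
    if importlib.util.find_spec("torch") is None:
        return {"torch": None}
    import torch

    try:
        # ROCM_HOME is resolved by PyTorch's extension-building utilities.
        from torch.utils.cpp_extension import ROCM_HOME as rocm_home
    except ImportError:
        rocm_home = None
    return {
        "torch": torch.__version__,
        # A version string on ROCm builds of PyTorch, None otherwise.
        "hip": getattr(torch.version, "hip", None),
        # None if PyTorch could not locate a ROCm installation.
        "rocm_home": rocm_home,
        "gpu_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in rocm_build_info().items():
        print(f"{key}: {value}")
```

If `hip` is set but `rocm_home` is None, exporting `ROCM_HOME` before building, as suggested earlier in the thread, would be the first thing to try.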

zRzRzRzRzRzRzR commented 2 years ago

My guess is that this happens because my torch was installed with pip as a pre-built package from the official website. Next I will try compiling torch from source to see whether that fixes the problem. Thank you for your kind help.

chekistcccp commented 1 year ago

Have you solved this problem? I can't get past it either.

chekistcccp commented 1 year ago

I have the same problem as the author when compiling mmcv from source. The build fails with this error:

#include <thrust/complex.h>
^~~~~~
26 warnings and 1 error generated when compiling for gfx1030.

HAOCHENYE commented 1 year ago

Hi, have you installed ROCm?