open-mmlab / mmyolo

OpenMMLab YOLO series toolbox and benchmark. Implemented RTMDet, RTMDet-Rotated, YOLOv5, YOLOv6, YOLOv7, YOLOv8, YOLOX, PPYOLOE, etc.
https://mmyolo.readthedocs.io/zh_CN/dev/
GNU General Public License v3.0

(Question) AMD ROCm support? #581

Open MooN-tm opened 1 year ago

MooN-tm commented 1 year ago

Hi, I'm still very new to all of this. After about two months of trying to make use of my 6900 XT and realising that AMD support is not great across the ML/DL ecosystem, I'm trying to figure out what actually works and what doesn't.

I have a working ROCm and PyTorch installation, but I'm not sure it's supposed to work with conda. Selecting PyTorch + ROCm on the PyTorch website says that a conda package is not available, so I am wondering whether it's okay to pip-install it inside the conda env.

In a terminal I get torch.cuda.is_available() = True, but in a VS Code Jupyter notebook using the openmmlab conda env I get False. So I am wondering: am I doing something wrong, or am I just wasting my time because it simply doesn't work?
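A quick sanity check that can be run in both interpreters to compare which torch build each one actually imports (a minimal sketch; torch.version.hip is the field that is non-None on ROCm builds, and the import is guarded so the snippet also runs where torch is absent):

```python
# Quick sanity check: run this in both the system Python and the conda env
# to compare which torch build each interpreter actually imports.
# (Sketch: the import is guarded so the script also runs where torch is absent.)
import importlib.util

if importlib.util.find_spec("torch") is None:
    print("torch is not installed in this interpreter")
else:
    import torch
    print("torch version:", torch.__version__)       # e.g. 1.13.1+rocm5.2
    print("hip runtime:", torch.version.hip)         # non-None on ROCm builds
    print("cuda available:", torch.cuda.is_available())
```

If the two interpreters report different versions (or one reports no torch at all), they are not sharing an installation, which would explain divergent results.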


Ubuntu 22.04.2 LTS
5.19.0-32-generic
Name:           gfx1030
Marketing Name: AMD Radeon RX 6900 XT
user@NZXT-H1-Ub: ~$ python3
Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> 
>>> print(torch.__version__)
1.13.1+rocm5.2
>>> torch.cuda.is_available()
True
(openmmlab) user@NZXT-H1-Ub: ~/mmyolo$ python3
Python 3.8.16 (default, Jan 17 2023, 23:13:24) 
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.__version__)
1.13.1+rocm5.2
>>> torch.cuda.is_available()
True
File ~/miniconda3/envs/openmmlab/lib/python3.8/site-packages/mmdet/apis/inference.py:82, in init_detector(config, checkpoint, palette, device, cfg_options)
     76         model.dataset_meta = {
     77             'classes': get_classes('coco'),
     78             'palette': palette
     79         }
     81 model.cfg = config  # save the config in the model for convenience
---> 82 model.to(device)
     83 model.eval()
     84 return model

File ~/miniconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/model/base_model/base_model.py:202, in BaseModel.to(self, *args, **kwargs)
    200 if device is not None:
    201     self._set_device(torch.device(device))
--> 202 return super().to(*args, **kwargs)

File ~/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/nn/modules/module.py:989, in Module.to(self, *args, **kwargs)
    985         return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None,
...
    222 if _cudart is None:
    223     raise AssertionError(
    224         "libcudart functions unavailable. It looks like you have a broken build?")

AssertionError: Torch not compiled with CUDA enabled

Thank you.

hhaAndroid commented 1 year ago

@MooN-tm Why are the two Python environments different? One is Python 3.8 and the other is Python 3.10.

MooN-tm commented 1 year ago

I'm still not sure about the terminology since I'm quite new to all of this, but 3.10 is the system/root/native(?) one and 3.8 is the one in conda (following the installation guide).

I tested torch.cuda.is_available() in both from the terminal and got True from both. However, when I run the test from VS Code, the CPU case passes, but with device='cuda:0' I get the error AssertionError: Torch not compiled with CUDA enabled, even though AMD's HIPify should have taken care of translating the CUDA calls to AMD's HIP.
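One way to pin down a terminal-vs-notebook discrepancy like this is to print which interpreter the notebook kernel is actually running, since VS Code's Jupyter kernel can silently point at a different Python than the activated conda env. A minimal check:

```python
# Print which Python the current kernel/interpreter runs under; compare
# this against the conda env that holds the ROCm build of torch.
import sys

print("executable:", sys.executable)
print("prefix:    ", sys.prefix)
print("version:   ", sys.version.split()[0])
```

If the executable path does not live under the openmmlab env (e.g. ~/miniconda3/envs/openmmlab/), the notebook is using a different interpreter, which would explain torch.cuda.is_available() returning False there.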


Loads checkpoint by local backend from path: /home/moon_tm/mmyolo/yolov5_s-v61_syncbn_fast_8xb16-300e_coco_20220918_084700-86e02187.pth
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[5], line 9
      7 checkpoint_file = '/home/moon_tm/mmyolo/yolov5_s-v61_syncbn_fast_8xb16-300e_coco_20220918_084700-86e02187.pth'
      8 model = init_detector(config_file, checkpoint_file, device='cuda:0')  # or device='cuda:0'
----> 9 inference_detector(model, '/home/moon_tm/mmyolo/demo/demo.jpg')

File ~/miniconda3/envs/openmmlab/lib/python3.8/site-packages/mmdet/apis/inference.py:150, in inference_detector(model, imgs, test_pipeline)
    148     # forward the model
    149     with torch.no_grad():
--> 150         results = model.test_step(data_)[0]
    152     result_list.append(results)
    154 if not is_batch:

File ~/miniconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/model/base_model/base_model.py:145, in BaseModel.test_step(self, data)
    136 """``BaseModel`` implements ``test_step`` the same as ``val_step``.
    137 
    138 Args:
   (...)
    142     list: The predictions of given data.
    143 """
    144 data = self.data_preprocessor(data, False)
--> 145 return self._run_forward(data, mode='predict')

File ~/miniconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/model/base_model/base_model.py:326, in BaseModel._run_forward(self, data, mode)
...
     30 if max_num > 0:
     31     inds = inds[:max_num]

RuntimeError: nms_impl: implementation for device cuda:0 not found.
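The nms_impl error above comes from mmcv's compiled ops rather than from torch itself: mmcv ships a compiled extension that must be built against the same backend as torch, and a CPU-only wheel registers no CUDA/HIP implementation for ops such as NMS. A quick way to inspect what the installed mmcv was built with (a sketch; get_compiler_version and get_compiling_cuda_version are mmcv.ops helpers, and the import is guarded so this also runs where mmcv is absent):

```python
# Sketch: inspect the backend mmcv's compiled extension was built for.
# A CPU-only mmcv wheel has no cuda/hip implementation for ops like NMS,
# which produces "nms_impl: implementation for device cuda:0 not found".
import importlib.util

if importlib.util.find_spec("mmcv") is None:
    print("mmcv is not installed in this interpreter")
else:
    from mmcv.ops import get_compiler_version, get_compiling_cuda_version
    print("mmcv compiler:", get_compiler_version())
    print("mmcv cuda/hip:", get_compiling_cuda_version())
```

If this reports a CPU-only build, mmcv would need to be rebuilt from source against the ROCm toolchain for GPU ops to work.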