tfaehse / DashcamCleaner

Censor identifiable information in videos, in particular dashcam recordings in Germany.
GNU Affero General Public License v3.0

DashcamCleaner hangs at 0% when using AMD GPU #71

Closed: steffenWi closed this issue 1 year ago

steffenWi commented 1 year ago

Hi,

Issue: When DashcamCleaner is launched with CUDA available through ROCm, the program hangs at 0% while keeping one CPU core at 100% and the GPU at 99%. The test video is ~7 seconds long at 30 FPS with a resolution of 2704x1520 px. I let the process run for 5 minutes before aborting. I tried different arguments, in both the UI and the CLI version.

With CUDA/ROCm disabled, the same video with the same arguments takes ~3 minutes to complete.

How I launch the program with ROCm support: HSA_OVERRIDE_GFX_VERSION=10.3.0 python ./cli.py -i ~/trip.mp4 -o ./test.mp4 -t 0.8 -nf --weights 1080p_small_v8.pt -t 0.4 -nf --weights 1080p_small_v8.pt -s 2 -q 6 -f 2 -b 10 -r 1.1

Without ROCm support: CUDA_VISIBLE_DEVICES="1" python ./cli.py -i ~/trip.mp4 -o ./test.mp4 -t 0.4 -nf --weights 1080p_small_v8.pt -s 2 -q 6 -f 2 -b 10 -r 1.1

It hangs at this line: results_list = self.detector(images, imgsz=[scale]) in blurrer.py:detect_identifiable_information.

Information about my system:

CPU: AMD Ryzen 5 5600X
GPU: AMD Radeon RX 5700 XT
Kernel: 6.3.1-arch2-1
Distro: Arch Linux

Python version: 3.11.3

>>> import torch
>>> torch.cuda.is_available()
True
>>> t = torch.tensor([5, 5, 5], dtype=torch.int64, device='cuda')
>>> print(str(t))
tensor([5, 5, 5], device='cuda:0')

The attached log.txt shows the output produced when trying to use my GPU.

joshinils commented 1 year ago

I have no clue about PyTorch; maybe @tfaehse can help with that. It may be that you need the ROCm version of PyTorch: https://pytorch.org/blog/pytorch-for-amd-rocm-platform-now-available-as-python-package/ However, I cannot find the supported GPU list they link to there...

I would like to use this too, if it works. I also have both a CPU and a GPU from AMD, on Ubuntu.

steffenWi commented 1 year ago

As for a list of supported GPUs, you can find that here, and to figure out which GPU is meant by something like gfx900, you can look here. You can also execute clinfo | grep gfx in a terminal, and it will output something like this: Name: gfx1010:xnack-.

As you may notice, my GPU is not supported. More to the point, no RDNA1 GPU is supported. To work around that, you can set HSA_OVERRIDE_GFX_VERSION=10.3.0. That way PyTorch/torchvision will treat the card as a gfx1030, i.e. a Radeon RX 6800 based GPU. I've used that workaround on other projects before and it works just fine, usually.
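To make the lookup above a bit more mechanical, here is a minimal Python sketch that pulls the gfx target names out of clinfo output and builds an environment with the override set. The function names find_gfx_targets and override_env are hypothetical helpers, not part of DashcamCleaner or ROCm, and the input format is assumed to match the single clinfo line quoted above.

```python
import re

# Assumption: gfx target names look like "gfx900", "gfx90a" or "gfx1010",
# as in the clinfo line "Name: gfx1010:xnack-" quoted in this thread.
GFX_PATTERN = re.compile(r"\bgfx[0-9a-f]+\b")


def find_gfx_targets(clinfo_output):
    """Return the distinct gfx target names in clinfo output, in order of appearance."""
    return list(dict.fromkeys(GFX_PATTERN.findall(clinfo_output)))


def override_env(base_env, version="10.3.0"):
    """Return a copy of base_env with HSA_OVERRIDE_GFX_VERSION set.

    The default "10.3.0" is the value used in this thread to make the
    runtime treat the card as a gfx1030. The input mapping is not modified.
    """
    env = dict(base_env)
    env["HSA_OVERRIDE_GFX_VERSION"] = version
    return env
```

For example, find_gfx_targets("Name: gfx1010:xnack-") returns ["gfx1010"], and override_env(os.environ) gives a dictionary that can be passed as env= to subprocess.run when launching cli.py.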

steffenWi commented 1 year ago

Small update: this isn't an issue with DashcamCleaner. I ran some tests and get the same hang with relatively simple Python scripts as well. It seems there is something wrong with my installation, or the GPU is simply unable to do some things.
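The scripts used for those tests are not shown in this thread. As a rough sketch of how such a check could look, the following runs a small GPU workload in a separate interpreter with a timeout, so that a hang is reported instead of blocking the terminal. The PROBE snippet and the helper name run_with_timeout are assumptions for illustration; the probe requires PyTorch and is only a guess at a workload that launches an actual GPU kernel.

```python
import subprocess
import sys

# Hypothetical probe: a small matrix multiply on the GPU, read back to the host.
# Assumes PyTorch is installed; it is only executed if passed to run_with_timeout.
PROBE = (
    "import torch\n"
    "x = torch.rand(256, 256, device='cuda')\n"
    "print('ok', (x @ x).sum().item())\n"
)


def run_with_timeout(code, timeout_s, env=None):
    """Run `code` in a fresh Python interpreter.

    Returns "ok" if it exits with status 0, "failed" for any other exit
    status, and "hung" if it does not finish within timeout_s seconds.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            env=env,  # None means: inherit the current environment
            timeout=timeout_s,
            capture_output=True,
            text=True,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        return "hung"
    return "ok" if result.returncode == 0 else "failed"
```

Calling run_with_timeout(PROBE, 60) with HSA_OVERRIDE_GFX_VERSION set in the environment would then report "hung" on a setup like the one described here, rather than spinning indefinitely. A process stuck inside the GPU driver may not always respond to being killed, so treat the timeout as best effort.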

steffenWi commented 1 year ago

Just in case anyone else runs into this: Some of the things I tried running were the tests included in the 'roctracer' package. Some of them would run, others wouldn't. The ones that didn't run all crashed with segmentation faults.

export HSA_OVERRIDE_GFX_VERSION=10.3.0
./run.sh 0

would already crash, for example. I debugged this with gdb by running gdb ./test/MatrixTranspose and then entering run dry run at the gdb prompt. For all tests that failed, the stack trace ended at this point:

hip::FatBinaryInfo::AddDevProgram (device_id=<optimized out>, this=<optimized out>) at /usr/src/debug/hip-runtime-amd/hipamd-rocm-5.4.3/src/hip_fatbin.cpp:122
122       if (fbd_info->add_dev_prog_ == false) { 

and in every case it turned out that fbd_info was unassigned/NULL. Looking at the source code of hip_fatbin.cpp, I noticed that the current version contains a change that fixes the segmentation fault. This was added directly in front of the if statement:

  if (fbd_info == nullptr) {
    return hipErrorInvalidKernelFile;
  }

I'm now going to wait until version 5.5 hits the Arch Linux repository and see what happens then.

steffenWi commented 1 year ago

Short update: I downloaded the pytorch/rocm Docker container, updated it, and compiled everything for ROCm 5.5, as that version is still not available for Arch Linux. I got the same exception as before. I then swapped my RDNA1 GPU for a friend's RDNA2 GPU. With the RDNA2 GPU it works fine. After swapping back to RDNA1, it fails again.