Facial Recognition does not use the GPU

Sri115 commented 1 month ago

Hi, I am not sure if this is how it is supposed to work, but in my case the facial recognition job uses the CPU rather than the GPU. The GPU is used for transcoding, but the CPU runs at maximum as soon as I upload images.

CPU goes burrrr - https://pasteboard.co/CTJ1iAREUffs.png
GPU transcoding - https://pasteboard.co/IsbIjF0xWPbZ.png

I am not sure what I am missing here

loeeeee commented 1 month ago

CPU goes burrrr - https://pasteboard.co/CTJ1iAREUffs.png

Something else may be pushing the CPU that hard. It is likely sharp, the image thumbnail extraction, that is using that much CPU; it does the same on my machine. However, I am not certain about that. Can I see a screenshot of a task manager, e.g. top (the hard-core one), htop (the less fancy one), or btop (the fancy one), run inside the container?

GPU transcoding - https://pasteboard.co/IsbIjF0xWPbZ.png

Regarding machine learning not using the GPU, can I have the log file at /var/logs/immich/ml.log? It would be very helpful. My uneducated guess is that you may have missed the immich config step, which tells the Immich web server where to find the machine-learning backend.

Sri115 commented 1 month ago

Hi, thanks for your quick follow-up.

1) Here are some pictures from top, btop https://pasteboard.co/47Lc7bQDPXIJ.png https://pasteboard.co/DeeHGx7Kf9rZ.png

2) I am sure I set up the machine learning config as suggested. https://pasteboard.co/11b3RkVth9x0.png https://pasteboard.co/MF0GUtZiL3Vh.png

I am also attaching the ml.log here

Going through the log, I see the error Failed to load library libonnxruntime_providers_cuda.so with error: libcudnn.so.9. So I guess nvidia-cudnn was not correctly installed? But trying to install it again suggests that is not the problem.

https://pasteboard.co/0rR45P0a9enQ.png

loeeeee commented 1 month ago

You are welcome. :)

Here are some pictures from top, btop

You are right about the machine learning. It is eating all of the CPUs.

I am sure I set up the machine learning config as suggested.

Your config is correct.

I am also attaching the ml.log here

There are two types of errors in the log.

[ERROR] Can't connect to ('127.0.0.1', 3003)
[E:onnxruntime:Default, provider_bridge_ort.cc:1745 TryGetProviderInfo_CUDA] /onnxruntime_src/onnxruntime/core/session/provider_bridge_ort.cc:1426 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_cuda.so with error: libcudnn.so.9: cannot open shared object file: No such file or directory

The first one does not matter: it disappeared after some time, so it was probably caused by the startup sequence.
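If you want to confirm that the first error is only a startup race, a minimal sketch like the one below polls the address from the log until it accepts a TCP connection. The function name wait_for_port and its retry parameters are made up for illustration; only the host and port come from the log line.

```python
import socket
import time


def wait_for_port(host: str, port: int, attempts: int = 3, delay: float = 0.5) -> bool:
    """Return True once a TCP connection to (host, port) succeeds.

    Returns False if every attempt fails (e.g. connection refused).
    """
    for attempt in range(attempts):
        try:
            # create_connection raises OSError when nothing is listening.
            with socket.create_connection((host, port), timeout=2.0):
                return True
        except OSError:
            if attempt < attempts - 1:
                time.sleep(delay)
    return False


if __name__ == "__main__":
    # 127.0.0.1:3003 is the address reported in the [ERROR] line above.
    print("ML server reachable:", wait_for_port("127.0.0.1", 3003))
```

If this prints True a little while after the service starts, the "Can't connect" messages were most likely just the other service starting first.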

The second one complains that the cuDNN components cannot be found. On an Ubuntu machine, these components, i.e. the dynamic libraries, are in /usr/lib/x86_64-linux-gnu/. Based on the last screenshot, your machine has nvidia-cudnn 8.2.4.15 installed. As a result, the files available in the lib folder have names ending in ".8", while the Immich machine-learning server is asking for libcudnn.so.9, i.e. a different major version.
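The mismatch can be made visible with a short script. This is only a sketch: it pulls the missing library name out of the ONNX Runtime error line and lists whatever libcudnn files exist in a directory. The helper names missing_library and installed_versions are hypothetical, and the directory is the Ubuntu location mentioned above, which may differ on other distros.

```python
import re
from pathlib import Path
from typing import List, Optional

# Matches the dynamic loader message embedded in the ONNX Runtime error.
MISSING_LIB = re.compile(r"with error: (\S+): cannot open shared object file")


def missing_library(log_line: str) -> Optional[str]:
    """Return the shared library the loader could not open, or None."""
    match = MISSING_LIB.search(log_line)
    return match.group(1) if match else None


def installed_versions(lib_dir: str, stem: str = "libcudnn.so") -> List[str]:
    """Return sorted file names in lib_dir that start with stem."""
    directory = Path(lib_dir)
    if not directory.is_dir():
        return []
    return sorted(p.name for p in directory.glob(stem + "*"))


if __name__ == "__main__":
    line = (
        "Failed to load library libonnxruntime_providers_cuda.so with error: "
        "libcudnn.so.9: cannot open shared object file: No such file or directory"
    )
    print("wanted:", missing_library(line))
    # Assumed Ubuntu library directory; adjust for your system.
    print("found: ", installed_versions("/usr/lib/x86_64-linux-gnu/"))
```

If the "found" list only contains names with an 8 in them while "wanted" is libcudnn.so.9, the installed cuDNN is simply the wrong major version.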

The likely cause is that the distro package manager ships an old nvidia-cudnn, while the NVIDIA driver, Immich, or one of Immich's dependencies, e.g. ONNX Runtime, requires a newer one.

To address this, I recommend uninstalling the nvidia-cudnn package that came from the distro package manager and installing the latest one from NVIDIA's official website. (At the time of writing, it is 9.3.0.) This should fix the issue. 😃
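After reinstalling, a rough way to check the result is to try loading the library the same way the dynamic loader would, and to ask ONNX Runtime which providers its build knows about. This is a sketch under assumptions: it must run in the same Python environment as the machine-learning server, the helper names are made up, and get_available_providers() only reports what the installed onnxruntime build supports, not whether cuDNN actually loads, so the ctypes check is the one that matters for this error.

```python
import ctypes
from typing import Optional


def can_load(library: str) -> bool:
    """Return True if the dynamic loader can open the given shared library."""
    try:
        ctypes.CDLL(library)
        return True
    except OSError:
        return False


def cuda_provider_in_build() -> Optional[bool]:
    """Return whether the installed onnxruntime build lists the CUDA provider.

    Returns None when onnxruntime is not importable in this environment.
    """
    try:
        import onnxruntime
    except ImportError:
        return None
    return "CUDAExecutionProvider" in onnxruntime.get_available_providers()


if __name__ == "__main__":
    # libcudnn.so.9 is the name from the error message in ml.log.
    print("libcudnn.so.9 loadable:", can_load("libcudnn.so.9"))
    print("CUDA provider in build:", cuda_provider_in_build())
```

If the first line still prints False after the reinstall, the new libraries are probably in a directory the loader does not search.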

Sri115 commented 1 month ago

OK, thanks for your input. Since this ticket is turning out to be more of an infrastructure problem on my side, I think it can be closed. I will try out your suggestion and provide feedback over the weekend.

Also, as a side question, Immich has just announced v1.112.1. Do you plan to test and keep your repo up to date with every Immich update as well? That would be a massive effort on your part, but people like me would appreciate it.

loeeeee commented 1 month ago

Also, as a side question, Immich has just announced v1.112.1. Do you plan to test and keep your repo up to date with every Immich update as well? That would be a massive effort on your part, but people like me would appreciate it.

To be honest, I cannot promise a lot. However, I still want to keep this project alive as long as possible, or have it become part of Immich one day. 😺

OK, thanks for your input. Since this ticket is turning out to be more of an infrastructure problem on my side, I think it can be closed. I will try out your suggestion and provide feedback over the weekend.

You are welcome. 😃 I always have issues with NVIDIA things as well, no worries.

loeeeee / immich-in-lxc

Facial Recognition does not use the GPU #7