truenas / charts

TrueNAS SCALE Apps Catalogs & Charts
BSD 3-Clause "New" or "Revised" License

[Immich] No GPU devices available to machine learning pod #2150

Closed patrickli closed 8 months ago

patrickli commented 8 months ago

Hi, thanks for implementing machine learning images in #2146 for Immich.

However, the machine learning pod does not have any GPUs available for hardware acceleration. Could they be exposed the same way as for the microservices pod?

Expected:

root@nas[~]# k3s kubectl exec pods/immich-microservices-bdc89f566-8x5dk -n ix-immich -- ls -al /dev/dri
Defaulted container "immich" out of: immich, immich-init-postgres-wait (init), immich-init-redis-wait (init), immich-init-wait-url (init)
total 0
drwxr-xr-x 2 root root        80 Feb  8 13:37 .
drwxr-xr-x 6 root root       380 Feb  8 13:37 ..
crw-rw---- 1 root video 226,   0 Feb  8 13:37 card0
crw-rw---- 1 root   107 226, 128 Feb  8 13:37 renderD128

Actual:

root@nas[~]# k3s kubectl exec pods/immich-machinelearning-549fb6dff8-sv5vt -n ix-immich -- ls -al /dev/dri
Defaulted container "immich" out of: immich, immich-init-wait-url (init)
ls: cannot access '/dev/dri': No such file or directory
command terminated with exit code 2
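
The expected/actual comparison above can be repeated for several pods at once. The following is a minimal sketch, not part of the chart: `check_dri` and the `KUBECTL` override are hypothetical names introduced here, and the pod names in the usage comment are the ones from this report, which will differ on other systems.

```shell
#!/bin/sh
# check_dri NAMESPACE POD...
# Reports, per pod, whether /dev/dri is visible inside the default container.
# KUBECTL is a hypothetical override for illustration; it defaults to "k3s kubectl".
check_dri() {
  ns=$1
  shift
  kubectl_cmd=${KUBECTL:-k3s kubectl}
  for pod in "$@"; do
    # $kubectl_cmd is intentionally unquoted so "k3s kubectl" splits into two words.
    if $kubectl_cmd exec "pods/$pod" -n "$ns" -- ls /dev/dri >/dev/null 2>&1; then
      echo "$pod: /dev/dri present"
    else
      echo "$pod: /dev/dri missing"
    fi
  done
}

# Usage on the TrueNAS host (pod names taken from the report above):
#   check_dri ix-immich immich-microservices-bdc89f566-8x5dk immich-machinelearning-549fb6dff8-sv5vt
```

A non-zero exit from `ls` (as in the "Actual" output) is what marks a pod as missing the device directory.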

phoropter commented 8 months ago

Hello @stavros-k, this change doesn't seem to work for me with an NVIDIA GPU. The chart itself runs fine, but in the SCALE UI it is stuck on "Deploying", even after a Ctrl+F5 refresh. Since updating to the 3.0.9 chart version, Kubernetes events show: 0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..

I also tried disabling then re-enabling the gpu in the chart settings, it didn't help.

stavros-k commented 8 months ago

What does this command show?

k3s kubectl get pods -A

Can you also send a screenshot of the GPU config in the app?

Thanks

phoropter commented 8 months ago

> k3s kubectl get pods -A

```
NAMESPACE                NAME                                                  READY  STATUS   RESTARTS        AGE
kube-system              csi-nfs-node-7c9cd                                    3/3    Running  44 (5d19h ago)  87d
kube-system              csi-smb-node-knsqh                                    3/3    Running  44 (5d19h ago)  87d
ix-vaultwarden           vaultwarden-cnpg-main-pooler-rw-67669f4959-k72lf      1/1    Running  0               5d19h
kube-system              snapshot-controller-546868dfb4-tkfc5                  1/1    Running  0               5d19h
ix-kubernetes-reflector  kubernetes-reflector-5449cc84f4-rxfzl                 1/1    Running  0               5d19h
ix-radarr                radarr-76b85cbb65-s2f8s                               1/1    Running  0               5d19h
ix-cloudnative-pg        cloudnative-pg-9644cbb4b-8sx9s                        1/1    Running  1 (5d19h ago)   5d19h
ix-metallb               metallb-speaker-65vh2                                 4/4    Running  0               5d19h
ix-auth                  auth-authentik-cnpg-main-1                            1/1    Running  0               5d19h
kube-system              snapshot-controller-546868dfb4-8wpwf                  1/1    Running  0               5d19h
ix-vaultwarden           vaultwarden-cnpg-main-pooler-rw-67669f4959-sqxz8      1/1    Running  0               5d19h
ix-jellyseerr            jellyseerr-8466b89c4f-bhnfm                           1/1    Running  0               5d19h
ix-sonarr                sonarr-768fbd5675-nvx5v                               1/1    Running  0               5d19h
kube-system              amdgpu-device-plugin-daemonset-g5qrk                  1/1    Running  0               5d19h
ix-prometheus-operator   prometheus-operator-kps-operator-5bf5c6654f-bknwb     1/1    Running  0               5d19h
ix-molly-socket          molly-socket-custom-app-78c6d9b8fb-72hk2              1/1    Running  0               5d19h
ix-theme-park            theme-park-6596b49f95-brlmw                           1/1    Running  0               5d19h
ix-vaultwarden           vaultwarden-cnpg-main-2                               1/1    Running  0               5d19h
kube-system              metrics-server-68cf49699b-fwjcs                       1/1    Running  0               5d19h
ix-sonarr-anime          sonarr-anime-55cb7d59f8-tcpsm                         1/1    Running  0               5d19h
ix-flaresolverr          flaresolverr-76d69c9b6d-rfbhq                         1/1    Running  0               5d19h
ix-sabnzbd               sabnzbd-7498f67959-cw44r                              1/1    Running  0               5d19h
ix-recyclarr             recyclarr-7ccf647994-vxw79                            2/2    Running  0               5d19h
ix-cert-manager          cert-manager-certmanager-webhook-688c8b8d4d-xdftv     1/1    Running  0               5d19h
ix-metallb               metallb-controller-5dc548fcfb-gwbhp                   1/1    Running  0               5d19h
kube-system              csi-nfs-controller-7b74694749-8rmhs                   4/4    Running  0               5d19h
ix-nextcloud             nextcloud-cnpg-main-1                                 1/1    Running  0               5d19h
ix-protonmail            protonmail-protonmail-bridge-775dfcc8f4-tlgdh         1/1    Running  0               5d19h
ix-cert-manager          cert-manager-certmanager-cainjector-67d89d8b5b-n7cjb  1/1    Running  0               5d19h
ix-nextcloud             nextcloud-cnpg-main-pooler-rw-7dd69c696c-k7krq        1/1    Running  0               5d19h
ix-homepage              homepage-58df678c97-jf9rg                             2/2    Running  0               5d19h
kube-system              openebs-zfs-node-wmphk                                2/2    Running  0               5d19h
kube-system              csi-smb-controller-7fbbb8fb6f-24l6t                   3/3    Running  0               5d19h
kube-system              coredns-59b4f5bbd5-7xzfk                              1/1    Running  0               5d19h
ix-readarr               readarr-6b6bd977f6-zjclw                              1/1    Running  0               5d19h
ix-cert-manager          cert-manager-certmanager-557965ccc8-8s4sb             1/1    Running  0               5d19h
kube-system              nvidia-device-plugin-daemonset-qq25f                  1/1    Running  0               5d19h
ix-auth                  auth-authentik-cnpg-main-pooler-rw-7db4ccbbd9-qrgxb   1/1    Running  0               5d19h
ix-g-readarr             g-readarr-5968c955c7-lrjds                            1/1    Running  0               5d19h
ix-unpackerr             unpackerr-656b7f7b67-jc26q                            1/1    Running  0               5d19h
ix-gonic                 gonic-55459bbf49-c8vx2                                1/1    Running  0               5d19h
ix-blocky                blocky-redis-0                                        1/1    Running  0               5d19h
ix-scrutiny              scrutiny-6f6786b99-qzzbg                              1/1    Running  0               5d19h
ix-tailscale             tailscale-d485959c4-klvdf                             1/1    Running  1 (5d19h ago)   5d19h
ix-auth                  auth-redis-0                                          1/1    Running  0               5d19h
ix-ntfy                  ntfy-6b576b8589-685wx                                 1/1    Running  0               5d19h
ix-bazarr                bazarr-6c4db85f9b-pzs87                               1/1    Running  1 (5d19h ago)   5d19h
ix-vaultwarden           vaultwarden-7887b86ccc-ljvcp                          1/1    Running  0               5d19h
kube-system              openebs-zfs-controller-0                              5/5    Running  0               5d19h
ix-parkour-paradise      parkour-paradise-minecraft-java-84b8b86b6-5xjwj       1/1    Running  4 (5d19h ago)   5d19h
ix-blocky                blocky-6999b4c797-pkhbp                               2/2    Running  0               5d19h
ix-minecraft-w           minecraft-w-minecraft-java-77b8d8776c-9fxcn           1/1    Running  4 (5d19h ago)   5d19h
ix-cube-survival         cube-survival-minecraft-java-6c86559b66-f7bbb         1/1    Running  3 (5d19h ago)   5d19h
ix-auth                  auth-authentik-worker-c977b7cd4-m42l7                 1/1    Running  0               5d19h
ix-auth                  auth-authentik-7776d7fc64-nnkvx                       1/1    Running  0               5d19h
ix-change-detection      change-detection-custom-app-59464dbb76-jmrqn          1/1    Running  0               4d10h
ix-minecraft-java        minecraft-java-6d7c5c4fd4-twm29                       1/1    Running  0               3d15h
ix-qbittorrent           qbittorrent-664cbc78d6-4b8gj                          2/2    Running  0               2d23h
ix-qbittorrent           qbittorrent-qbitportforward-676cb875bb-mkjtn          1/1    Running  2 (2d23h ago)   2d23h
ix-immich-tn             immich-tn-redis-7499b458bf-vrxwc                      1/1    Running  0               16h
ix-immich-tn             immich-tn-postgres-84d489685c-kcb5g                   1/1    Running  0               16h
ix-immich-tn             immich-tn-microservices-557f87948c-5wvhz              0/1    Pending  0               16h
ix-immich-tn             immich-tn-7dfcb5cc84-7qcmk                            1/1    Running  0               16h
ix-immich-tn             immich-tn-machinelearning-649d5bfc9c-rpt58            1/1    Running  0               16h
ix-jellyfin              jellyfin-5dff4ffc7d-kgxzg                             1/1    Running  0               6h16m
ix-emby                  emby-6876b49554-v5f8v                                 1/1    Running  0               6h16m
ix-ollama                ollama-ui-7f868df586-bxgdf                            1/1    Running  0               6h15m
ix-ollama                ollama-5f79f6b8d5-6lf9m                               1/1    Running  0               6h15m
ix-prowlarr              prowlarr-57f96c674b-hhwzl                             1/1    Running  0               6h15m
ix-omada                 omada-omada-controller-5f84ccb4cf-7xqdl               1/1    Running  0               6h15m
ix-nextcloud             nextcloud-redis-0                                     1/1    Running  0               6h14m
ix-traefik               traefik-689b5b4794-2mxgg                              1/1    Running  0               6h14m
ix-nextcloud             nextcloud-collabora-7945454648-m4nj8                  1/1    Running  0               6h14m
ix-nextcloud             nextcloud-imaginary-89546bd75-wt6ft                   1/1    Running  0               6h14m
ix-nextcloud             nextcloud-84795499f7-9m96d                            1/1    Running  0               6h14m
ix-nextcloud             nextcloud-notify-69f5b69ccf-wwt6l                     1/1    Running  0               6h14m
ix-nextcloud             nextcloud-nginx-77f9548b94-kk7v6                      1/1    Running  0               6h14m
```

I think this is what you mean by GPU config in the app?

(image: screenshot of the GPU configuration in the app settings)
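
In a pod listing this long, the one entry that is not `Running` (here the Immich microservices pod in `Pending`) is easy to overlook. A small filter can surface it. This is only a sketch: `not_running` is a hypothetical helper name, and it assumes the default column order of `kubectl get pods -A` (NAMESPACE, NAME, READY, STATUS, ...).

```shell
#!/bin/sh
# not_running: reads `kubectl get pods -A` output on stdin and prints
# "namespace/pod STATUS" for every pod that is neither Running nor Completed.
# Assumes STATUS is the 4th whitespace-separated column and line 1 is the header.
not_running() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1 "/" $2 " " $4 }'
}

# Usage on the TrueNAS host:
#   k3s kubectl get pods -A | not_running
```

For a pod reported this way, `k3s kubectl describe pod <name> -n <namespace>` shows the scheduling events, such as the "Insufficient nvidia.com/gpu" message quoted earlier in the thread.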

stavros-k commented 8 months ago

Hello, did it work with an NVIDIA GPU before the recent change? What GPU do you have? Can you also grab the logs from the nvidia-device-plugin?

k3s kubectl logs -n kube-system nvidia-device-plugin-daemonset-qq25f

Thanks

phoropter commented 8 months ago

> Did it work with an NVIDIA GPU before the recent change?

Yes, I've had the GPU mounted into the container for a few months now without issue. I'm using an A4000.

> k3s kubectl logs -n kube-system nvidia-device-plugin-daemonset-qq25f

```
2024/02/06 18:37:07 Starting FS watcher.
2024/02/06 18:37:07 Starting OS watcher.
2024/02/06 18:37:07 Starting Plugins.
2024/02/06 18:37:07 Loading configuration.
2024/02/06 18:37:07 Updating config with default resource matching patterns.
2024/02/06 18:37:07 Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": "envvar",
      "deviceIDStrategy": "uuid"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {
      "failRequestsGreaterThanOne": true,
      "resources": [
        {
          "name": "nvidia.com/gpu",
          "devices": "all",
          "replicas": 5
        }
      ]
    }
  }
}
2024/02/06 18:37:07 Retreiving plugins.
2024/02/06 18:37:07 Detected NVML platform: found NVML library
2024/02/06 18:37:07 Detected non-Tegra platform: /sys/devices/soc0/family file not found
2024/02/06 18:37:07 Starting GRPC server for 'nvidia.com/gpu'
2024/02/06 18:37:07 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2024/02/06 18:37:07 Registered device plugin for 'nvidia.com/gpu' with Kubelet
```

stavros-k commented 8 months ago

Hmm, I was hoping we would have some errors there to work with. Is there any chance that you have assigned the GPU to other apps? At most 5 pods can consume a single GPU, and Immich consumes it with 2 pods if you assign it, so you are left with 3 for other apps.
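
The slot arithmetic described here (5 shared GPU slots, 2 of them taken by Immich) can be checked against what pods actually request. The sketch below is an illustration under assumptions: `gpu_slots_left` is a hypothetical helper, the total of 5 comes from the `"replicas": 5` time-slicing setting in the plugin log above, and the `custom-columns` expression in the usage comment should be verified on your system before relying on it.

```shell
#!/bin/sh
# gpu_slots_left TOTAL
# Reads lines of "namespace pod gpu-limit" on stdin, where gpu-limit is a number,
# a comma-separated list (one entry per container), or "<none>".
# Prints TOTAL minus the sum of all numeric limits.
gpu_slots_left() {
  total=$1
  awk -v total="$total" '
    {
      n = split($3, parts, ",")
      for (i = 1; i <= n; i++)
        if (parts[i] ~ /^[0-9]+$/) used += parts[i]
    }
    END { print total - used }
  '
}

# Usage on the TrueNAS host (column expression is an assumption to verify):
#   k3s kubectl get pods -A --no-headers \
#     -o custom-columns='NS:.metadata.namespace,POD:.metadata.name,GPU:.spec.containers[*].resources.limits.nvidia\.com/gpu' \
#     | gpu_slots_left 5
```

Pending pods are included in the sum, so a negative result means more slots are requested than the plugin advertises, which matches the "Insufficient nvidia.com/gpu" scheduling event.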

phoropter commented 8 months ago

Ah, didn't realize it uses up 2 slots of the GPU allocation. Let me check, should only be active in emby & ollama.

phoropter commented 8 months ago

I'm sorry to have wasted your time. It turns out too many pods were using the GPU. I forgot that I had added it to jellyfin. I also had it in the truecharts immich, but that app is stopped and shouldn't affect this, right?

stavros-k commented 8 months ago

> I'm sorry to have wasted your time. Turns out it was too many pods using the GPU. Forgot I added it to jellyfin, had it in the truecharts immich, but that is stopped and shouldn't affect this right?

I'm not sure how the plugin counts the used GPUs. Even if an app is stopped, its manifests are still there; they just say "have 0 pods running".

I'll close this one now!

Glad you figured it out!
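
On the open question of stopped apps: if a stopped app keeps its manifests and only runs 0 pods, as described above, it should show up as a deployment with `0/0` ready replicas. The sketch below lists those. `stopped_deployments` is a hypothetical helper, and whether such deployments still count against the GPU limit is not confirmed in this thread.

```shell
#!/bin/sh
# stopped_deployments: reads `kubectl get deployments -A` output on stdin and
# prints "namespace/name" for every deployment whose READY column is 0/0.
# Assumes the default columns: NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE.
stopped_deployments() {
  awk 'NR > 1 && $3 == "0/0" { print $1 "/" $2 }'
}

# Usage on the TrueNAS host:
#   k3s kubectl get deployments -A | stopped_deployments
```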