immich-app / immich

High performance self-hosted photo and video management solution.
https://immich.app
GNU Affero General Public License v3.0

HW acceleration error with CUDA (CUDA failure 999: unknown error) #13045

Closed: eSascha closed this issue 3 weeks ago

eSascha commented 3 weeks ago

The bug

Face-detection and facial-recognition jobs are failing with "CUDA failure 999: unknown error".

The OS that Immich Server is running on

Ubuntu 22.04.5 LTS

Version of Immich Server

116.2

Version of Immich Mobile App

116.2

Platform with the issue

Your docker-compose.yml content

version: "3.8"

services:
  immich-server:
    container_name: immich_server
    image: ghcr.io/immich-app/immich-server:${IMMICH_VERSION:-release}
    ports:
      - "MY_PORT:3001"
    extends:
      file: hwaccel.transcoding.yml
      service: nvenc
    volumes:
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
      - /etc/timezone:/etc/timezone:ro
      - /etc/localtime:/etc/localtime:ro
    env_file:
      - .env
    restart: always

  immich-machine-learning:
    container_name: immich_machine_learning
    image: ghcr.io/immich-app/immich-machine-learning:${IMMICH_VERSION:-release}-cuda
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities:
                - gpu
    volumes:
      - model-cache:/cache
    env_file:
      - .env
    restart: always

  immich-database:
    container_name: immich_postgres
    image: tensorchord/pgvecto-rs:pg14-v0.2.0@sha256:90724186f0a3517cf6914295b5ab410db9ce23190a2d9d0b9dd6463e3fa298f0
    env_file:
      - .env
    environment:
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_USER: ${DB_USERNAME}
      POSTGRES_DB: ${DB_DATABASE_NAME}
    volumes:
      - pgdata:/var/lib/postgresql/data
    restart: always

volumes:
  pgdata:
  model-cache:

Your .env content

DB_HOSTNAME=immich_postgres
DB_USERNAME=REDACTED_USERNAME
DB_PASSWORD=REDACTED_PASSWORD
DB_DATABASE_NAME=immich
REDIS_HOSTNAME=IP_OF_REDIS_HOSTNAME
REDIS_PORT=6379
REDIS_USERNAME=REDACTED_USERNAME
REDIS_PASSWORD=REDACTED_PASSWORD
UPLOAD_LOCATION=/path/to/immich
PORT=3000
SERVER_PORT=3001
MICROSERVICES_PORT=3002
IMMICH_SERVER_URL=http://immich-server:3001
IMMICH_MACHINE_LEARNING_URL=http://immich-machine-learning:3003
MACHINE_LEARNING_HOST=0.0.0.0
MACHINE_LEARNING_PORT=3003

Reproduction steps

Every time I try to launch the facial-recognition or the face-detection jobs, I receive the errors attached in the log file.

immich cuda error.txt

Relevant log output

2024-09-30T06:09:00.391145608Z [09/30/24 06:09:00] INFO     Setting execution providers to                     
2024-09-30T06:09:00.391153427Z                              ['CUDAExecutionProvider', 'CPUExecutionProvider'], 
2024-09-30T06:09:00.391156245Z                              in descending order of preference                  
2024-09-30T06:09:00.485280880Z *************** EP Error ***************
2024-09-30T06:09:00.485329394Z EP Error /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:123 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, common::Status> = void] /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:116 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, common::Status> = void] CUDA failure 999: unknown error ; GPU=834433856 ; hostname=0b5099759eb2 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider.cc ; line=280 ; expr=cudaSetDevice(info_.device_id); 
2024-09-30T06:09:00.485351618Z 
2024-09-30T06:09:00.485354612Z  when using ['CUDAExecutionProvider', 'CPUExecutionProvider']
2024-09-30T06:09:00.485356792Z Falling back to ['CUDAExecutionProvider', 'CPUExecutionProvider'] and retrying.
2024-09-30T06:09:00.485358835Z ****************************************
2024-09-30T06:09:00.509912583Z [09/30/24 06:09:00] INFO     Attempt #2 to load detection model 'buffalo_l' to  
2024-09-30T06:09:00.509931712Z                              memory          
2024-09-30T06:09:00.391145608Z [09/30/24 06:09:00] INFO     Setting execution providers to                     
2024-09-30T06:09:00.511162035Z                              ['CUDAExecutionProvider', 'CPUExecutionProvider'], 
2024-09-30T06:09:00.511165692Z                              in descending order of preference                  
2024-09-30T06:09:00.533384547Z *************** EP Error ***************
2024-09-30T06:09:00.533404199Z EP Error /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:123 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, common::Status> = void] /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:116 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, common::Status> = void] CUDA failure 999: unknown error ; GPU=32575 ; hostname=0b5099759eb2 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider.cc ; line=280 ; expr=cudaSetDevice(info_.device_id); 
2024-09-30T06:09:00.533408816Z 
2024-09-30T06:09:00.533411027Z  when using ['CUDAExecutionProvider', 'CPUExecutionProvider']
2024-09-30T06:09:00.534361340Z Falling back to ['CUDAExecutionProvider', 'CPUExecutionProvider'] and retrying.
2024-09-30T06:09:00.534372232Z ****************************************
2024-09-30T06:09:00.963176543Z [09/30/24 06:09:00] ERROR    Exception in ASGI application
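
The useful part of each EP Error line is its tail: the CUDA error code, the reported GPU value, and the call that failed. A minimal Python sketch for pulling those fields out follows. The function name `parse_ep_error` and the regular expression are illustrative assumptions, written only against the line format shown in this log, not against any documented ONNX Runtime format.

```python
import re

# Matches the tail of an "EP Error" line as it appears in the log above:
#   CUDA failure <code>: <message> ; GPU=<n> ; hostname=<h> ; file=<f> ; line=<n> ; expr=<call>;
# This pattern is an assumption derived from this one log, not a stable format.
_EP_ERROR = re.compile(
    r"CUDA failure (?P<code>\d+): (?P<message>[^;]+?) ; "
    r"GPU=(?P<gpu>-?\d+) ; hostname=(?P<hostname>\S+) ; "
    r"file=(?P<file>\S+) ; line=(?P<line>\d+) ; expr=(?P<expr>.+?);\s*$"
)


def parse_ep_error(log_line):
    """Return the CUDA failure fields of an EP Error line as a dict, or None."""
    match = _EP_ERROR.search(log_line)
    if match is None:
        return None
    fields = match.groupdict()
    # Numeric fields are converted so they can be compared across attempts.
    for key in ("code", "gpu", "line"):
        fields[key] = int(fields[key])
    return fields


# Tail of the first EP Error line from the log above (C++ signature prefix omitted).
SAMPLE = (
    "CUDA failure 999: unknown error ; GPU=834433856 ; hostname=0b5099759eb2 ; "
    "file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider.cc ; "
    "line=280 ; expr=cudaSetDevice(info_.device_id); "
)

if __name__ == "__main__":
    print(parse_ep_error(SAMPLE))
```

Run against both attempts in the log, this gives the same code (999) and the same failing expression (`cudaSetDevice(info_.device_id)`), while the GPU value changes from 834433856 to 32575. That suggests the GPU field is not a meaningful device index here and that the failure happens before any model is loaded.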

Additional information

GPU: NVIDIA GeForce GTX 1050 Ti. This is the nvidia-smi output from inside the container:

root@0b5099759eb2:/usr/src/app# nvidia-smi
Mon Sep 30 06:45:02 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1050 Ti     Off |   00000000:01:00.0 Off |                  N/A |
|  0%   46C    P0             N/A /   90W |       0MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

bo0tzz commented 3 weeks ago

cc @mertalev

eSascha commented 3 weeks ago

No idea why, but after uninstalling and reinstalling the NVIDIA driver, all ML jobs are working without errors.
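
For anyone hitting the same error: nvidia-smi queries the driver through NVML, so a clean nvidia-smi table (as in the output above) does not by itself show that CUDA can initialize. A rough way to check CUDA initialization directly, without ONNX Runtime, is to call `cuInit` from the driver library. The sketch below is an assumption about how one might do that with `ctypes`. The helper names `probe_cuda_init` and `describe` are hypothetical, and it was not part of the original troubleshooting in this issue.

```python
import ctypes

CUDA_SUCCESS = 0
CUDA_ERROR_UNKNOWN = 999  # same numeric code as "CUDA failure 999" in the log


def probe_cuda_init(library="libcuda.so.1"):
    """Call cuInit(0) from the NVIDIA driver library and return its result code.

    Returns None if the library (or the cuInit symbol) cannot be loaded,
    which usually means the driver is not visible in this environment.
    """
    try:
        libcuda = ctypes.CDLL(library)
        cu_init = libcuda.cuInit
    except (OSError, AttributeError):
        return None
    cu_init.argtypes = [ctypes.c_uint]
    cu_init.restype = ctypes.c_int
    return cu_init(0)


def describe(result):
    """Turn a probe result into a short human-readable message."""
    if result is None:
        return "driver library not found"
    if result == CUDA_SUCCESS:
        return "cuInit succeeded"
    if result == CUDA_ERROR_UNKNOWN:
        return "cuInit failed with 999 (unknown error)"
    return f"cuInit failed with code {result}"


if __name__ == "__main__":
    print(describe(probe_cuda_init()))
```

If this reports a failure on the host as well as in the container, the problem is more likely in the driver installation than in the Immich or Docker configuration, which would be consistent with the reinstall fixing it.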