Closed: eSascha closed this issue 3 weeks ago
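The bug
Face-detection and facial-recognition jobs are failing with CUDA failure 999: unknown error
The OS that Immich Server is running on
Ubuntu 22.04.5 LTS
Version of Immich Server
116.2
Version of Immich Mobile App
116.2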
Your docker-compose.yml content
version: "3.8"
services:
  immich-server:
    container_name: immich_server
    image: ghcr.io/immich-app/immich-server:${IMMICH_VERSION:-release}
    ports:
      - "MY_PORT:3001"
    extends:
      file: hwaccel.transcoding.yml
      service: nvenc
    volumes:
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
      - /etc/timezone:/etc/timezone:ro
      - /etc/localtime:/etc/localtime:ro
    env_file:
      - .env
    restart: always
  immich-machine-learning:
    container_name: immich_machine_learning
    image: ghcr.io/immich-app/immich-machine-learning:${IMMICH_VERSION:-release}-cuda
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities:
                - gpu
    volumes:
      - model-cache:/cache
    env_file:
      - .env
    restart: always
  immich-database:
    container_name: immich_postgres
    image: tensorchord/pgvecto-rs:pg14-v0.2.0@sha256:90724186f0a3517cf6914295b5ab410db9ce23190a2d9d0b9dd6463e3fa298f0
    env_file:
      - .env
    environment:
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_USER: ${DB_USERNAME}
      POSTGRES_DB: ${DB_DATABASE_NAME}
    volumes:
      - pgdata:/var/lib/postgresql/data
    restart: always
volumes:
  pgdata:
  model-cache:
Your .env content
DB_HOSTNAME=immich_postgres
DB_USERNAME=REDACTED_USERNAME
DB_PASSWORD=REDACTED_PASSWORD
DB_DATABASE_NAME=immich
REDIS_HOSTNAME=IP_OF_REDIS_HOSTNAME
REDIS_PORT=6379
REDIS_USERNAME=REDACTED_USERNAME
REDIS_PASSWORD=REDACTED_PASSWORD
UPLOAD_LOCATION=/path/to/immich
PORT=3000
SERVER_PORT=3001
MICROSERVICES_PORT=3002
IMMICH_SERVER_URL=http://immich-server:3001
IMMICH_MACHINE_LEARNING_URL=http://immich-machine-learning:3003
MACHINE_LEARNING_HOST=0.0.0.0
MACHINE_LEARNING_PORT=3003
Reproduction steps
Every time I launch the facial-recognition or face-detection jobs, I get the errors shown in the attached log file.
immich cuda error.txt
Relevant log output
2024-09-30T06:09:00.391145608Z [09/30/24 06:09:00] INFO Setting execution providers to
2024-09-30T06:09:00.391153427Z ['CUDAExecutionProvider', 'CPUExecutionProvider'],
2024-09-30T06:09:00.391156245Z in descending order of preference
2024-09-30T06:09:00.485280880Z *************** EP Error ***************
2024-09-30T06:09:00.485329394Z EP Error /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:123 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, common::Status> = void] /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:116 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, common::Status> = void] CUDA failure 999: unknown error ; GPU=834433856 ; hostname=0b5099759eb2 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider.cc ; line=280 ; expr=cudaSetDevice(info_.device_id);
2024-09-30T06:09:00.485351618Z
2024-09-30T06:09:00.485354612Z when using ['CUDAExecutionProvider', 'CPUExecutionProvider']
2024-09-30T06:09:00.485356792Z Falling back to ['CUDAExecutionProvider', 'CPUExecutionProvider'] and retrying.
2024-09-30T06:09:00.485358835Z ****************************************
2024-09-30T06:09:00.509912583Z [09/30/24 06:09:00] INFO Attempt #2 to load detection model 'buffalo_l' to
2024-09-30T06:09:00.509931712Z memory
2024-09-30T06:09:00.391145608Z [09/30/24 06:09:00] INFO Setting execution providers to
2024-09-30T06:09:00.511162035Z ['CUDAExecutionProvider', 'CPUExecutionProvider'],
2024-09-30T06:09:00.511165692Z in descending order of preference
2024-09-30T06:09:00.533384547Z *************** EP Error ***************
2024-09-30T06:09:00.533404199Z EP Error /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:123 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, common::Status> = void] /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:116 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, common::Status> = void] CUDA failure 999: unknown error ; GPU=32575 ; hostname=0b5099759eb2 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider.cc ; line=280 ; expr=cudaSetDevice(info_.device_id);
2024-09-30T06:09:00.533408816Z
2024-09-30T06:09:00.533411027Z when using ['CUDAExecutionProvider', 'CPUExecutionProvider']
2024-09-30T06:09:00.534361340Z Falling back to ['CUDAExecutionProvider', 'CPUExecutionProvider'] and retrying.
2024-09-30T06:09:00.534372232Z ****************************************
2024-09-30T06:09:00.963176543Z [09/30/24 06:09:00] ERROR Exception in ASGI application
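For anyone comparing their own logs against this one, here is a minimal sketch that pulls the code, GPU field, and failing expression out of each "CUDA failure" summary. The regular expression is an assumption based only on the format of the lines in this log (it is not an ONNX Runtime API), and `extract_cuda_failures` is a hypothetical helper name.

```python
import re

# Pattern inferred from the "CUDA failure ..." summaries in the log above.
# Assumption: fields are separated by " ; " and expr ends with ");".
CUDA_FAILURE = re.compile(
    r"CUDA failure (?P<code>\d+): (?P<message>[^;]+?) ; "
    r"GPU=(?P<gpu>-?\d+) ; hostname=(?P<hostname>\S+) ; "
    r"file=(?P<file>\S+) ; line=(?P<line>\d+) ; expr=(?P<expr>.+?\));"
)


def extract_cuda_failures(log_text):
    """Return one dict of string fields per 'CUDA failure' summary found."""
    return [match.groupdict() for match in CUDA_FAILURE.finditer(log_text)]


if __name__ == "__main__":
    # "immich cuda error.txt" is the attachment name from this report.
    with open("immich cuda error.txt", encoding="utf-8") as handle:
        for failure in extract_cuda_failures(handle.read()):
            print(failure["code"], failure["gpu"], failure["expr"])
```

Run against the log in this report, both attempts would show the same code (999) and the same expression (the cudaSetDevice call), while the GPU field differs between them (834433856 and 32575).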
Additional information
GPU: NVIDIA GTX 1050 Ti. This is the nvidia-smi output from inside the container:
root@0b5099759eb2:/usr/src/app# nvidia-smi
Mon Sep 30 06:45:02 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1050 Ti     Off |   00000000:01:00.0 Off |                  N/A |
|  0%   46C    P0              N/A /  90W |       0MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
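nvidia-smi listing the card does not by itself show that CUDA can be initialised, since nvidia-smi queries the driver through NVML rather than through the CUDA runtime. As a rough extra check, the sketch below loads the CUDA driver library with ctypes and calls `cuInit` and `cuDeviceGetCount`, which are real CUDA driver API functions. The library name `libcuda.so.1` assumes a Linux container, and `probe_cuda_driver` is a hypothetical helper name. Whether it would have reproduced this particular failure is not known.

```python
import ctypes


def probe_cuda_driver():
    """Try to initialise the CUDA driver API and count visible devices.

    Returns (ok, detail). A missing library is reported, not raised.
    """
    try:
        # Assumption: Linux naming of the driver library.
        libcuda = ctypes.CDLL("libcuda.so.1")
    except OSError as exc:
        return False, f"could not load libcuda.so.1: {exc}"

    # cuInit(0) returns a CUresult; 0 is CUDA_SUCCESS.
    result = libcuda.cuInit(0)
    if result != 0:
        return False, f"cuInit failed with CUresult {result}"

    count = ctypes.c_int(0)
    result = libcuda.cuDeviceGetCount(ctypes.byref(count))
    if result != 0:
        return False, f"cuDeviceGetCount failed with CUresult {result}"

    return True, f"{count.value} CUDA device(s) visible"
```

Calling `print(probe_cuda_driver())` with the container's Python would give a quick yes/no that is independent of ONNX Runtime. A nonzero CUresult here would point at the driver or container runtime setup rather than at Immich itself.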
cc @mertalev
No idea why, but after uninstalling the driver and installing it again, all ML jobs work without errors.