immich-app / immich

High performance self-hosted photo and video management solution.
https://immich.app
GNU Affero General Public License v3.0

Hardware accelerated Machine Learning - Worker (pid:165) was sent code 134! #12120

Closed: TiemoW closed this issue 1 week ago

TiemoW commented 2 weeks ago

The bug

Hi, I have issues with hardware-accelerated machine learning using the openvino image and a passed-through iGPU (Intel i5-11400, Rocket Lake, UHD 730). The passthrough is recognized by the VM itself, but when I start any machine learning task, I get the error mentioned in the title.

Machine learning without hardware acceleration works flawlessly.

The OS that Immich Server is running on

Debian 12 VM in Proxmox VE 8.2.4 (kernel version Linux 6.8.12-1-pve)

Version of Immich Server

v1.112.1

Version of Immich Mobile App

1.112.1 build.169

Platform with the issue

Your docker-compose.yml content

#
# WARNING: Make sure to use the docker-compose.yml of the current release:
#
# https://github.com/immich-app/immich/releases/latest/download/docker-compose.yml
#
# The compose file on main may not be compatible with the latest release.
#

name: immich

services:
  immich-server:
    container_name: immich_server
    image: ghcr.io/immich-app/immich-server:${IMMICH_VERSION:-release}
    # extends:
    #   file: hwaccel.transcoding.yml
    #   service: quicsync # set to one of [nvenc, quicksync, rkmpp, vaapi, vaapi-wsl] for accelerated transcoding
    devices:
      - /dev/dri:/dev/dri

    volumes:
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
      - /etc/localtime:/etc/localtime:ro
    env_file:
      - stack.env
    ports:
      - 2283:3001
    depends_on:
      - redis
      - database
    restart: always

  immich-machine-learning:
    container_name: immich_machine_learning
    # For hardware acceleration, add one of -[armnn, cuda, openvino] to the image tag.
    # Example tag: ${IMMICH_VERSION:-release}-cuda
    image: ghcr.io/immich-app/immich-machine-learning:${IMMICH_VERSION:-release}-openvino
    # extends: # uncomment this section for hardware acceleration - see https://immich.app/docs/features/ml-hardware-acceleration
    #   file: ${ML_FILE}
    #   service: openvino # set to one of [armnn, cuda, openvino, openvino-wsl] for accelerated inference - use the `-wsl` version for WSL2 where applicable
    device_cgroup_rules:
      - 'c 189:* rmw'
    devices:
      - /dev/dri:/dev/dri
    group_add:
      - '105'
    volumes:
      - /dev/bus/usb:/dev/bus/usb
      - ./model-cache:/cache
    environment:
      - NEOReadDebugKeys=1
      - OverrideGpuAddressSpace=48
      - ORT_OPENVINO_ENABLE_CI_LOG=1
      - ORT_OPENVINO_ENABLE_DEBUG=1
      - OPENVINO_LOG_LEVEL=5
      - LOG_LEVEL=debug
    env_file:
      - stack.env
    restart: always

  redis:
    container_name: immich_redis
    image: docker.io/redis:6.2-alpine@sha256:328fe6a5822256d065debb36617a8169dbfbd77b797c525288e465f56c1d392b
    healthcheck:
      test: redis-cli ping || exit 1
    restart: always

  database:
    container_name: immich_postgres
    image: docker.io/tensorchord/pgvecto-rs:pg14-v0.2.0@sha256:90724186f0a3517cf6914295b5ab410db9ce23190a2d9d0b9dd6463e3fa298f0
    environment:
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_USER: ${DB_USERNAME}
      POSTGRES_DB: ${DB_DATABASE_NAME}
      POSTGRES_INITDB_ARGS: '--data-checksums'
    volumes:
      - ${DB_DATA_LOCATION}:/var/lib/postgresql/data
    healthcheck:
      test: pg_isready --dbname='${DB_DATABASE_NAME}' --username='${DB_USERNAME}' || exit 1; Chksum="$$(psql --dbname='${DB_DATABASE_NAME}' --username='${DB_USERNAME}' --tuples-only --no-align --command='SELECT COALESCE(SUM(checksum_failures), 0) FROM pg_stat_database')"; echo "checksum failure count is $$Chksum"; [ "$$Chksum" = '0' ] || exit 1
      interval: 5m
      #start_interval: 30s
      start_period: 5m
    command: ["postgres", "-c" ,"shared_preload_libraries=vectors.so", "-c", 'search_path="$$user", public, vectors', "-c", "logging_collector=on", "-c", "max_wal_size=2GB", "-c", "shared_buffers=512MB", "-c", "wal_compression=on"]
    restart: always

Your .env content

IMMICH_VERSION=release
DB_PASSWORD=*
DB_USERNAME=*
DB_DATABASE_NAME=*
IMMICH_LOG_LEVEL=verbose
MACHINE_LEARNING_WORKER_TIMEOUT=300

Reproduction steps

  1. Start any machine learning task with the specified setup/docker compose
  2. ...

Relevant log output

[08/29/24 17:33:40] INFO     Booting worker with pid: 165                       
[08/29/24 17:33:41] DEBUG    Could not load ANN shared libraries, using ONNX:   
                             libmali.so: cannot open shared object file: No such
                             file or directory                                  
[08/29/24 17:33:45] INFO     Started server process [165]                       
[08/29/24 17:33:45] INFO     Waiting for application startup.                   
[08/29/24 17:33:45] INFO     Created in-memory cache with unloading after 300s  
                             of inactivity.                                     
[08/29/24 17:33:45] INFO     Initialized request thread pool with 4 threads.    
[08/29/24 17:33:45] DEBUG    Checking for inactivity...                         
[08/29/24 17:33:45] INFO     Application startup complete.                      
[08/29/24 17:33:45] DEBUG    Setting model format to onnx                       
[08/29/24 17:33:45] INFO     Loading visual model 'ViT-B-32__openai' to memory  
[08/29/24 17:33:45] DEBUG    Loading visual preprocessing config for CLIP model 
                             'ViT-B-32__openai'                                 
[08/29/24 17:33:45] DEBUG    Loaded visual preprocessing config for CLIP model  
                             'ViT-B-32__openai'                                 
[08/29/24 17:33:45] DEBUG    Available ORT providers:                           
                             {'OpenVINOExecutionProvider',                      
                             'CPUExecutionProvider'}                            
[08/29/24 17:33:45] DEBUG    Available OpenVINO devices: ['CPU', 'GPU']         
[08/29/24 17:33:45] INFO     Setting execution providers to                     
                             ['OpenVINOExecutionProvider',                      
                             'CPUExecutionProvider'], in descending order of   
                             preference                                         
[08/29/24 17:33:45] DEBUG    Setting execution provider options to              
                             [{'device_type': 'GPU', 'precision': 'FP32',       
                             'cache_dir':                                       
                             '/cache/clip/ViT-B-32__openai/visual/openvino'},   
                             {'arena_extend_strategy': 'kSameAsRequested'}]     
[08/29/24 17:33:45] DEBUG    Setting execution_mode to ORT_SEQUENTIAL           
[08/29/24 17:33:45] DEBUG    Setting inter_op_num_threads to 0                  
[08/29/24 17:33:45] DEBUG    Setting intra_op_num_threads to 0                  
In the OpenVINO EP
Model is fully supported on OpenVINO
Abort was called at 70 line in file:
./shared/source/program/program_initialization.cpp
[08/29/24 17:33:50] ERROR    Worker (pid:165) was sent code 134!

Additional information

No response
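
A note on the log above: the code 134 is consistent with the "Abort was called" line printed just before it. Below is a minimal Python sketch, assuming Linux signal numbering, that decodes the value under two common readings: a shell-style exit code (128 + signal number) or a raw wait status whose low 7 bits hold the signal. Which of the two the worker manager actually reports is an assumption here, and the function names are hypothetical. Both readings point at signal 6 (SIGABRT).

```python
import signal

CODE = 134  # value from the "Worker (pid:165) was sent code 134!" log line


def signal_from_shell_exit_code(code: int) -> signal.Signals:
    """Interpret the code as 128 + signal number (shell convention)."""
    return signal.Signals(code - 128)


def signal_from_wait_status(status: int) -> signal.Signals:
    """Interpret the code as a raw wait status: the low 7 bits hold the signal."""
    return signal.Signals(status & 0x7F)


if __name__ == "__main__":
    print(signal_from_shell_exit_code(CODE).name)
    print(signal_from_wait_status(CODE).name)
```

In other words, the worker did not exit on its own; it was aborted from native code, which matches the abort message coming from the GPU runtime rather than from Python.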

TiemoW commented 1 week ago

I was able to resolve the issue.

Settings for the VM on the Proxmox host:

Set the processor type to "host".
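
One way to check from inside the guest that the processor setting took effect is to look at the CPU model name the VM reports. This is a minimal sketch, assuming a Linux guest and assuming that emulated CPU types show up with a generic model name such as "QEMU Virtual CPU" or "Common KVM processor". The marker list is illustrative and not exhaustive, and the helper names are hypothetical.

```python
from pathlib import Path
from typing import Optional

# Assumed substrings of generic model names exposed by emulated CPU types.
GENERIC_MARKERS = ("QEMU Virtual CPU", "Common KVM processor")


def cpu_model_name(cpuinfo_text: str) -> Optional[str]:
    """Return the first 'model name' value from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model name":
            return value.strip()
    return None


def looks_like_host_cpu(model: Optional[str]) -> bool:
    """Heuristic: a model name is present and is not a known generic one."""
    return model is not None and not any(m in model for m in GENERIC_MARKERS)


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():  # only present on Linux
        model = cpu_model_name(cpuinfo.read_text())
        print(model, looks_like_host_cpu(model))
```

If the guest still reports a generic model name after the change, the VM probably needs a full stop and start rather than a reboot from inside the guest for the new processor type to apply.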