immich-app / immich

High performance self-hosted photo and video management solution.
https://immich.app
GNU Affero General Public License v3.0

Exception in ASGI application - [GPU] out of GPU resources #13267

Closed effsee00 closed 1 month ago

effsee00 commented 1 month ago

The bug

When Face Detection runs on a system configured for OpenVINO machine learning, the immich_machine_learning container eventually throws an exception reporting that it is out of GPU resources. The immich_machine_learning container does not recover.

This has been reproduced by triggering the "ALL" Face Detection job. The exception appears at an indeterminate time, after some face detections have already succeeded.

The OS that Immich Server is running on

DSM 7.2.1-69057 Update 5 | Docker version 20.10.23

Version of Immich Server

v1.117.0

Version of Immich Mobile App

v1.117.0 build.162

Platform with the issue

Your docker-compose.yml content

#
# WARNING: Make sure to use the docker-compose.yml of the current release:
#
# https://github.com/immich-app/immich/releases/latest/download/docker-compose.yml
#
# The compose file on main may not be compatible with the latest release.
#

name: immich

services:
  immich-server:
    container_name: immich_server
    image: ghcr.io/immich-app/immich-server:${IMMICH_VERSION:-release}
    # extends:
    #   file: hwaccel.transcoding.yml
    #   service: cpu # set to one of [nvenc, quicksync, rkmpp, vaapi, vaapi-wsl] for accelerated transcoding
    volumes:
      # Do not edit the next line. If you want to change the media storage location on your system, edit the value of UPLOAD_LOCATION in the stack.env file
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
      - /etc/localtime:/etc/localtime:ro
      - ${EXTERNAL_LOCATION}:/data/external:ro
    env_file:
      - stack.env
    ports:
      - 2283:3001
    depends_on:
      - redis
      - database
    restart: always
    healthcheck:
      disable: false

  immich-machine-learning:
    container_name: immich_machine_learning
    # For hardware acceleration, add one of -[armnn, cuda, openvino] to the image tag.
    # Example tag: ${IMMICH_VERSION:-release}-cuda
    image: ghcr.io/immich-app/immich-machine-learning:${IMMICH_VERSION:-release}-openvino
   # extends: # uncomment this section for hardware acceleration - see https://immich.app/docs/features/ml-hardware-acceleration
   #   file: hwaccel.ml.yml
   #   service: openvino # set to one of [armnn, cuda, openvino, openvino-wsl] for accelerated inference - use the `-wsl` version for WSL2 where applicable
    device_cgroup_rules:
      - 'c 189:* rmw'
    devices:
      - /dev/dri:/dev/dri
    volumes:
      - model-cache:/cache
      - /dev/bus/usb:/dev/bus/usb
      - ${EXTERNAL_LOCATION}:/data/external:ro
    env_file:
      - stack.env
    restart: always
    healthcheck:
      disable: false

  redis:
    container_name: immich_redis
    image: docker.io/redis:6.2-alpine@sha256:2d1463258f2764328496376f5d965f20c6a67f66ea2b06dc42af351f75248792
    healthcheck:
      test: redis-cli ping || exit 1
    restart: always

  database:
    container_name: immich_postgres
    image: docker.io/tensorchord/pgvecto-rs:pg14-v0.2.0@sha256:90724186f0a3517cf6914295b5ab410db9ce23190a2d9d0b9dd6463e3fa298f0
    environment:
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_USER: ${DB_USERNAME}
      POSTGRES_DB: ${DB_DATABASE_NAME}
      POSTGRES_INITDB_ARGS: '--data-checksums'
    volumes:
      # Do not edit the next line. If you want to change the database storage location on your system, edit the value of DB_DATA_LOCATION in the stack.env file
      - ${DB_DATA_LOCATION}:/var/lib/postgresql/data
    healthcheck:
      test: pg_isready --dbname='${DB_DATABASE_NAME}' --username='${DB_USERNAME}' || exit 1; Chksum="$$(psql --dbname='${DB_DATABASE_NAME}' --username='${DB_USERNAME}' --tuples-only --no-align --command='SELECT COALESCE(SUM(checksum_failures), 0) FROM pg_stat_database')"; echo "checksum failure count is $$Chksum"; [ "$$Chksum" = '0' ] || exit 1
      interval: 5m
#      start_interval: 30s
#      start_period: 5m
    command: ["postgres", "-c", "shared_preload_libraries=vectors.so", "-c", 'search_path="$$user", public, vectors', "-c", "logging_collector=on", "-c", "max_wal_size=2GB", "-c", "shared_buffers=512MB", "-c", "wal_compression=on"]
    restart: always

volumes:
  model-cache:

Your .env content

UPLOAD_LOCATION=/volume1/data/immich/library
EXTERNAL_LOCATION=/volume1/homes
DB_DATA_LOCATION=/volume2/docker/immich/postgres
TZ=Europe/London
IMMICH_VERSION=release
DB_PASSWORD=postgres
DB_USERNAME=postgres
DB_DATABASE_NAME=immich
IMMICH_LOG_LEVEL=debug
IMMICH_AUTO_CREATE_ALBUM=true
MACHINE_LEARNING_WORKER_TIMEOUT=600
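
When reproducing this, it can help to confirm that the settings above are the ones actually in effect. The following is a minimal sketch, assuming the file uses only plain `KEY=VALUE` lines as shown; `parse_env` is a hypothetical helper written for this illustration, not part of Immich, and it does not handle quoting or variable expansion.

```python
def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


# Two of the values from the .env content above.
env = parse_env(
    "IMMICH_LOG_LEVEL=debug\n"
    "MACHINE_LEARNING_WORKER_TIMEOUT=600\n"
)
debug_enabled = env.get("IMMICH_LOG_LEVEL") == "debug"
worker_timeout = int(env["MACHINE_LEARNING_WORKER_TIMEOUT"])
```

In practice the text would be read from the `stack.env` file referenced by the compose file rather than from an inline string.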

Reproduction steps

  1. Use an external library of ~8000 images / videos.
  2. Start the Face Detection job with "ALL".
  3. Wait for the exception.
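
Because the exception appears at an indeterminate time, step 3 can be easier to follow with a small script that scans saved machine-learning container output. This is a rough sketch, assuming timestamped lines of the form `[MM/DD/YY HH:MM:SS] LEVEL message` and the `[GPU] out of GPU resources` text from the issue title; `summarize` is a hypothetical helper, not part of Immich.

```python
import re
from datetime import datetime

# Assumed shape of a timestamped log line, e.g.
# "[10/07/24 16:07:46] ERROR    Exception in ASGI application".
STAMP = re.compile(r"^\[(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\]\s+(\w+)\s+(.*)$")
MARKER = "[GPU] out of GPU resources"


def summarize(log_text):
    """Count timestamped ERROR entries and note whether the GPU marker appears."""
    errors = 0
    first_error = None
    gpu_exhausted = False
    for line in log_text.splitlines():
        if MARKER in line:
            gpu_exhausted = True
        match = STAMP.match(line)
        if match and match.group(2) == "ERROR":
            errors += 1
            if first_error is None:
                first_error = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
    return {"errors": errors, "first_error": first_error, "gpu_exhausted": gpu_exhausted}


# A few lines in the same shape as the container output.
sample = "\n".join([
    "[10/07/24 16:07:42] INFO     Loading detection model 'antelopev2' to memory",
    "[GPU] out of GPU resources",
    "[10/07/24 16:07:46] ERROR    Exception in ASGI application",
    "[10/07/24 16:07:47] INFO     Attempt #2 to load recognition model 'antelopev2'",
    "[10/07/24 16:07:49] ERROR    Exception in ASGI application",
])
result = summarize(sample)
```

The text to scan could come from output saved with `docker logs immich_machine_learning`. Counting timestamped ERROR entries instead of marker lines avoids double counting, since the marker can appear both in the raw runtime message and inside the formatted traceback.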

Relevant log output

[10/07/24 16:07:31] DEBUG    Checking for inactivity...
[10/07/24 16:07:41] DEBUG    Checking for inactivity...
[10/07/24 16:07:42] DEBUG    Setting model format to onnx
[10/07/24 16:07:42] INFO     Loading detection model 'antelopev2' to memory
[10/07/24 16:07:42] DEBUG    Available ORT providers:
                             {'OpenVINOExecutionProvider',
                             'CPUExecutionProvider'}
[10/07/24 16:07:42] DEBUG    Available OpenVINO devices: ['CPU', 'GPU']
[10/07/24 16:07:42] INFO     Setting execution providers to
                             ['OpenVINOExecutionProvider',
[10/07/24 16:07:42] DEBUG    Setting execution provider options to
                             [{'device_type': 'GPU', 'precision': 'FP32',
                             'cache_dir':
                             '/cache/facial-recognition/antelopev2/detection/ope
                             nvino'}, {'arena_extend_strategy':
[10/07/24 16:07:42] DEBUG    Setting execution_mode to ORT_SEQUENTIAL
[10/07/24 16:07:42] DEBUG    Setting inter_op_num_threads to 0
[10/07/24 16:07:42] DEBUG    Setting intra_op_num_threads to 0
[10/07/24 16:07:43] DEBUG    Setting model format to onnx
[10/07/24 16:07:43] INFO     Loading recognition model 'antelopev2' to memory
[10/07/24 16:07:43] DEBUG    Available ORT providers:
                             {'OpenVINOExecutionProvider',
                             'CPUExecutionProvider'}
[10/07/24 16:07:43] DEBUG    Available OpenVINO devices: ['CPU', 'GPU']
[10/07/24 16:07:43] INFO     Setting execution providers to
                             ['OpenVINOExecutionProvider',
[10/07/24 16:07:43] DEBUG    Setting execution provider options to
                             [{'device_type': 'GPU', 'precision': 'FP32',
[10/07/24 16:07:43] DEBUG    Setting execution_mode to ORT_SEQUENTIAL
[10/07/24 16:07:43] DEBUG    Setting inter_op_num_threads to 0
[10/07/24 16:07:43] DEBUG    Setting intra_op_num_threads to 0
2024-10-07 16:07:46.107496327 [E:onnxruntime:, inference_session.cc:2105 operator()] Exception during initialization: /onnxruntime/onnxruntime/core/providers/openvino/ov_interface.cc:87 onnxruntime::openvino_ep::OVExeNetwork onnxruntime::openvino_ep::OVCore::CompileModel(std::shared_ptr<const ov::Model>&, std::string&, ov::AnyMap&, const std::string&) [OpenVINO-EP]  Exception while Loading Network for graph: OpenVINOExecutionProvider_OpenVINO-EP-subgraph_4_0Exception from src/inference/src/cpp/core.cpp:107:
Exception from src/inference/src/dev/plugin.cpp:53:
Exception from src/plugins/intel_gpu/src/runtime/ocl/ocl_engine.cpp:201:
[GPU] out of GPU resources

[10/07/24 16:07:46] ERROR    Exception in ASGI application

                             ╭─────── Traceback (most recent call last) ───────╮
                             │ /usr/src/app/main.py:152 in predict             │
                             │                                                 │
                             │   149 │   │   inputs = text                     │
                             │   150 │   else:                                 │
                             │   151 │   │   raise HTTPException(400, "Either  │
                             │ ❱ 152 │   response = await run_inference(inputs │
                             │   153 │   return ORJSONResponse(response)       │
                             │   154                                           │
                             │   155                                           │
                             │                                                 │
                             │ ╭────────────────── locals ───────────────────╮ │
                             │ │ entries = (                                 │ │
                             │ │           │   [                             │ │
                             │ │           │   │   {                         │ │
                             │ │           │   │   │   'name': 'antelopev2', │ │
                             │ │           │   │   │   'task':               │ │
                             │ │           'facial-recognition',             │ │
                             │ │           │   │   │   'type': 'detection',  │ │
                             │ │           │   │   │   'options': {          │ │
                             │ │           │   │   │   │   'minScore': 0.7   │ │
                             │ │           │   │   │   }                     │ │
                             │ │           │   │   }                         │ │
                             │ │           │   ],                            │ │
                             │ │           │   [                             │ │
                             │ │           │   │   {                         │ │
                             │ │           │   │   │   'name': 'antelopev2', │ │
                             │ │           │   │   │   'task':               │ │
                             │ │           'facial-recognition',             │ │
                             │ │           │   │   │   'type':               │ │
                             │ │           'recognition',                    │ │
                             │ │           │   │   │   'options': {}         │ │
                             │ │           │   │   }                         │ │
                             │ │           │   ]                             │ │
                             │ │           )                                 │ │
                             │ │   image = b'\xff\xd8\xff\xe2\x01\xf0ICC_PR… │ │
                             │ │           \x00\x00mntrRGB XYZ               │ │
                             │ │           \x07\xe2\x00\x03\x00\x14\x00\t\x… │ │
                             │ │  inputs = <PIL.JpegImagePlugin.JpegImageFi… │ │
                             │ │           image mode=RGB size=1915x1440 at  │ │
                             │ │           0x7F24F4665950>                   │ │
                             │ │    text = None                              │ │
                             │ ╰─────────────────────────────────────────────╯ │
                             │                                                 │
                             │ /usr/src/app/main.py:177 in run_inference       │
                             │ │                  │   │   │   'type':        │ │
                             │ │                  'detection',               │ │
                             │ │                  │   │   │   'options': {   │ │
                             │ │                  │   │   │   │              │ │
                             │ │                  'minScore': 0.7            │ │
                             │ │                  │   │   │   }              │ │
                             │ │                  │   │   }                  │ │
                             │ │                  │   }                      │ │
                             │ │                  ]                          │ │
                             │ │   without_deps = [                          │ │
                             │ │                  │   {                      │ │
                             │   197 │   │   │   raise HTTPException(500, f"Fa │
                             │   198 │   │   with lock:                        │
                             │   199 │   │   │   try:                          │
                             │   201 │   │   │   except FileNotFoundError as e │
                             │   202 │   │   │   │   if model.model_format ==  │
                             │   203 │   │   │   │   │   raise e               │
                             │                                                 │
                             │ ╭────────────────── locals ───────────────────╮ │
                             │ │ model = <app.models.facial_recognition.rec… │ │
                             │ │         object at 0x7f24f4a36e50>           │ │
                             │ ╰─────────────────────────────────────────────╯ │
                             │                                                 │
                             │ /usr/src/app/models/base.py:53 in load          │
                             │ │    self = <app.models.facial_recognition.r… │ │
                             │ │           object at 0x7f24f4a36e50>         │ │
                             │ ╰─────────────────────────────────────────────╯ │
                             │                                                 │
                             │ /usr/src/app/models/facial_recognition/recognit │
                             │ ion.py:28 in _load                              │
                             │                                                 │
                             │   25 │   │   self.batch = self.model_format ==  │
                             │   26 │                                          │
                             │   27 │   def _load(self) -> ModelSession:       │
                             │ ❱ 28 │   │   session = self._make_session(self. │
                             │   29 │   │   if self.batch and str(session.get_ │
                             │   30 │   │   │   self._add_batch_axis(self.mode │
                             │   31 │   │   │   session = self._make_session(s │
                             │                                                 │
                             │ ╭────────────────── locals ───────────────────╮ │
                             │ │ self = <app.models.facial_recognition.reco… │ │
                             │ │        object at 0x7f24f4a36e50>            │ │
                             │ ╰─────────────────────────────────────────────╯ │
                             │                                                 │
                             │ /usr/src/app/models/base.py:110 in              │
                             │ _make_session                                   │
                             │                                                 │
                             │   107 │   │   │   case ".armnn":                │
                             │   108 │   │   │   │   session: ModelSession = A │
                             │   109 │   │   │   case ".onnx":                 │
                             │ ❱ 110 │   │   │   │   session = OrtSession(mode │
                             │   111 │   │   │   case _:                       │
                             │   112 │   │   │   │   raise ValueError(f"Unsupp │
                             │   113 │   │   return session                    │
                             │                                                 │
                             │ ╭────────────────── locals ───────────────────╮ │
                             │ │ model_path = PosixPath('/cache/facial-reco… │ │
                             │ │       self = <app.models.facial_recognitio… │ │
                             │ ╭────────────────── locals ───────────────────╮ │
                             │ │       model_path = PosixPath('/cache/facia… │ │
                             │ │ provider_options = None                     │ │
                             │ │        providers = None                     │ │
                             │ │             self = <app.sessions.ort.OrtSe… │ │
                             │ │                    object at                │ │
                             │ │                    0x7f24f4a34c50>          │ │
                             │ │     sess_options = None                     │ │
                             │ ╰─────────────────────────────────────────────╯ │
                             │                                                 │
                             │ /opt/venv/lib/python3.11/site-packages/onnxrunt │
                             │ ime/capi/onnxruntime_inference_collection.py:41 │
                             │ 9 in __init__                                   │
                             │                                                 │
                             │    416 │   │   disabled_optimizers = kwargs.get │
                             │    417 │   │                                    │
                             │    418 │   │   try:                             │
                             │ ❱  419 │   │   │   self._create_inference_sessi │
                             │        disabled_optimizers)                     │
                             │    420 │   │   except (ValueError, RuntimeError │
                             │    421 │   │   │   if self._enable_fallback:    │
                             │    422 │   │   │   │   try:                     │
                             │                                                 │
                             │ ╭────────────────── locals ───────────────────╮ │
                             │ │ disabled_optimizers = None                  │ │
                             │ │              kwargs = {}                    │ │
                             │ │       path_or_bytes = '/cache/facial-recog… │ │
                             │ │    provider_options = [                     │ │
                             │ │                       │   {                 │ │
                             │ │                       │   │                 │ │
                             │ │                       'device_type': 'GPU', │ │
                             │ │                       │   │   'precision':  │ │
                             │ │                       'FP32',               │ │
                             │ │                       │   │   'cache_dir':  │ │
                             │ │                       '/cache/facial-recog… │ │
                             │ │                       │   },                │ │
                             │ │                       │   {                 │ │
                             │ │                       │   │                 │ │
                             │ │                       'arena_extend_strate… │ │
                             │ │                       'kSameAsRequested'    │ │
                             │ │                       │   }                 │ │
                             │ │                       ]                     │ │
                             │ │           providers = [                     │ │
                             │ │                       │                     │ │
                             │ │                       'OpenVINOExecutionPr… │ │
                             │ │                       │                     │ │
                             │ │                       'CPUExecutionProvide… │ │
                             │ │                       ]                     │ │
                             │ │                self = <onnxruntime.capi.on… │ │
                             │ │                       object at             │ │
                             │ │                       0x7f24f4a34d50>       │ │
                             │ │        sess_options = <onnxruntime.capi.on… │ │
                             │ │                       object at             │ │
                             │ │                       0x7f24f4664970>       │ │
                             │ ╰─────────────────────────────────────────────╯ │
                             │                                                 │
                             │ /opt/venv/lib/python3.11/site-packages/onnxrunt │
                             │ ime/capi/onnxruntime_inference_collection.py:49 │
                             │ 1 in _create_inference_session                  │
                             │                                                 │
                             │    488 │   │   │   disabled_optimizers = set(di │
                             │    489 │   │                                    │
                             │    490 │   │   # initialize the C++ InferenceSe │
                             │ ❱  491 │   │   sess.initialize_session(provider │
                             │    492 │   │                                    │
                             │    493 │   │   self._sess = sess                │
                             │    494 │   │   self._sess_options = self._sess. │
                             │                                                 │
                             │ ╭────────────────── locals ───────────────────╮ │
                             │ │ available_providers = [                     │ │
                             │ │                       │                     │ │
                             │ │                       'OpenVINOExecutionPr… │ │
                             │ │                       │                     │ │
                             │ │                       'CPUExecutionProvide… │ │
                             │ │                       ]                     │ │
                             │ │ disabled_optimizers = set()                 │ │
                             │ │    provider_options = [                     │ │
                             │ │                       │   {                 │ │
                             │ │                       │   │                 │ │
                             │ │                       'device_type': 'GPU', │ │
                             │ │                       │   │   'precision':  │ │
                             │ │                       'FP32',               │ │
                             │ │                       │   │   'cache_dir':  │ │
                             │ │                       '/cache/facial-recog… │ │
                             │ │                       │   },                │ │
                             │ │                       │   {                 │ │
                             │ │                       │   │                 │ │
                             │ │                       'arena_extend_strate… │ │
                             │ │                       'kSameAsRequested'    │ │
                             │ │                       │   }                 │ │
                             │ │                       ]                     │ │
                             │ │           providers = [                     │ │
                             │ │                       │                     │ │
                             │ │                       'OpenVINOExecutionPr… │ │
                             │ │                       │                     │ │
                             │ │                       'CPUExecutionProvide… │ │
                             │ │                       ]                     │ │
                             │ │                self = <onnxruntime.capi.on… │ │
                             │ │                       object at             │ │
                             │ │                       0x7f24f4a34d50>       │ │
                             │ │                sess = <onnxruntime.capi.on… │ │
                             │ │                       object at             │ │
                             │ │                       0x7f24f46659f0>       │ │
                             │ │     session_options = <onnxruntime.capi.on… │ │
                             │ │                       object at             │ │
                             │ │                       0x7f24f4664970>       │ │
                             │ ╰─────────────────────────────────────────────╯ │
                             ╰─────────────────────────────────────────────────╯
                             RuntimeException: [ONNXRuntimeError] : 6 :
                             RUNTIME_EXCEPTION : Exception during
                             initialization:
                             /onnxruntime/onnxruntime/core/providers/openvino/ov
                             _interface.cc:87
                             onnxruntime::openvino_ep::OVExeNetwork
                             onnxruntime::openvino_ep::OVCore::CompileModel(std:
                             :shared_ptr<const ov::Model>&, std::string&,
                             ov::AnyMap&, const std::string&) [OpenVINO-EP]
                             Exception while Loading Network for graph:
                             OpenVINOExecutionProvider_OpenVINO-EP-subgraph_4_0E
                             xception from src/inference/src/cpp/core.cpp:107:
                             Exception from src/inference/src/dev/plugin.cpp:53:
                             Exception from
                             src/plugins/intel_gpu/src/runtime/ocl/ocl_engine.cp
                             p:201:
                             [GPU] out of GPU resources

[10/07/24 16:07:47] INFO     Attempt #2 to load recognition model 'antelopev2'
                             to memory
[10/07/24 16:07:47] DEBUG    Available ORT providers:
                             {'OpenVINOExecutionProvider',
                             'CPUExecutionProvider'}
[10/07/24 16:07:47] DEBUG    Available OpenVINO devices: ['CPU', 'GPU']
[10/07/24 16:07:47] INFO     Setting execution providers to
                             ['OpenVINOExecutionProvider',
                             'CPUExecutionProvider'], in descending order of
[10/07/24 16:07:47] DEBUG    Setting execution provider options to
                             [{'device_type': 'GPU', 'precision': 'FP32',
[10/07/24 16:07:47] DEBUG    Setting execution_mode to ORT_SEQUENTIAL
[10/07/24 16:07:47] DEBUG    Setting inter_op_num_threads to 0
[10/07/24 16:07:47] DEBUG    Setting intra_op_num_threads to 0
2024-10-07 16:07:49.566109514 [E:onnxruntime:, inference_session.cc:2105 operator()] Exception during initialization: /onnxruntime/onnxruntime/core/providers/openvino/ov_interface.cc:87 onnxruntime::openvino_ep::OVExeNetwork onnxruntime::openvino_ep::OVCore::CompileModel(std::shared_ptr<const ov::Model>&, std::string&, ov::AnyMap&, const std::string&) [OpenVINO-EP]  Exception while Loading Network for graph: OpenVINOExecutionProvider_OpenVINO-EP-subgraph_5_0Exception from src/inference/src/cpp/core.cpp:107:
Exception from src/inference/src/dev/plugin.cpp:53:
Exception from src/plugins/intel_gpu/src/runtime/ocl/ocl_engine.cpp:201:
[GPU] out of GPU resources

[10/07/24 16:07:49] ERROR    Exception in ASGI application

                             ╭─────── Traceback (most recent call last) ───────╮
                             │ /usr/src/app/main.py:152 in predict             │
                             │                                                 │
                             │   149 │   │   inputs = text                     │
                             │   150 │   else:                                 │
                             │   151 │   │   raise HTTPException(400, "Either  │
                             │ ❱ 152 │   response = await run_inference(inputs │
                             │   153 │   return ORJSONResponse(response)       │
                             │   154                                           │
                             │   155                                           │
                             │                                                 │
                             │ ╭────────────────── locals ───────────────────╮ │
                             │ │ entries = (                                 │ │
                             │ │           │   [                             │ │
                             │ │           │   │   {                         │ │
                             │ │           │   │   │   'name': 'antelopev2', │ │
                             │ │           │   │   │   'task':               │ │
                             │ │           'facial-recognition',             │ │
                             │ │           │   │   │   'type': 'detection',  │ │
                             │ │           │   │   │   'options': {          │ │
                             │ │           │   │   │   │   'minScore': 0.7   │ │
                             │ │           │   │   │   }                     │ │
                             │ │           │   │   }                         │ │
                             │ │           │   ],                            │ │
                             │ │           │   [                             │ │
                             │ │           │   │   {                         │ │
                             │ │           │   │   │   'name': 'antelopev2', │ │
                             │ │           │   │   │   'task':               │ │
                             │ │           'facial-recognition',             │ │
                             │ │           │   │   │   'type':               │ │
                             │ │           'recognition',                    │ │
                             │ │           │   │   │   'options': {}         │ │
                             │ │           │   │   }                         │ │
                             │ │           │   ]                             │ │
                             │ │           )                                 │ │
                             │ │   image = b'\xff\xd8\xff\xe2\x01\xf0ICC_PR… │ │
                             │ │           \x00\x00mntrRGB XYZ               │ │
                             │ │           \x07\xe2\x00\x03\x00\x14\x00\t\x… │ │
                             │ │  inputs = <PIL.JpegImagePlugin.JpegImageFi… │ │
                             │ │           image mode=RGB size=1915x1440 at  │ │
                             │ │           0x7F24F4B99950>                   │ │
                             │ │    text = None                              │ │
                             │ ╰─────────────────────────────────────────────╯ │
                             │                                                 │
                             │ /usr/src/app/main.py:177 in run_inference       │
                             │                                                 │
                             │   174 │   without_deps, with_deps = entries     │
                             │   178 │   if isinstance(payload, Image):        │
                             │   179 │   │   response["imageHeight"], response │
                             │   180                                           │
                             │ ╭────────────────── locals ───────────────────╮ │
                             │ │ _run_inference = <function                  │ │
                             │ │                  run_inference.<locals>._r… │ │
                             │ │                  at 0x7f24f4c009a0>         │ │
                             │ │        entries = (                          │ │
                             │ │                  │   [                      │ │
                             │ │                  │   │   {                  │ │
                             │ │                  │   │   │   'name':        │ │
                             │ │                  'antelopev2',              │ │
                             │ │                  │   │   │   'task':        │ │
                             │ │         func = <function                    │ │
                             │ │                load.<locals>._load at       │ │
                             │ │                0x7f24f4c03880>              │ │
                             │ │       kwargs = {}                           │ │
                             │ │ partial_func = functools.partial(<function  │ │
                             │ │                load.<locals>._load at       │ │
                             │ │                0x7f24f4c03880>,             │ │
                             │ │                <app.models.facial_recognit… │ │
                             │ │                object at 0x7f24f4a36e50>)   │ │
                             │ ╰─────────────────────────────────────────────╯ │
                             │                                                 │
                             │   201 │   │   │   except FileNotFoundError as e │
                             │   202 │   │   │   │   if model.model_format ==  │
                             │   203 │   │   │   │   │   raise e               │
                             │                                                 │
                             │ ╭────────────────── locals ───────────────────╮ │
                             │ │         object at 0x7f24f4a36e50>           │ │
                             │ ╰─────────────────────────────────────────────╯ │
                             │                                                 │
                             │ /usr/src/app/models/base.py:53 in load          │
                             │                                                 │
                             │    50 │   │   self.download()                   │
                             │    51 │   │   attempt = f"Attempt #{self.load_a │
                             │       else "Loading"                            │
                             │    52 │   │   log.info(f"{attempt} {self.model_ │
                             │       '{self.model_name}' to memory")           │
                             │ ❱  53 │   │   self.session = self._load()       │
                             │    54 │   │   self.loaded = True                │
                             │    55 │                                         │
                             │    56 │   def predict(self, *inputs: Any, **mod │
                             │                                                 │
                             │ ╭────────────────── locals ───────────────────╮ │
                             │ │ attempt = 'Attempt #2 to load'              │ │
                             │ │    self = <app.models.facial_recognition.r… │ │
                             │ │           object at 0x7f24f4a36e50>         │ │
                             │ ╰─────────────────────────────────────────────╯ │
                             │                                                 │
                             │ /usr/src/app/models/facial_recognition/recognit │
                             │ ion.py:28 in _load                              │
                             │                                                 │
                             │   25 │   │   self.batch = self.model_format ==  │
                             │   26 │                                          │
                             │   27 │   def _load(self) -> ModelSession:       │
                             │ ❱ 28 │   │   session = self._make_session(self. │
                             │   29 │   │   if self.batch and str(session.get_ │
                             │   30 │   │   │   self._add_batch_axis(self.mode │
                             │   31 │   │   │   session = self._make_session(s │
                             │                                                 │
                             │ ╭────────────────── locals ───────────────────╮ │
                             │ │ self = <app.models.facial_recognition.reco… │ │
                             │ │        object at 0x7f24f4a36e50>            │ │
                             │ ╰─────────────────────────────────────────────╯ │
                             │                                                 │
                             │ /usr/src/app/models/base.py:110 in              │
                             │ _make_session                                   │
                             │                                                 │
                             │   107 │   │   │   case ".armnn":                │
                             │   108 │   │   │   │   session: ModelSession = A │
                             │   109 │   │   │   case ".onnx":                 │
                             │ ❱ 110 │   │   │   │   session = OrtSession(mode │
                             │   111 │   │   │   case _:                       │
                             │   112 │   │   │   │   raise ValueError(f"Unsupp │
                             │   113 │   │   return session                    │
                             │                                                 │
                             │ ╭────────────────── locals ───────────────────╮ │
                             │ │ model_path = PosixPath('/cache/facial-reco… │ │
                             │ │       self = <app.models.facial_recognitio… │ │
                             │ │              object at 0x7f24f4a36e50>      │ │
                             │ ╰─────────────────────────────────────────────╯ │
                             │                                                 │
                             │ /usr/src/app/sessions/ort.py:28 in __init__     │
                             │                                                 │
                             │    25 │   │   self.providers = providers if pro │
                             │    26 │   │   self.provider_options = provider_ │
                             │       self._provider_options_default            │
                             │    27 │   │   self.sess_options = sess_options  │
                             │       self._sess_options_default                │
                             │ ❱  28 │   │   self.session = ort.InferenceSessi │
                             │    29 │   │   │   self.model_path.as_posix(),   │
                             │    30 │   │   │   providers=self.providers,     │
                             │    31 │   │   │   provider_options=self.provide │
                             │                                                 │
                             │ ╭────────────────── locals ───────────────────╮ │
                             │ │       model_path = PosixPath('/cache/facia… │ │
                             │ │ provider_options = None                     │ │
                             │ │        providers = None                     │ │
                             │ │             self = <app.sessions.ort.OrtSe… │ │
                             │ │                    object at                │ │
                             │ │                    0x7f24f4b9b010>          │ │
                             │ │     sess_options = None                     │ │
                             │ ╰─────────────────────────────────────────────╯ │
                             │                                                 │
                             │ /opt/venv/lib/python3.11/site-packages/onnxrunt │
                             │ ime/capi/onnxruntime_inference_collection.py:41 │
                             │ 9 in __init__                                   │
                             │                                                 │
                             │    416 │   │   disabled_optimizers = kwargs.get │
                             │    417 │   │                                    │
                             │    418 │   │   try:                             │
                             │ ❱  419 │   │   │   self._create_inference_sessi │
                             │        disabled_optimizers)                     │
                             │    420 │   │   except (ValueError, RuntimeError │
                             │    421 │   │   │   if self._enable_fallback:    │
                             │    422 │   │   │   │   try:                     │
                             │ ╭────────────────── locals ───────────────────╮ │
                             │ │ disabled_optimizers = None                  │ │
                             │ │              kwargs = {}                    │ │
                             │ │       path_or_bytes = '/cache/facial-recog… │ │
                             │ │    provider_options = [                     │ │
                             │ │                       │   {                 │ │
                             │ │                       │   │                 │ │
                             │ │                       'device_type': 'GPU', │ │
                             │ │                       │   │   'precision':  │ │
                             │ │                       'FP32',               │ │
                             │ │                       │   │   'cache_dir':  │ │
                             │ │                       '/cache/facial-recog… │ │
                             │ │                       │   {                 │ │
                             │ │                       │   │                 │ │
[10/07/24 16:07:51] DEBUG    Checking for inactivity...
[10/07/24 16:08:01] DEBUG    Checking for inactivity...

Additional information

Host hardware: Synology DS423+ | CPU: Intel Celeron J4125 | Physical memory: 18 GB

AdoKevin commented 1 month ago

The same issue occurred during face detection on my device, which has an Intel N5095 CPU. After I changed the model from buffalo_l to buffalo_m, the exception stopped. I believe this is caused by the poor performance of my integrated GPU.

One more thing to mention: when this exception was thrown, the undo queue got stuck, but the ML container was still reported as healthy. Perhaps some exception handling logic should be investigated.
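
To illustrate the exception handling point, here is a minimal sketch of a health state that follows inference failures. It is not Immich's actual code. `HealthTracker`, `max_consecutive_failures`, and `run` are hypothetical names made up for this example, and the threshold of 3 is an arbitrary assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class HealthTracker:
    """Counts consecutive inference failures (hypothetical helper, not Immich code)."""

    # Assumed threshold: number of failures in a row before reporting unhealthy.
    max_consecutive_failures: int = 3
    consecutive_failures: int = field(default=0, init=False)

    @property
    def healthy(self) -> bool:
        # Healthy as long as the failure streak is below the threshold.
        return self.consecutive_failures < self.max_consecutive_failures

    def run(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        """Call fn, record whether it raised, and re-raise any exception."""
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.consecutive_failures += 1
            raise
        # A successful call resets the streak.
        self.consecutive_failures = 0
        return result
```

A health endpoint could then return an error status whenever `tracker.healthy` is false, so that a container stuck on repeated "out of GPU resources" errors no longer reports as healthy.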