immich-app / immich

High performance self-hosted photo and video management solution.
https://immich.app
GNU Affero General Public License v3.0

Seems like immich does not handle HTTP 500 responses coming from the machine learning container properly #12876

Open sarunas-zilinskas opened 1 day ago

sarunas-zilinskas commented 1 day ago

The bug

Hi, first of all immich is an amazing platform for storing images, period! Thanks to everyone involved!

I have noticed an issue with immich job queuing: if the machine learning container fails, but not catastrophically, immich does not seem to handle the 500 responses coming back from it. Here's what happens, step by step:

  1. Immich server starts running the face detection job

  2. Machine learning container fails (in my case I presume a memory leak, because after the container has been running for a while I get the error "[GPU] out of GPU resources", and this happens after processing several hundred photos. I am not 100% certain whether it is really a memory leak or whether some photo is heavy enough to exhaust the GPU resources; either way, that is out of scope of the issue I am reporting here and is a whole separate problem.)

    image
  3. ML container fails, but not completely. It returns HTTP 500 for every request from the server, which is logged on the server:

    image

    Side note: OpenVINO should fall back to CPU processing, but that does not seem to happen. Once again this is out of scope of this issue, but I am including it to help understand the situation better.

  4. In the UI, the Immich server shows the number of assets in the queue decreasing far too fast until it hits 0. Side note: I definitely pressed the "all" button on face detection. Draining this quickly indicates that face detection is not actually being performed.

  5. The Immich server "flags" the assets as processed even though they have not been, because the ML container kept returning 500s for the remaining requests. The queue drops to 0, yet no faces are available in the UI:

    image

I have reproduced this twice with the same result. The first time, some faces were processed (I assume the photos handled before the out-of-GPU-resources error occurred); the second time, no faces were available at all.
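
To illustrate what I would expect instead, here is a minimal sketch (hypothetical names, not immich's actual code, assuming a Node 18+ style fetch) of a job handler that treats a non-2xx reply from the ML container as a job failure, so the asset stays unprocessed and can be retried, instead of being counted as done:

```typescript
// Hypothetical sketch only; endpoint, payload, and function names are made up
// for illustration and do not reflect immich's real internals.

type JobResult = 'success' | 'failed';

async function handleFaceDetectionJob(assetId: string, mlUrl: string): Promise<JobResult> {
  const response = await fetch(`${mlUrl}/predict`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ assetId }), // placeholder payload
  });

  if (!response.ok) {
    // An HTTP 500 from the ML container should NOT mark the asset as processed.
    // Returning 'failed' keeps it out of the "done" count so the queue can retry it.
    console.error(`ML request for asset ${assetId} failed with HTTP ${response.status}`);
    return 'failed';
  }

  const faces = await response.json();
  console.log(`Got ${Array.isArray(faces) ? faces.length : 0} face(s) for asset ${assetId}`);
  // ...persist the detected faces here...
  return 'success';
}
```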

The OS that Immich Server is running on

Debian 11 - docker

Version of Immich Server

v1.115

Version of Immich Mobile App

not relevant

Platform with the issue

Your docker-compose.yml content

  immich-server:
    image: ghcr.io/immich-app/immich-server:release
    container_name: immich_server
    hostname: *redacted*
    restart: always
    cpu_shares: 1024
    mem_limit: 10g
    devices:
      - /dev/dri:/dev/dri # For HW acc
    ports:
      - 2283:3001
    volumes:
      - *redacted*:/usr/src/app/upload
      - *redacted*:*redacted*:ro
      - /etc/localtime:/etc/localtime:ro
    environment:
      - TZ=*redacted*
      - UPLOAD_LOCATION=*redacted*
      - IMMICH_VERSION=release
      - DB_PASSWORD=*redacted*
      - DB_HOSTNAME=*redacted*
      - DB_USERNAME=*redacted*
      - DB_DATABASE_NAME=*redacted*
      - REDIS_HOSTNAME=*redacted*
    depends_on:
      - immich-redis
      - immich-database

  immich-machine-learning:
    image: ghcr.io/immich-app/immich-machine-learning:release-openvino
    container_name: immich_machine_learning
    hostname: *redacted*
    restart: always
    cpu_shares: 1024
    mem_limit: 20g
    # HW accel /*
    device_cgroup_rules:
      - 'c 189:* rmw'
    devices:
      - /dev/dri:/dev/dri
    # HW accel */
    volumes:
      - /dev/bus/usb:/dev/bus/usb # for HW accel
      - *redacted*/model-cache:/cache
    environment:
      - TZ=*redacted*
      - UPLOAD_LOCATION=*redacted*
      - IMMICH_VERSION=release
      - DB_PASSWORD=*redacted*
      - DB_HOSTNAME=*redacted*
      - DB_USERNAME=*redacted*
      - DB_DATABASE_NAME=*redacted*
      - REDIS_HOSTNAME=*redacted*

  immich-redis:
    image: docker.io/redis:6.2-alpine@sha256:328fe6a5822256d065debb36617a8169dbfbd77b797c525288e465f56c1d392b
    container_name: immich_redis
    hostname: *redacted*
    restart: always
    cpu_shares: 1024
    mem_limit: 3g
    healthcheck:
      test: redis-cli ping || exit 1

  immich-database:
    image: docker.io/tensorchord/pgvecto-rs:pg14-v0.2.0@sha256:90724186f0a3517cf6914295b5ab410db9ce23190a2d9d0b9dd6463e3fa298f0
    container_name: immich_postgres
    hostname: *redacted*
    restart: always
    cpu_shares: 1024
    mem_limit: 4g
    environment:
      - TZ=*redacted*
      - UPLOAD_LOCATION=*redacted*
      - IMMICH_VERSION=release
      - DB_PASSWORD=*redacted*
      - DB_HOSTNAME=*redacted*
      - DB_USERNAME=*redacted*
      - DB_DATABASE_NAME=*redacted*
      - REDIS_HOSTNAME=*redacted*
      - POSTGRES_PASSWORD=*redacted*
      - POSTGRES_USER=*redacted*
      - POSTGRES_DB=*redacted*
      - POSTGRES_INITDB_ARGS='--data-checksums'
    volumes:
      - *redacted*:/var/lib/postgresql/data
    command: ["postgres", "-c" ,"shared_preload_libraries=vectors.so", "-c", 'search_path="$$user", public, vectors', "-c", "l>

Your .env content

Not used

Reproduction steps

  1. Make the ML container respond with HTTP 500 (see the stub sketched below this list).
  2. Watch the Immich server ignore the 500s: the queue drains to 0, but no faces are actually processed.
  3. ...
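
For step 1, instead of waiting for the GPU to actually run out of resources, one can point the server's machine-learning URL at a stub that always answers 500 (this is an assumption on my side: the URL would need to be changed in the admin machine-learning settings, and the snippet needs Node 18+):

```typescript
// Tiny stub that answers every request with HTTP 500, simulating the broken
// ML container. Point immich's machine-learning URL at http://<host>:3003
// to reproduce the behaviour described above.
import { createServer } from 'node:http';

createServer((req, res) => {
  console.log(`${req.method} ${req.url} -> 500`);
  res.writeHead(500, { 'Content-Type': 'application/json' });
  res.end(JSON.stringify({ message: 'simulated ML failure' }));
}).listen(3003, () => console.log('Fake ML endpoint listening on :3003'));
```

With the stub in place, re-queueing face detection should show the same behaviour: the queue drains to 0 while no faces are added.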

Relevant log output

No response

Additional information

I guess this is relevant to smart search as well, but I did not test it. This discussion also seems related: https://github.com/immich-app/immich/discussions/6347

bo0tzz commented 1 day ago

Thanks for the report! I think this is basically the same as #11981, right?

sarunas-zilinskas commented 15 hours ago

Did some further investigation and here's what I found. I started face detection again with the GPU-enabled ML container.

Screenshot 2024-09-24 at 00 44 09

From these metrics you can see that memory usage keeps climbing; it jumps every time the GPU runs out of resources, which looks like a memory leak to me. The CPU graph sits at 0% while the container idles after the GPU error; I guess it tries to restart but fails because the allocated memory is never released. Looking at I/O usage, there is a sudden spike in reads when the model is loaded into memory, but that happens only once; when it tries to reload the model it should presumably switch to the CPU, yet for some reason it does not. After being stuck in this limbo for a while, network RX increases again as the container receives assets for processing, but it throws back 500s and the graph climbs much faster than before. This implies the assets are counted as "processed" when in fact they are not, and when I eventually checked my library, a lot of faces were missing.
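
For anyone who wants to double-check the memory growth on their own setup, here is a rough sketch (assuming the ML container is named immich_machine_learning as in my compose file, and that the Docker CLI is on the PATH) that samples docker stats every 30 seconds:

```typescript
// Rough sketch: print the ML container's memory usage every 30 seconds via
// `docker stats`, so the growth after each "out of GPU resources" error is visible.
import { execFile } from 'node:child_process';

const CONTAINER = 'immich_machine_learning';

function sample(): void {
  execFile(
    'docker',
    ['stats', '--no-stream', '--format', '{{.Name}}: {{.MemUsage}}', CONTAINER],
    (error, stdout) => {
      if (error) {
        console.error(`docker stats failed: ${error.message}`);
        return;
      }
      console.log(`${new Date().toISOString()} ${stdout.trim()}`);
    },
  );
}

sample();
setInterval(sample, 30_000);
```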

Just to add some technical info: I am using an integrated GPU (Intel UHD 605), and the host has 32 GB of RAM, so there is plenty of memory available. There are a bunch of other services/containers running, which may skew the results somewhat, but I checked Grafana for host monitoring and usage does not even get close to 50%:

image

The annotations are for reference: the first marks when I restarted both containers (server and ML) and started running face detection, and the second marks when I got the "out of GPU resources" error.