immich-app / immich

High performance self-hosted photo and video management solution.
https://immich.app
GNU Affero General Public License v3.0

Microservices container hangs up with a certain constellation of transcoding settings and hardware #9939

Open mshpp opened 1 month ago

mshpp commented 1 month ago

The bug

Immich is running in Docker, using a Proxmox LXC as host. Hardware acceleration is turned on and set up correctly. The CPU is AMD GX-415GA with Radeon HD8330E as the GPU.

Running "transcode all" with a certain combination of settings causes the microservices container to hang on a particular video. At that point, neither the CPU nor the GPU shows the load expected during transcoding (htop shows low utilization, so nothing appears to be running). The Docker container can't be killed from inside the LXC, and the LXC can't be killed from the PVE host. The only way to recover is to reboot the node. After rebooting, everything works fine until a transcoding job is started again, at which point the hang recurs.
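Processes that cannot be killed are often stuck in uninterruptible sleep (state D in /proc), typically while waiting on a kernel driver. Whether that is what happens here is an assumption, not something the report confirms. A minimal Python sketch for listing such processes from inside the LXC or on the PVE host could look like this (the function names are made up for illustration):

```python
import os


def parse_stat_line(line: str) -> tuple[int, str, str]:
    """Split one /proc/<pid>/stat line into (pid, command name, state)."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the last ')' rather than whitespace.
    head, _, tail = line.rpartition(")")
    pid_text, _, comm = head.partition("(")
    state = tail.split()[0]
    return int(pid_text), comm, state


def uninterruptible_processes(proc_root: str = "/proc") -> list[tuple[int, str]]:
    """Return (pid, name) for every process currently in state 'D'."""
    if not os.path.isdir(proc_root):
        return []  # not a Linux procfs; nothing to report
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except OSError:
            continue  # the process exited while we were scanning
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in uninterruptible_processes():
        print(pid, comm)
```

If ffmpeg or a related process shows up in this list during the hang, that would point toward the kernel or driver side rather than toward Immich itself.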

Transcoding configuration JSON:

{
  "ffmpeg": {
    "crf": 23,
    "threads": 3,
    "preset": "medium",
    "targetVideoCodec": "h264",
    "acceptedVideoCodecs": [
      "h264"
    ],
    "targetAudioCodec": "aac",
    "acceptedAudioCodecs": [
      "aac",
      "mp3",
      "libopus"
    ],
    "targetResolution": "480",
    "maxBitrate": "2000",
    "bframes": -1,
    "refs": 0,
    "gopSize": 0,
    "npl": 0,
    "temporalAQ": false,
    "cqMode": "auto",
    "twoPass": false,
    "preferredHwDevice": "auto",
    "transcode": "bitrate",
    "tonemap": "hable",
    "accel": "vaapi"
  },

A transcoding job has already run before without problems. From memory, the config differences were:

Preset: faster instead of medium
Max bitrate: unset instead of 2000
Threads: unset instead of 3
Transcode policy: "only videos not in an accepted format" instead of "Videos higher than max bitrate or not in an accepted format"
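One way to narrow down which of these differences matters is to start from the settings that worked and apply one change at a time. The Python sketch below only generates the candidate settings; it does not talk to Immich. The values come from the list above, `None` stands for "unset", and the policy value "required" for the earlier run is an assumption about how that option is named in the config.

```python
# Settings of the run that worked, as remembered in the report.
# None means "unset"; "required" is an assumed name for the policy
# "only videos not in an accepted format".
known_good = {
    "preset": "faster",
    "maxBitrate": None,
    "threads": None,
    "transcode": "required",
}

# The same keys as they appear in the configuration JSON that hangs.
failing = {
    "preset": "medium",
    "maxBitrate": "2000",
    "threads": 3,
    "transcode": "bitrate",
}


def one_change_at_a_time(good: dict, bad: dict):
    """Yield (key, candidate) pairs; each candidate differs from good in one key."""
    for key in sorted(bad):
        if good.get(key) != bad[key]:
            candidate = dict(good)  # copy so the baseline stays untouched
            candidate[key] = bad[key]
            yield key, candidate


if __name__ == "__main__":
    for key, candidate in one_change_at_a_time(known_good, failing):
        print(key, candidate)
```

Since every failed attempt requires a reboot of the node, testing the candidates on a single short video would keep each round cheap.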

I manually modified the docker-compose file to store the Postgres DB in ./pgdata. This was before the breaking change to docker-compose.yml, and I have not run docker compose pull since then.

The log line pasted below is the last one shown before the container hangs. As mentioned before, nothing seems to actually be transcoded.

The OS that Immich Server is running on

Proxmox VE 8.1.5, LXC container with Debian 12

Version of Immich Server

v1.105.1

Version of Immich Mobile App

v1.105.1

Platform with the issue

Your docker-compose.yml content

version: '3.8'

#
# WARNING: Make sure to use the docker-compose.yml of the current release:
#
# https://github.com/immich-app/immich/releases/latest/download/docker-compose.yml
#
# The compose file on main may not be compatible with the latest release.
#

name: immich

services:
  immich-server:
    container_name: immich_server
    image: ghcr.io/immich-app/immich-server:${IMMICH_VERSION:-release}
    command: ['start.sh', 'immich']
    volumes:
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
      - /etc/localtime:/etc/localtime:ro
    env_file:
      - .env
    ports:
      - 2283:3001
    depends_on:
      - redis
      - database
    restart: always

  immich-microservices:
    container_name: immich_microservices
    image: ghcr.io/immich-app/immich-server:${IMMICH_VERSION:-release}
    extends: # uncomment this section for hardware acceleration - see https://immich.app/docs/features/hardware-transcoding
      file: hwaccel.transcoding.yml
      service: vaapi # set to one of [nvenc, quicksync, rkmpp, vaapi, vaapi-wsl] for accelerated transcoding
    command: ['start.sh', 'microservices']
    volumes:
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
      - /etc/localtime:/etc/localtime:ro
    env_file:
      - .env
    depends_on:
      - redis
      - database
    restart: always

  immich-machine-learning:
    container_name: immich_machine_learning
    # For hardware acceleration, add one of -[armnn, cuda, openvino] to the image tag.
    # Example tag: ${IMMICH_VERSION:-release}-cuda
    image: ghcr.io/immich-app/immich-machine-learning:${IMMICH_VERSION:-release}
    # extends: # uncomment this section for hardware acceleration - see https://immich.app/docs/features/ml-hardware-acceleration
    #   file: hwaccel.ml.yml
    #   service: cpu # set to one of [armnn, cuda, openvino, openvino-wsl] for accelerated inference - use the `-wsl` version for WSL2 where applicable
    volumes:
      - model-cache:/cache
    env_file:
      - .env
    restart: always

  redis:
    container_name: immich_redis
    image: registry.hub.docker.com/library/redis:6.2-alpine@sha256:51d6c56749a4243096327e3fb964a48ed92254357108449cb6e23999c37773c5
    restart: always

  database:
    container_name: immich_postgres
    image: registry.hub.docker.com/tensorchord/pgvecto-rs:pg14-v0.2.0@sha256:90724186f0a3517cf6914295b5ab410db9ce23190a2d9d0b9dd6463e3fa298f0
    environment:
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_USER: ${DB_USERNAME}
      POSTGRES_DB: ${DB_DATABASE_NAME}
    volumes:
      - ./pgdata:/var/lib/postgresql/data
    restart: always

volumes:
  pgdata:
  model-cache:

Your .env content

# You can find documentation for all the supported env variables at https://immich.app/docs/insta>

# The location where your uploaded files are stored
UPLOAD_LOCATION=./library

# The Immich version to use. You can pin this to a specific version like "v1.71.0"
IMMICH_VERSION=release

# Connection secret for postgres. You should change it to a random password
DB_PASSWORD=[PASSWORD]

# The values below this line do not need to be changed
###################################################################################
DB_HOSTNAME=immich_postgres
DB_USERNAME=postgres
DB_DATABASE_NAME=immich

REDIS_HOSTNAME=immich_redis

Reproduction steps

1. Enable VAAPI in LXC, create the necessary permissions for the container

lxc.cgroup2.devices.allow: c 226:128 rwm
lxc.mount.entry: /dev/dri/renderD128 dev/dri/renderD128 none bind,optional,create=file
lxc.hook.pre-start: sh -c "chown 100000:111000 /dev/dri/renderD128"
lxc.cgroup2.devices.allow: c 235:0 rwm
lxc.mount.entry: /dev/kfd dev/kfd none bind,optional,create=file
lxc.hook.pre-start: sh -c "chown 100000:111000 /dev/kfd"
2. Start a transcoding job with the settings above
3. The docker container locks up completely, PVE node needs to be restarted
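The device numbers in the allow lines of step 1 (226:128 for renderD128, 235:0 for /dev/kfd) can differ between hosts, so it may be worth confirming that the nodes visible inside the container match them. The following Python sketch is one way to do that. The helper names are hypothetical, and the expected numbers are copied from the LXC config above.

```python
import os
import stat

# Device numbers copied from the lxc.cgroup2.devices.allow lines above.
EXPECTED = {
    "/dev/dri/renderD128": (226, 128),
    "/dev/kfd": (235, 0),
}


def device_numbers(path: str):
    """Return (major, minor) if path is a character device, else None."""
    try:
        st = os.stat(path)
    except OSError:
        return None  # the path does not exist or is not accessible
    if not stat.S_ISCHR(st.st_mode):
        return None
    return os.major(st.st_rdev), os.minor(st.st_rdev)


def check_devices(expected=EXPECTED) -> dict:
    """Map each path to 'ok', a 'missing' note, or a mismatch description."""
    report = {}
    for path, want in expected.items():
        got = device_numbers(path)
        if got is None:
            report[path] = "missing or not a character device"
        elif got != want:
            report[path] = (
                f"mismatch: found {got[0]}:{got[1]}, expected {want[0]}:{want[1]}"
            )
        else:
            report[path] = "ok"
    return report


if __name__ == "__main__":
    for path, status in check_devices().items():
        print(path, status)
```

Running it both on the PVE host and inside the LXC shows whether the bind mounts expose the same devices the cgroup rules allow.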

Relevant log output

[Nest] 7  - 06/02/2024, 10:04:03 AM     LOG [ImmichMicroservices] [MediaService] Started encoding video a5d278de-239c-4713-ba9e-a862024a525c {"inputOptions":["-init_hw_device vaapi=accel:/dev/dri/renderD128","-filter_hw_device accel"],"outputOptions":["-c:v h264_vaapi","-c:a copy","-movflags faststart","-fps_mode passthrough","-map 0:0","-map 0:1","-g 256","-v verbose","-vf format=nv12,hwupload,scale_vaapi=480:-2","-compression_level 4","-threads 3","-b:v 1380","-maxrate 2000","-minrate 690","-rc_mode 3"],"twoPass":false}
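To test the same encode outside Immich, the logged options can be turned back into an ffmpeg argument list. The Python sketch below assumes that each logged option string is one flag followed by its value, and that the input file goes between the input and output options. The function name and the file paths are placeholders, not anything Immich provides.

```python
import json
import shlex


def ffmpeg_args(log_line: str, input_path: str, output_path: str) -> list[str]:
    """Build an ffmpeg argument list from a 'Started encoding video' log line."""
    # The options object is the JSON text that starts at the first '{'.
    options = json.loads(log_line[log_line.index("{"):])
    args = ["ffmpeg"]
    for opt in options["inputOptions"]:
        args += shlex.split(opt)  # "-filter_hw_device accel" -> two arguments
    args += ["-i", input_path]
    for opt in options["outputOptions"]:
        args += shlex.split(opt)
    args.append(output_path)
    return args


if __name__ == "__main__":
    # Shortened sample; paste the full log line to get the full command.
    sample = (
        'Started encoding video example {"inputOptions":'
        '["-init_hw_device vaapi=accel:/dev/dri/renderD128"],'
        '"outputOptions":["-c:v h264_vaapi","-b:v 1380"],"twoPass":false}'
    )
    print(shlex.join(ffmpeg_args(sample, "input.mp4", "output.mp4")))
```

One detail visible in the log: ffmpeg reads a bitrate without a unit suffix as bits per second, so `-b:v 1380` and `-maxrate 2000` as logged mean 1380 bit/s and 2000 bit/s. The thread does not establish whether this is related to the hang.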

Additional information

No response

mertalev commented 2 weeks ago

How much RAM does the server have? If CPU/GPU utilization is low and things are freezing, I wonder if a lack of RAM is causing it to slow to a crawl.

mshpp commented 2 weeks ago

> How much RAM does the server have? If CPU/GPU utilization is low and things are freezing, I wonder if a lack of RAM is causing it to slow to a crawl.

It has enough RAM: 8 GB in total, of which 4 GB are available to the LXC. There was no abnormal RAM usage either; I would have noticed it in the Proxmox dashboard.

mertalev commented 2 weeks ago

Can you confirm if this issue still happens when using CPU instead of VAAPI?

mshpp commented 2 weeks ago

Well, not really -- the microservices container has been removed with the last update. However, judging by the very strange behaviour (neither Docker nor the LXC can be terminated, not even with kill -9, until the entire system is restarted), this seems to be a problem with VAAPI and this particular GPU.