immich-app / immich

High performance self-hosted photo and video management solution.
https://immich.app
GNU Affero General Public License v3.0
38.86k stars 1.83k forks

Transcoding stuck #10560

Open Goodwu opened 1 week ago

Goodwu commented 1 week ago

The bug

It seems like ffmpeg is running into an infinite loop. ffmpeg runs at 100% CPU usage for hours, and the output file size is only 48 bytes. The HW decoding and encoding switches are both ON. This has occurred two times, both with the same behavior.

The OS that Immich Server is running on

OMV7

Version of Immich Server

v1.106.4

Version of Immich Mobile App

v1.106.4

Platform with the issue

Your docker-compose.yml content

#
# WARNING: Make sure to use the docker-compose.yml of the current release:
#
# https://github.com/immich-app/immich/releases/latest/download/docker-compose.yml
#
# The compose file on main may not be compatible with the latest release.
#

name: immich

services:
  immich-server:
    container_name: immich_server
    image: ghcr.io/immich-app/immich-server:${IMMICH_VERSION:-release}
    extends:
      file: hwaccel.transcoding.yml
      service: quicksync # set to one of [nvenc, quicksync, rkmpp, vaapi, vaapi-wsl] for accelerated transcoding
    volumes:
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
      - /etc/localtime:/etc/localtime:ro
      - ${EXT_LOCATION}:/media/photos
    env_file:
      - .env
    ports:
      - 2283:3001
    depends_on:
      - redis
      - database
    restart: always

  immich-machine-learning:
    container_name: immich_machine_learning
    # For hardware acceleration, add one of -[armnn, cuda, openvino] to the image tag.
    # Example tag: ${IMMICH_VERSION:-release}-cuda
    image: ghcr.io/immich-app/immich-machine-learning:${IMMICH_VERSION:-release}
    # extends: # uncomment this section for hardware acceleration - see https://immich.app/docs/features/ml-hardware-acceleration
    #   file: hwaccel.ml.yml
    #   service: cpu # set to one of [armnn, cuda, openvino, openvino-wsl] for accelerated inference - use the `-wsl` version for WSL2 where applicable
    volumes:
      - model-cache:/cache
    env_file:
      - .env
    restart: always

  redis:
    container_name: immich_redis
    image: redis:6.2-alpine@sha256:d6c2911ac51b289db208767581a5d154544f2b2fe4914ea5056443f62dc6e900
    healthcheck:
      test: redis-cli ping || exit 1
    restart: always

  database:
    container_name: immich_postgres
    image: tensorchord/pgvecto-rs:pg14-v0.2.0@sha256:90724186f0a3517cf6914295b5ab410db9ce23190a2d9d0b9dd6463e3fa298f0
    environment:
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_USER: ${DB_USERNAME}
      POSTGRES_DB: ${DB_DATABASE_NAME}
      POSTGRES_INITDB_ARGS: '--data-checksums'
    volumes:
      - ${DB_DATA_LOCATION}:/var/lib/postgresql/data
    healthcheck:
      test: pg_isready --dbname='${DB_DATABASE_NAME}' || exit 1; Chksum="$$(psql --dbname='${DB_DATABASE_NAME}' --username='${DB_USERNAME}' --tuples-only --no-align --command='SELECT SUM(checksum_failures) FROM pg_stat_database')"; echo "checksum failure count is $$Chksum"; [ "$$Chksum" = '0' ] || exit 1
      interval: 5m
      start_interval: 30s
      start_period: 5m
    command: ["postgres", "-c" ,"shared_preload_libraries=vectors.so", "-c", 'search_path="$$user", public, vectors', "-c", "logging_collector=on", "-c", "max_wal_size=2GB", "-c", "shared_buffers=512MB", "-c", "wal_compression=on"]
    restart: always

volumes:
  model-cache:

Your .env content

# You can find documentation for all the supported env variables at https://immich.app/docs/install/environment-variables

# The location where your uploaded files are stored
UPLOAD_LOCATION=./library
# The location where your database files are stored
DB_DATA_LOCATION=./postgres

# The Immich version to use. You can pin this to a specific version like "v1.71.0"
IMMICH_VERSION=release

# Connection secret for postgres. You should change it to a random password
DB_PASSWORD=postgres

EXT_LOCATION=/srv/dev-disk-by-uuid-2bf642cf-adb7-4ec8-9291-13bc77208fb0/store/WuStor_1/photo

# The values below this line do not need to be changed
###################################################################################
DB_USERNAME=postgres
DB_DATABASE_NAME=immich

Reproduction steps

1. HW decoding and encoding switch ON
2. More than 2000 video files
3. Transcoding stuck after 1 whole day
...

Relevant log output

No response

Additional information

No response

mertalev commented 1 week ago

Hmm, I agree this seems like a bug in FFmpeg. Have you tried restarting the container?

Goodwu commented 1 week ago

Hmm, I agree this seems like a bug in FFmpeg. Have you tried restarting the container?

No, I killed the ffmpeg process with -9 (SIGKILL) and it retried with software transcoding. After that, it continued transcoding with HW acceleration normally.
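The manual workaround described here (kill the hung process so the job can be retried) could in principle be automated with a watchdog. The following is only a minimal sketch, not how Immich handles it: the function name `run_with_stall_watchdog` and the parameters `stall_seconds` and `poll_seconds` are hypothetical, and it assumes that an output file that stops growing is a good enough sign of a hang.

```python
import os
import subprocess
import time


def run_with_stall_watchdog(cmd, output_path, stall_seconds=300.0, poll_seconds=5.0):
    """Run cmd and kill it if output_path stops growing for stall_seconds.

    Returns the exit code of the process, or None if it was killed as stalled.
    All names and thresholds here are illustrative assumptions.
    """
    proc = subprocess.Popen(cmd)
    last_size = -1
    last_progress = time.monotonic()
    while True:
        try:
            # wait() doubles as the polling sleep and returns as soon as the process exits.
            return proc.wait(timeout=poll_seconds)
        except subprocess.TimeoutExpired:
            pass
        size = os.path.getsize(output_path) if os.path.exists(output_path) else 0
        now = time.monotonic()
        if size != last_size:
            # The output changed since the last poll, so the encoder is making progress.
            last_size = size
            last_progress = now
        elif now - last_progress >= stall_seconds:
            proc.kill()  # SIGKILL on POSIX, the same signal as `kill -9`
            proc.wait()
            return None
```

A caller could treat a `None` result as "retry this file", for example with software transcoding, which is what happened after the manual kill in this report.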

dronnikovigor commented 1 week ago

Same bug here. I had to restart the container, but after the restart it started transcoding again and got stuck. I had to terminate the job. I was using HW transcoding.

mertalev commented 1 week ago

Same bug here. I had to restart the container, but after the restart it started transcoding again and got stuck.

I had to terminate the job.

I was using HW transcoding.

Which acceleration API were you using?

dronnikovigor commented 1 week ago

Which acceleration API were you using?

vaapi. Disabled it for now.

Started encoding video 0885652b-e0c9-45ee-b837-32d0ea69c623 {"inputOptions":["-init_hw_device vaapi=accel:/dev/dri/card0","-filter_hw_device accel"],"outputOptions":["-c:v h264_vaapi","-c:a copy","-movflags faststart","-fps_mode passthrough","-map 0:1","-map 0:0","-g 256","-v verbose","-vf format=nv12,hwupload,scale_vaapi=1080:-2","-compression_level 7","-qp:v 23","-global_quality:v 23","-rc_mode 1"],"twoPass":false}
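To reproduce the hang outside Immich, the logged options can be turned back into an ffmpeg command line. The sketch below assumes that `inputOptions` go before `-i` and `outputOptions` go between the input and the output file, which is how the ffmpeg CLI scopes per-file options; the exact way Immich invokes ffmpeg is not shown in this thread. The helper name `build_ffmpeg_argv` and the paths `input.mp4` and `output.mp4` are hypothetical.

```python
import json
import shlex


def build_ffmpeg_argv(logged_options, input_path, output_path):
    """Rebuild an ffmpeg argument list from the JSON printed in the log line."""
    argv = ["ffmpeg"]
    for opt in logged_options["inputOptions"]:
        argv += shlex.split(opt)  # "-flag value" becomes ["-flag", "value"]
    argv += ["-i", input_path]
    for opt in logged_options["outputOptions"]:
        argv += shlex.split(opt)
    argv.append(output_path)
    return argv


# Options copied from the log line above.
logged = json.loads('''
{"inputOptions":["-init_hw_device vaapi=accel:/dev/dri/card0","-filter_hw_device accel"],
 "outputOptions":["-c:v h264_vaapi","-c:a copy","-movflags faststart","-fps_mode passthrough",
                  "-map 0:1","-map 0:0","-g 256","-v verbose",
                  "-vf format=nv12,hwupload,scale_vaapi=1080:-2","-compression_level 7",
                  "-qp:v 23","-global_quality:v 23","-rc_mode 1"],
 "twoPass":false}
''')

# input.mp4 and output.mp4 are placeholder paths, not taken from the report.
print(shlex.join(build_ffmpeg_argv(logged, "input.mp4", "output.mp4")))
```

Running the printed command by hand against a file that previously got stuck would help show whether the hang reproduces without Immich in the loop.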

mertalev commented 1 week ago

Interesting, I wonder if it's specific to VAAPI then. I'll take a look at their issue tracker to see if this is a known issue.

mertalev commented 1 week ago

Can both of you clarify the following:

  1. Was transcoding concurrency set above 1?
  2. What kernel version does the server have?
  3. What model is the processor (or GPU if it's a discrete GPU)?

The only relevant info I've seen online is this issue along with the kernel bug it links to. But since the issue is quite old, I'm not sure if this is really it.

mertalev commented 1 week ago

There's also this issue that seems relevant.

Goodwu commented 1 week ago

Can both of you clarify the following:

  1. Was transcoding concurrency set above 1?
  2. What kernel version does the server have?
  3. What model is the processor (or GPU if it's a discrete GPU)?

The only relevant info I've seen online is this issue along with the kernel bug it links to. But since the issue is quite old, I'm not sure if this is really it.

  1. The transcoding concurrency is 1
  2. Linux b35022ea1bc4 6.8.4-3-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.4-3 (2024-05-02T11:55Z) x86_64 GNU/Linux
  3. J3455

Goodwu commented 1 week ago

There's also this issue that seems relevant.

I have done more tests and it seems that it's all related to HDR videos. I have also opened this issue under the media-driver project. When transcoding HDR video, it failed about 5 times and got stuck 1 time on average.
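For anyone trying to narrow this down on their own library, one way to pick out HDR files is to look at the transfer characteristic that ffprobe reports for the video stream. This is a rough sketch under the assumption that PQ (`smpte2084`) and HLG (`arib-std-b67`) are the relevant markers; Immich's own HDR check may differ, and the helper names `is_hdr` and `probe_file` are hypothetical.

```python
import json
import subprocess

# Transfer characteristics commonly used by HDR video: PQ and HLG.
HDR_TRANSFERS = {"smpte2084", "arib-std-b67"}


def is_hdr(probe):
    """Return True if any video stream in parsed ffprobe JSON uses an HDR transfer."""
    return any(
        stream.get("codec_type") == "video"
        and stream.get("color_transfer") in HDR_TRANSFERS
        for stream in probe.get("streams", [])
    )


def probe_file(path):
    """Run ffprobe on path and return its stream information as a dict."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-show_streams", "-of", "json", path],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

With ffprobe installed, `is_hdr(probe_file("some_video.mp4"))` (the file name is a placeholder) would flag candidates to test against the failing configuration.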

dronnikovigor commented 1 day ago

Can both of you clarify the following:

  1. Was transcoding concurrency set above 1?
  2. What kernel version does the server have?
  3. What model is the processor (or GPU if it's a discrete GPU)?

  1. transcoding concurrency is 1
  2. Linux 6.1.0-22-amd64 (omv 7)
  3. AMD Ryzen 5 5600G

mertalev commented 17 hours ago

It makes sense that it only fails for HDR videos since the command will only have the tonemap_opencl filter for HDR videos. Seems like that filter is where the issue lies, but I'm not sure what the root cause is.

Does disabling hardware decoding (and keeping hardware encoding enabled) avoid the issue?

mertalev commented 17 hours ago

Also worth noting that we use bleeding edge versions for the relevant dependencies following this PR. It may be interesting to try different versions of these to see if it affects the behavior.

dronnikovigor commented 17 hours ago

Does disabling hardware decoding (and keeping hardware encoding enabled) avoid the issue?

Can you advise how to set such a config?

mertalev commented 16 hours ago

In the transcoding settings, there is a setting for the hardware acceleration API, and a setting for hardware decoding below it. If hardware decoding is disabled but you set it to use an acceleration API, it will accelerate encoding only and handle decoding and tone-mapping on CPU.

dronnikovigor commented 16 hours ago

The setting says "Applies only to NVENC and RKMPP", but I have VAAPI on my AMD.

mertalev commented 12 hours ago

Hmm, that's a good point: the common thread here isn't OpenCL.

We know that:

That leaves QSV/VAAPI encoding as the most likely culprit. But that depends on whether the issue persists when hardware decoding for QSV is disabled. If it doesn't, it would poke a hole in this hypothesis.

mertalev commented 12 hours ago

It's also possible that these are just two different issues that happen to both cause FFmpeg to hang.