immich-app / immich

High performance self-hosted photo and video management solution.
https://immich.app
GNU Affero General Public License v3.0

Inconsistencies on storage errors #14438

Open Chuckame opened 1 day ago

Chuckame commented 1 day ago

The bug

I just set up an NFS mount for the library, but it is not as stable as I expected. Since the link is unstable, storage operations can fail at any moment.

This failure led to library inconsistencies: the entry is in the DB although the file was never uploaded properly. The images show up as broken, and the logs report that the file does not exist. The file is therefore considered uploaded and cannot be re-uploaded (except manually, one by one).

I suspect a flaw in the upload flow, where the photo is marked as uploaded in the DB before the upload has fully completed, so thumbnail generation still runs, but against a nonexistent file.

I am questioning the upload process here more than the mount issues. I have seen some similar GitHub issues caused by bad storage.
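
To make the suspected ordering problem concrete, here is a minimal sketch of a "write first, record second" upload flow. It is not Immich's actual implementation (the server is written in TypeScript); `store_upload` and the `record_asset` callback are hypothetical names standing in for the file write and the DB insert. The idea is that the DB only learns about a file after it has been flushed to storage, and that a failure at either step leaves neither a partial file nor an orphaned entry.

```python
import os
import tempfile


def store_upload(data: bytes, final_path: str, record_asset) -> None:
    """Write the file durably first, and only then record it.

    `record_asset` is a hypothetical callback standing in for the DB insert.
    """
    directory = os.path.dirname(final_path) or "."
    os.makedirs(directory, exist_ok=True)

    # Write to a temporary file in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".partial")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # surface I/O errors before committing
        os.replace(tmp_path, final_path)
    except BaseException:
        # Leave no partial file behind if the write or rename failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

    try:
        record_asset(final_path)
    except BaseException:
        # The DB insert failed: remove the file so storage and DB stay in sync.
        os.remove(final_path)
        raise
```

On a flaky network mount the cleanup calls can fail as well, so a sketch like this narrows the window for inconsistencies but does not remove the need for a later consistency check.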

The OS that Immich Server is running on

docker - debian

Version of Immich Server

v1.121.0

Version of Immich Mobile App

v1.121.0 build 168

Platform with the issue

Your docker-compose.yml content

services:
  server:
    image: ghcr.io/immich-app/immich-server:v1.121.0
    volumes:
      - immich_data:/usr/src/app/upload
      - immich_thumbnails:/usr/src/app/upload/thumbs
    devices:
      - /dev/dri:/dev/dri # intel quicksync
    environment:
      DB_HOSTNAME: database
      REDIS_HOSTNAME: redis
      TZ: Europe/Paris
      IMMICH_MACHINE_LEARNING_URL: http://machine-learning:3003
      DB_USERNAME: ${DB_USERNAME}
      DB_PASSWORD: ${DB_PASSWORD}
      DB_DATABASE_NAME: ${DB_DATABASE_NAME}
      PUID: ${PUID}
      PGID: ${PGID}
    depends_on:
      - redis
      - database
    restart: always
    networks:
      - internal
      - caddy
    labels:
      caddy: REDACTED
      caddy.reverse_proxy: '{{upstreams http 2283}}'

  machine-learning:
    image: ghcr.io/immich-app/immich-machine-learning:v1.121.0
    volumes:
      - model-cache:/cache
    restart: always
    networks:
      - internal

  redis:
    image: redis:6.2-alpine@sha256:eaba718fecd1196d88533de7ba49bf903ad33664a92debb24660a922ecd9cac8
    restart: always
    healthcheck:
      test: redis-cli ping || exit 1
    networks:
      - internal

  database:
    image: tensorchord/pgvecto-rs:pg14-v0.2.0@sha256:90724186f0a3517cf6914295b5ab410db9ce23190a2d9d0b9dd6463e3fa298f0
    environment:
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_USER: ${DB_USERNAME}
      POSTGRES_DB: ${DB_DATABASE_NAME}
      POSTGRES_INITDB_ARGS: '--data-checksums'
    volumes:
      - pgdata:/var/lib/postgresql/data
    restart: always
    networks:
      - internal
    healthcheck:
      test: pg_isready --dbname='${DB_DATABASE_NAME}' --username='${DB_USERNAME}' || exit 1; Chksum="$$(psql --dbname='${DB_DATABASE_NAME}' --username='${DB_USERNAME}' --tuples-only --no-align --command='SELECT COALESCE(SUM(checksum_failures), 0) FROM pg_stat_database')"; echo "checksum failure count is $$Chksum"; [ "$$Chksum" = '0' ] || exit 1
      interval: 5m
      start_interval: 30s
      start_period: 5m
    command: ["postgres", "-c" ,"shared_preload_libraries=vectors.so", "-c", 'search_path="$$user", public, vectors', "-c", "logging_collector=on", "-c", "max_wal_size=2GB", "-c", "shared_buffers=512MB", "-c", "wal_compression=on"]

volumes:
  pgdata:
  model-cache:
  immich_data:
    driver_opts:
      o: addr=${NFS_SERVER},rw,hard,nfsvers=4,tcp,noexec,timeo=10
      type: nfs
      device: :/mnt/user/Photos/immich
  immich_thumbnails:
    driver_opts:
      o: addr=${NFS_SERVER},rw,hard,nfsvers=4,tcp,noexec,timeo=10
      type: nfs
      device: :/mnt/user/immich-thumbnails

networks:
  caddy:
    external: true
  internal:

Your .env content

PUID=33
PGID=33
NFS_SERVER=10.253.0.0
DB_PASSWORD=REDACTED
DB_USERNAME=postgres
DB_DATABASE_NAME=immich
IMMICH_API_KEY=REDACTED

Reproduction steps

  1. mount unstable storage
  2. upload photo using immich background sync

Relevant log output

[Nest] 17  - 11/29/2024, 9:22:32 AM   ERROR [Api:LoggerRepository~bbre49ji] Unable to send file: Error
Error: ENOENT: no such file or directory, access 'upload/thumbs/a398e325-c5ed-49a1-9c01-d422130604a8/47/26/47263b55-b7c1-451e-b4cd-54f1aa062533-thumbnail.webp'
    at async access (node:internal/fs/promises:605:10)
    at async sendFile (/usr/src/app/dist/utils/file.js:50:9)
    at async AssetMediaController.viewAsset (/usr/src/app/dist/controllers/asset-media.controller.js:58:9)
[Nest] 17  - 11/29/2024, 9:22:32 AM   ERROR [Api:GlobalExceptionFilter~bbre49ji] Unknown error: Error: ENOENT: no such file or directory, access 'upload/thumbs/a398e325-c5ed-49a1-9c01-d422130604a8/47/26/47263b55-b7c1-451e-b4cd-54f1aa062533-thumbnail.webp'
Error: ENOENT: no such file or directory, access 'upload/thumbs/a398e325-c5ed-49a1-9c01-d422130604a8/47/26/47263b55-b7c1-451e-b4cd-54f1aa062533-thumbnail.webp'
    at async access (node:internal/fs/promises:605:10)
    at async sendFile (/usr/src/app/dist/utils/file.js:50:9)
    at async AssetMediaController.viewAsset (/usr/src/app/dist/controllers/asset-media.controller.js:58:9)

[Nest] 17  - 11/28/2024, 6:37:46 PM   ERROR [Api:ErrorInterceptor~0tkgg8oq] Unknown error: Error: Unknown system error -116: Unknown system error -116, mkdir 'upload/upload/a398e325-c5ed-49a1-9c01-d422130604a8/dd/19'
Error: Unknown system error -116: Unknown system error -116, mkdir 'upload/upload/a398e325-c5ed-49a1-9c01-d422130604a8/dd/19'
    at mkdirSync (node:fs:1363:26)
    at StorageRepository.mkdirSync (/usr/src/app/dist/repositories/storage.repository.js:133:37)
    at AssetMediaService.getUploadFolder (/usr/src/app/dist/services/asset-media.service.js:82:32)
    at /usr/src/app/dist/middleware/file-upload.interceptor.js:139:52
    at callbackify (/usr/src/app/dist/middleware/file-upload.interceptor.js:74:31)
    at FileUploadInterceptor.destination (/usr/src/app/dist/middleware/file-upload.interceptor.js:139:16)
    at DiskStorage._handleFile (/usr/src/app/node_modules/multer/storage/disk.js:31:8)
    at FileUploadInterceptor.handleFile (/usr/src/app/dist/middleware/file-upload.interceptor.js:149:29)
    at /usr/src/app/node_modules/multer/lib/make-middleware.js:137:17
    at callbackify (/usr/src/app/dist/middleware/file-upload.interceptor.js:74:16)

[Nest] 7  - 11/29/2024, 12:00:00 AM   ERROR [Microservices:JobService] Unable to run job handler (thumbnailGeneration/generate-thumbnails): Error: ffprobe exited with code 1
ffprobe version 7.0.2-Jellyfin Copyright (c) 2007-2024 the FFmpeg developers
  built with gcc 12 (Debian 12.2.0-14)
  configuration: --prefix=/usr/lib/jellyfin-ffmpeg --target-os=linux --extra-version=Jellyfin --disable-doc --disable-ffplay --disable-ptx-compression --disable-static --disable-libxcb --disable-sdl2 --disable-xlib --enable-lto=auto --enable-gpl --enable-version3 --enable-shared --enable-gmp --enable-gnutls --enable-chromaprint --enable-opencl --enable-libdrm --enable-libxml2 --enable-libass --enable-libfreetype --enable-libfribidi --enable-libfontconfig --enable-libharfbuzz --enable-libbluray --enable-libmp3lame --enable-libopus --enable-libtheora --enable-libvorbis --enable-libopenmpt --enable-libdav1d --enable-libsvtav1 --enable-libwebp --enable-libvpx --enable-libx264 --enable-libx265 --enable-libzvbi --enable-libzimg --enable-libfdk-aac --arch=amd64 --enable-libshaderc --enable-libplacebo --enable-vulkan --enable-vaapi --enable-amf --enable-libvpl --enable-ffnvcodec --enable-cuda --enable-cuda-llvm --enable-cuvid --enable-nvdec --enable-nvenc
  libavutil      59.  8.100 / 59.  8.100
  libavcodec     61.  3.100 / 61.  3.100
  libavformat    61.  1.100 / 61.  1.100
  libavdevice    61.  1.100 / 61.  1.100
  libavfilter    10.  1.100 / 10.  1.100
  libswscale      8.  1.100 /  8.  1.100
  libswresample   5.  1.100 /  5.  1.100
  libpostproc    58.  1.100 / 58.  1.100
upload/upload/a398e325-c5ed-49a1-9c01-d422130604a8/42/67/4267d5b7-a6e3-4eb2-8238-d2372ea676e7.mp4: No such file or directory
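
The `Unknown system error -116` in the `mkdir` trace above is a raw errno value that Node.js did not translate into a name. On Linux, errno 116 is `ESTALE` (stale file handle), which typically means an NFS file handle was invalidated, for example by a restart of the NFS server. A small helper for decoding such codes (`describe_errno` is a name made up for this sketch; errno numbering is platform-specific, so run it on the same OS as the server):

```python
import errno
import os


def describe_errno(code: int) -> tuple[str, str]:
    """Map a (possibly negated) errno value to its symbolic name and message."""
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)


if __name__ == "__main__":
    # Node.js reports the negated errno, hence -116.
    print(describe_errno(-116))
```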

Additional information

No response

mmomjian commented 1 day ago

I personally don't view this as an Immich issue - having reliable backend storage is not something that is in Immich's control, and should be managed on the server / deployment side. Will leave open for a bit in case someone disagrees. Otherwise, I would say this is a feature request for "more resilient to regular network storage disconnection"

Chuckame commented 1 day ago

@mmomjian I'm not asking for retries or better handling of bad storage; I totally agree with you on that. I'm questioning the upload process, where a try/catch could perhaps be improved to avoid these inconsistencies.

After double-checking, it seems that the original file exists but the thumbnail does not, and refreshing the thumbnail doesn't work because it still tries to access the missing thumbnail.

Currently, I'm unable to fix the issue because Immich treats it as a valid asset even though the thumbnail literally doesn't exist and cannot be regenerated.

Chuckame commented 1 day ago

Oh, I can now confirm that some original files are missing. I don't know how, but the metadata and faces were extracted correctly even though the original file is no longer present.

I found that the corruption appeared only once, when I restarted the remote storage.

I'm currently in a weird state where the data has been extracted but Immich fails to read the image.

bo0tzz commented 1 day ago

Is your network storage getting unmounted? If so, you'll find the missing originals on the local disk at the path where the network storage is normally mounted.

Chuckame commented 1 day ago

No, I found that it was caused by a reboot; the NFS mount seems stable the rest of the time. I think the issue was a sudden timeout on storage operations.

Still, I don't understand how the faces/metadata were extracted while there is nothing on the disk 🤔

bo0tzz commented 1 day ago

because of a reboot

Sounds likely that the mount didn't come up at boot and Immich ended up writing to the local disk. Did you check whether there are files there?

Chuckame commented 1 day ago

When it's unmounted (container stopped), the folder is completely empty. Btw, normally I cannot start the container if the mount fails, and we can't mount on a non-empty folder 🤔
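
For anyone who wants to repeat this check, a rough sketch: report whether a directory is currently a mount point and how many entries it contains. `inspect_mountpoint` is a name made up for this example, and the path to inspect depends on where the volume is mounted in your setup.

```python
import os


def inspect_mountpoint(path: str) -> dict:
    """Report whether `path` is a mount point and how many entries it holds."""
    entries = os.listdir(path) if os.path.isdir(path) else []
    return {
        "path": path,
        "is_mount": os.path.ismount(path),
        "entry_count": len(entries),
    }
```

Inside the server container, `inspect_mountpoint("/usr/src/app/upload")` should report `is_mount` as true while the volume is attached. Files found under the same path while `is_mount` is false would have been written to the local disk.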

Chuckame commented 23 hours ago

Here is the order of errors:

Chuckame commented 23 hours ago

How can I remove those assets to force a re-upload from the phones?

Also, is there currently a job or CLI command that verifies that each asset has a corresponding file with the matching hash? Something like an integrity check.
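
To illustrate the kind of integrity check meant here, a minimal standalone sketch. It is not an existing Immich job or command. It assumes you have already exported pairs of (file path, hex checksum) from the database yourself, and the default `sha1` is an assumption about how those checksums were computed. `file_digest` and `check_assets` are hypothetical names.

```python
import hashlib
import os
from typing import Iterable, Iterator, Tuple


def file_digest(path: str, algorithm: str = "sha1", chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large videos do not have to fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def check_assets(
    records: Iterable[Tuple[str, str]], algorithm: str = "sha1"
) -> Iterator[Tuple[str, str]]:
    """Yield (path, problem) for every record whose file is missing or altered."""
    for path, expected in records:
        if not os.path.isfile(path):
            yield path, "missing"
        elif file_digest(path, algorithm) != expected.lower():
            yield path, "checksum mismatch"
```

Running `list(check_assets(records))` over the exported pairs would return the entries whose files are missing or corrupted, which are the candidates to remove and re-upload.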