[BUG] Machine Learning container fails to start after upgrading to v1.77.0 from v1.76.1

thariq-shanavas commented 1 year ago

The bug

I upgraded Immich and the machine learning container now fails to start. Output of sudo docker logs -f immich_machine_learning:

[09/06/23 10:35:59] INFO Booting worker with pid: 4585
[09/06/23 10:36:29] CRITICAL WORKER TIMEOUT (pid:4585)
[09/06/23 10:36:31] ERROR Worker (pid:4585) was sent SIGKILL! Perhaps out of memory?

The container tries to restart, then fails with the same timeout error. I suspect a bug from https://github.com/immich-app/immich/pull/3934

I'm running on a system with 2 GB RAM (plus 1 GB ZRAM and 1 GB swap), so I've enabled only face recognition among the machine learning features. The processor is an Intel Atom Z8350. It works great in v1.76.1.

In my .env file, I have pinned the version to v1.76.1 until this is resolved. Thank you all so much for this amazing software! I'll be happy to post any other logs as needed.

The OS that Immich Server is running on

Debian 12

Version of Immich Server

v1.77.0

Version of Immich Mobile App

NA

Platform with the issue

[X] Server
[ ] Web
[ ] Mobile

Your docker-compose.yml content

version: "3.8"

services:
  immich-server:
    container_name: immich_server
    image: ghcr.io/immich-app/immich-server:${IMMICH_VERSION:-release}
    command: [ "start.sh", "immich" ]
    volumes:
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
    env_file:
      - .env
    depends_on:
      - redis
      - database
      - typesense
    restart: always

  immich-microservices:
    container_name: immich_microservices
    image: ghcr.io/immich-app/immich-server:${IMMICH_VERSION:-release}
    # extends:
    #   file: hwaccel.yml
    #   service: hwaccel
    command: [ "start.sh", "microservices" ]
    volumes:
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
    env_file:
      - .env
    depends_on:
      - redis
      - database
      - typesense
    restart: always

  immich-machine-learning:
    container_name: immich_machine_learning
    image: ghcr.io/immich-app/immich-machine-learning:${IMMICH_VERSION:-release}
    volumes:
      - model-cache:/cache
    env_file:
      - .env
    restart: always

  immich-web:
    container_name: immich_web
    image: ghcr.io/immich-app/immich-web:${IMMICH_VERSION:-release}
    env_file:
      - .env
    restart: always

  typesense:
    container_name: immich_typesense
    image: typesense/typesense:0.24.1@sha256:9bcff2b829f12074426ca044b56160ca9d777a0c488303469143dd9f8259d4dd
    environment:
      - TYPESENSE_API_KEY=${TYPESENSE_API_KEY}
      - TYPESENSE_DATA_DIR=/data
      # remove this to get debug messages
      - GLOG_minloglevel=1
    volumes:
      - tsdata:/data
    restart: always
  redis:
    container_name: immich_redis
    image: redis:6.2-alpine@sha256:70a7a5b641117670beae0d80658430853896b5ef269ccf00d1827427e3263fa3
    restart: always

  database:
    container_name: immich_postgres
    image: postgres:14-alpine@sha256:28407a9961e76f2d285dc6991e8e48893503cc3836a4755bbc2d40bcc272a441
    env_file:
      - .env
    environment:
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_USER: ${DB_USERNAME}
      POSTGRES_DB: ${DB_DATABASE_NAME}
    volumes:
      - pgdata:/var/lib/postgresql/data
    restart: always

  immich-proxy:
    container_name: immich_proxy
    image: ghcr.io/immich-app/immich-proxy:${IMMICH_VERSION:-release}
    environment:
      # Make sure these values get passed through from the env file
      - IMMICH_SERVER_URL
      - IMMICH_WEB_URL
    ports:
      - 2283:8080
    depends_on:
      - immich-server
      - immich-web
    restart: always

volumes:
  pgdata:
  model-cache:
  tsdata:

Your .env content

# You can find documentation for all the supported env variables at https://immich.app/docs/install/environment-variables

# The location where your uploaded files are stored
UPLOAD_LOCATION= [Redacted]

# The Immich version to use. You can pin this to a specific version like "v1.71.0"
# IMMICH_VERSION=release
IMMICH_VERSION=v1.76.1
# Connection secrets for postgres and typesense. You should change these to random passwords
TYPESENSE_API_KEY= [Redacted]
DB_PASSWORD= [Redacted]

# The values below this line do not need to be changed
###################################################################################
DB_HOSTNAME=immich_postgres
DB_USERNAME=postgres
DB_DATABASE_NAME=immich

REDIS_HOSTNAME=immich_redis
#IMMICH_MACHINE_LEARNING_ENABLED=false
#TYPESENSE_ENABLED=false

Reproduction steps

1. Start the container with docker-compose up -d

Additional information

No response

raisinbear commented 1 year ago

Same here. I got this message at least twice at initial startup, but it seems to have resolved itself after some retries. Face recognition also seems to be working; the other options are disabled, just like for OP.

thariq-shanavas commented 1 year ago

I noticed it a couple hours after the update, and it had not resolved itself. It probably restarted hundreds of times in that time frame. A reboot did not fix it either.

hachre commented 1 year ago

I had the same issue (also noticed after hours) but in my case a docker compose down --remove-orphans and docker compose up -d solved it for me...

alextran1502 commented 1 year ago

Cc @mertalev

mertalev commented 1 year ago

Looks like gunicorn gives workers 30s to start and terminates them if they don't start within this time. It might take longer than this for a worker to start on very slow CPUs. Setting --timeout to a higher number should fix it, maybe 120?
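As a minimal sketch of the suggestion above: gunicorn's worker timeout can also be set in a Python config file instead of on the command line. The `timeout` setting is a real gunicorn setting with a default of 30 seconds; the value 120 is the one floated above. Whether the Immich image picks up a config file like this is an assumption, since its start.sh may pass flags directly.

```python
# gunicorn.conf.py: hypothetical config file, not taken from the Immich image.
# gunicorn reads ./gunicorn.conf.py by default, or a path given with -c.

# Seconds a worker may stay silent before the master kills and restarts it.
# The default is 30, which matches the gap between "Booting worker" and
# "WORKER TIMEOUT" in the log from the original report.
timeout = 120
```

Command-line flags take precedence over the config file, so if start.sh already passes `--timeout`, that flag would have to be changed instead.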

koffienl commented 1 year ago

Not sure if this is the same issue, but just did a clean install (v1.77.0) on a clean docker container with the stack file from the site. The machine-learning container won't finish the download and is stuck in a loop downloading over and over again.

There's plenty of CPU and mem for the container, but it's cutting off the download after 29 seconds.


/usr/local/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
[09/08/23 15:15:14] ERROR    Worker (pid:121) was sent code 134!
[09/08/23 15:15:14] INFO     Booting worker with pid: 135
[09/08/23 15:15:21] INFO     Created in-memory cache with unloading disabled.
[09/08/23 15:15:21] INFO     Initialized request thread pool with 12 threads.
[09/08/23 15:15:21] INFO     Downloading facial-recognition model 'buffalo_l'.This may take a while.
[09/08/23 15:15:21] WARNING  Failed to load facial-recognition model 'buffalo_l'.Clearing cache and retrying.
[09/08/23 15:15:21] INFO     Cleared cache directory for model 'buffalo_l'.
[09/08/23 15:15:21] INFO     Downloading facial-recognition model 'buffalo_l'.This may take a while.
Downloading /cache/facial-recognition/buffalo_l/buffalo_l.zip from https://github.com/deepinsight/insightface/releases/download/v0.7/buffalo_l.zip...
18%|█▊        | 50581/281857 [00:05<00:26, 8850.62KB/s]

OK, my bad. I thought this fix was already published, but it wasn't.
I edited the start.sh file to add the timeout and it started working.
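
A rough check of why the download kept getting cut off, using only the numbers in the progress bar above. This sketch assumes that the 30-second gunicorn default mentioned earlier in the thread applies while the model downloads, and that the download rate stays roughly constant.

```python
# Numbers read from the progress line in the log above.
total_kb = 281_857        # size of buffalo_l.zip as reported by the progress bar
rate_kb_per_s = 8_850.62  # observed download rate

# Assumed worker timeout: gunicorn's 30 s default, per the earlier comment.
worker_timeout_s = 30

# Time needed to fetch the whole archive at the observed rate.
download_s = total_kb / rate_kb_per_s

print(f"estimated download time: {download_s:.1f} s")
print("finishes before timeout:", download_s < worker_timeout_s)
```

At that rate the archive needs roughly 32 seconds, slightly more than the 30-second limit. That is consistent with the progress bar's own estimate (5 s elapsed plus 26 s remaining) and with the download being cut off at around 29 seconds.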
