immich-app / immich

High performance self-hosted photo and video management solution.
https://immich.app
GNU Affero General Public License v3.0

[BUG] Facial recognition is progressing even when the machine learning container is down #2646

Closed madadam closed 1 year ago

madadam commented 1 year ago

The bug

I have the machine learning container running on a different machine than the rest of the stack, but that machine is not always online. I noticed that the facial recognition job was reported as complete (active: 0, waiting: 0) even though the machine running the ML container was offline. Clicking the "MISSING" button on the job did nothing. Clicking "ALL" restarts the job, but that loses all the previously recognized faces (from the time the ML container was up).

What I think is happening is that the facial recognition job runs, but when the request to the ML container fails it still ticks that image as processed and moves on to the next one. So in the end all the images are treated as processed by facial recognition even though they are not, and the job thinks it has nothing left to do and considers itself complete.

What I think should happen is that if the ML container is not accessible, the facial recognition job should stop and resume only when the machine learning container is online again.

NOTE: This same issue might affect the other ML jobs as well, but I haven't checked.
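
To illustrate the distinction, here is a minimal sketch of a job worker that only ticks an asset as processed when the ML request actually succeeds, and otherwise requeues it and stops. This is not Immich's actual code; the queue shape, endpoint path and function names are hypothetical and only the control flow is the point.

// Minimal sketch, NOT Immich's implementation: names, the endpoint path and
// the queue shape are hypothetical.
interface Asset {
  id: string;
  facesProcessed: boolean;
}

// Hypothetical call to the machine learning container; throws when the
// container is unreachable (connection refused, timeout, ...).
async function detectFaces(mlUrl: string, asset: Asset): Promise<unknown> {
  const res = await fetch(`${mlUrl}/detect-faces`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ assetId: asset.id }),
  });
  if (!res.ok) throw new Error(`ML request failed: ${res.status}`);
  return res.json();
}

async function runFacialRecognitionJob(mlUrl: string, queue: Asset[]): Promise<void> {
  while (queue.length > 0) {
    const asset = queue.shift()!;
    try {
      await detectFaces(mlUrl, asset);
      // Only mark the asset as processed when the ML call succeeded.
      asset.facesProcessed = true;
    } catch (err) {
      // ML container unreachable: put the asset back and stop the job instead
      // of silently marking it as done (the behaviour described above).
      queue.unshift(asset);
      console.warn(`ML not reachable, pausing job: ${err}`);
      return;
    }
  }
}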

The OS that Immich Server is running on

OSMC (Debian) on Raspberry Pi 4 + Ubuntu 22.04 (only the machine learning container)

Version of Immich Server

v1.59.1

Version of Immich Mobile App

not a mobile issue

Platform with the issue

Your docker-compose.yml content

version: "3.8"

services:
  immich-server:
    container_name: immich_server
    image: ghcr.io/immich-app/immich-server:release
    command: ["start-server.sh"]
    volumes:
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
    env_file:
      - .env
    depends_on:
      - redis
      - database
      - typesense
    restart: always

  immich-microservices:
    container_name: immich_microservices
    image: ghcr.io/immich-app/immich-server:release
    command: ["start-microservices.sh"]
    volumes:
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
    env_file:
      - .env
    depends_on:
      - redis
      - database
      - typesense
    restart: always

# EDIT: Machine learning runs on a different machine
#  immich-machine-learning:
#    container_name: immich_machine_learning
#    image: ghcr.io/immich-app/immich-machine-learning:release
#    volumes:
#      - ${UPLOAD_LOCATION}:/usr/src/app/upload
#      - model-cache:/cache
#    env_file:
#      - .env
#    restart: always

  immich-web:
    container_name: immich_web
    image: ghcr.io/immich-app/immich-web:release
    env_file:
      - .env
    restart: always

  typesense:
    container_name: immich_typesense
    image: typesense/typesense:0.24.0
    environment:
      - TYPESENSE_API_KEY=${TYPESENSE_API_KEY}
      - TYPESENSE_DATA_DIR=/data
    logging:
      driver: none
    volumes:
      - tsdata:/data
    restart: always

  redis:
    container_name: immich_redis
    image: redis:6.2
    restart: always

  database:
    container_name: immich_postgres
    image: postgres:14
    env_file:
      - .env
    environment:
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_USER: ${DB_USERNAME}
      POSTGRES_DB: ${DB_DATABASE_NAME}
      PG_DATA: /var/lib/postgresql/data
    volumes:
      - pgdata:/var/lib/postgresql/data
    restart: always

  immich-proxy:
    container_name: immich_proxy
    image: ghcr.io/immich-app/immich-proxy:release
    environment:
      # Make sure these values get passed through from the env file
      - IMMICH_SERVER_URL
      - IMMICH_WEB_URL
    ports:
      - 2283:8080
    depends_on:
      - immich-server
    restart: always

volumes:
  pgdata:
  model-cache:
  tsdata:

Your .env content

# NOTE: The following four database variables support Docker secrets by adding a *_FILE suffix to the variable name
# See the docker-compose documentation on secrets for additional details: https://docs.docker.com/compose/compose-file/compose-file-v3/#secrets
DB_HOSTNAME=immich_postgres
DB_USERNAME=postgres
DB_PASSWORD=postgres
DB_DATABASE_NAME=immich

# Optional Database settings:
# DB_PORT=5432

###################################################################################
# Redis
###################################################################################

REDIS_HOSTNAME=immich_redis

# REDIS_URL will be used to pass custom options to ioredis.
# Example for Sentinel
# {"sentinels":[{"host":"redis-sentinel-node-0","port":26379},{"host":"redis-sentinel-node-1","port":26379},{"host":"redis-sentinel-node-2","port":26379}],"name":"redis-sentinel"}
# REDIS_URL=ioredis://eyJzZW50aW5lbHMiOlt7Imhvc3QiOiJyZWRpcy1zZW50aW5lbDEiLCJwb3J0IjoyNjM3OX0seyJob3N0IjoicmVkaXMtc2VudGluZWwyIiwicG9ydCI6MjYzNzl9XSwibmFtZSI6Im15bWFzdGVyIn0=

# Optional Redis settings:

# Note: these parameters are not automatically passed to the Redis Container
# to do so, please edit the docker-compose.yml file as well. Redis is not configured
# via environment variables, only redis.conf or the command line

# REDIS_PORT=6379
# REDIS_DBINDEX=0
# REDIS_USERNAME=
# REDIS_PASSWORD=
# REDIS_SOCKET=

###################################################################################
# Upload File Location
#
# This is the location where uploaded files are stored.
###################################################################################

UPLOAD_LOCATION=/media/sklad/photos/immich

###################################################################################
# Typesense
###################################################################################
TYPESENSE_API_KEY=eKr/DBr9zgOMCoi7XmQCbHAqiBKUC00kXx9jHxezxmV56Jhq3Uq9ELsYtbvXfKrT4VJjFBbgU9fB1oxmgLmFVw==
# TYPESENSE_ENABLED=false
# TYPESENSE_URL uses base64 encoding for the nodes json.
# Example JSON that was used:
# [
#      { 'host': 'typesense-1.example.net', 'port': '443', 'protocol': 'https' },
#      { 'host': 'typesense-2.example.net', 'port': '443', 'protocol': 'https' },
#      { 'host': 'typesense-3.example.net', 'port': '443', 'protocol': 'https' },
#  ]
# TYPESENSE_URL=ha://WwogICAgeyAnaG9zdCc6ICd0eXBlc2Vuc2UtMS5leGFtcGxlLm5ldCcsICdwb3J0JzogJzQ0MycsICdwcm90b2NvbCc6ICdodHRwcycgfSwKICAgIHsgJ2hvc3QnOiAndHlwZXNlbnNlLTIuZXhhbXBsZS5uZXQnLCAncG9ydCc6ICc0NDMnLCAncHJvdG9jb2wnOiAnaHR0cHMnIH0sCiAgICB7ICdob3N0JzogJ3R5cGVzZW5zZS0zLmV4YW1wbGUubmV0JywgJ3BvcnQnOiAnNDQzJywgJ3Byb3RvY29sJzogJ2h0dHBzJyB9LApd

###################################################################################
# Reverse Geocoding
#
# Reverse geocoding is done locally which has a small impact on memory usage
# This memory usage can be altered by changing the REVERSE_GEOCODING_PRECISION variable
# This ranges from 0-3 with 3 being the most precise
# 3 - Cities > 500 population: ~200MB RAM
# 2 - Cities > 1000 population: ~150MB RAM
# 1 - Cities > 5000 population: ~80MB RAM
# 0 - Cities > 15000 population: ~40MB RAM
####################################################################################

# DISABLE_REVERSE_GEOCODING=false
# REVERSE_GEOCODING_PRECISION=3

####################################################################################
# WEB - Optional
#
# Custom message on the login page, should be written in HTML form.
# For example:
# PUBLIC_LOGIN_PAGE_MESSAGE="This is a demo instance of Immich.<br><br>Email: <i>demo@demo.de</i><br>Password: <i>demo</i>"
####################################################################################

PUBLIC_LOGIN_PAGE_MESSAGE=

####################################################################################
# Alternative Service Addresses - Optional
#
# This is an advanced feature for users who may be running their immich services on different hosts.
# It will not change which address or port that services bind to within their containers, but it will change where other services look for their peers.
# Note: immich-microservices is bound to 3002, but no references are made
####################################################################################

IMMICH_WEB_URL=http://immich-web:3000
IMMICH_SERVER_URL=http://immich-server:3001

# IMMICH_MACHINE_LEARNING_URL=http://immich-machine-learning:3003
IMMICH_MACHINE_LEARNING_URL=http://192.168.1.112:3003

####################################################################################
# Alternative API's External Address - Optional
#
# This is an advanced feature used to control the public server endpoint returned to clients during Well-known discovery.
# You should only use this if you want mobile apps to access the immich API over a custom URL. Do not include trailing slash.
# NOTE: At this time, the web app will not be affected by this setting and will continue to use the relative path: /api
# Examples: http://localhost:3001, http://immich-api.example.com, etc
####################################################################################

#IMMICH_API_URL_EXTERNAL=http://localhost:3001

Reproduction steps

1. Upload a bunch of photos with faces
2. Go to the Jobs screen in the web interface and observe the facial recognition job starting
3. Bring down the machine learning container
4. Observe that the facial recognition job still runs to completion

NOTE: I haven't actually tried these exact steps. In my case the ML runs on a different machine and I simply shut that machine down, but I think these simpler steps should be sufficient to reproduce the issue.
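
As a side note, before expecting the jobs to make progress it can help to check from the server host whether the remote ML endpoint (IMMICH_MACHINE_LEARNING_URL from the .env above) is reachable at all. Below is a small probe sketch; any HTTP response, even an error status, means the container is up, while a thrown error means it is unreachable.

// Small probe sketch: report whether the machine learning URL answers at all.
const ML_URL = process.env.IMMICH_MACHINE_LEARNING_URL ?? 'http://192.168.1.112:3003';

async function checkMl(): Promise<void> {
  try {
    // Any HTTP response (even a 404) means the container is up and reachable.
    const res = await fetch(ML_URL, { signal: AbortSignal.timeout(3000) });
    console.log(`ML endpoint reachable (HTTP ${res.status})`);
  } catch (err) {
    console.error(`ML endpoint NOT reachable: ${err}`);
  }
}

checkMl();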

Additional information

No response

jrasm91 commented 1 year ago

This was actually a bug: #2618

In the next release the "missing" job should work as expected, and you will be able to re-process the skipped files.

jrasm91 commented 1 year ago

You are also describing an enhancement: don't process ML-related jobs if the ML container is offline. Alternatively, you could manually pause/resume the corresponding jobs when ML is stopped/started.
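
The manual workaround can be done from the admin Jobs page, or scripted against the Immich API. A rough sketch of the scripted variant follows; the /api/jobs route, the job name and the request body are assumptions that may differ between Immich versions, so verify them against your instance's OpenAPI spec, and the API key is a placeholder.

// Rough sketch of pausing/resuming the facial recognition queue over HTTP.
// ASSUMPTIONS: the `/api/jobs/{name}` route, the `recognize-faces` job name
// and the `{ command, force }` body may differ between Immich versions.
const IMMICH_URL = 'http://localhost:2283'; // immich-proxy port from the compose file above
const API_KEY = '<your-api-key>';           // placeholder: create one in your account settings

async function setJobCommand(job: string, command: 'pause' | 'resume'): Promise<void> {
  const res = await fetch(`${IMMICH_URL}/api/jobs/${job}`, {
    method: 'PUT',
    headers: { 'x-api-key': API_KEY, 'Content-Type': 'application/json' },
    body: JSON.stringify({ command, force: false }),
  });
  if (!res.ok) throw new Error(`jobs API returned ${res.status}`);
}

// Pause face recognition while the ML machine is offline, resume once it is back:
await setJobCommand('recognize-faces', 'pause');
// ... ML machine back online ...
await setJobCommand('recognize-faces', 'resume');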

Pheggas commented 1 year ago

@madadam as this issue has been solved by #2618, could you please close this issue?

madadam commented 1 year ago

I don't have an easy way to verify the fix right now. I'll close it anyway, and if I encounter more issues I'll reopen. Thanks!

Pheggas commented 1 year ago

No problem. I hope it's been solved tho!