immich-app / immich

High performance self-hosted photo and video management solution.
https://immich.app
GNU Affero General Public License v3.0
50.37k stars, 2.67k forks

Immich Server not starting at 1.117.0 #13317

Closed by PassionateBytes 3 weeks ago

PassionateBytes commented 3 weeks ago

The bug

Status Quo:

Upgrade Process:

Observations:

The OS that Immich Server is running on

K3S Cluster on Fedora Server

Version of Immich Server

v1.117.0

Version of Immich Mobile App

v1.117.0

Platform with the issue

My values.yml content

# This chart relies on the common library chart from bjw-s
## You can find it at https://github.com/bjw-s/helm-charts/tree/main/charts/library/common
## Refer there for more detail about the supported values

# These entries are shared between all the Immich components

env:
  REDIS_HOSTNAME: common-redis-master
  DB_HOSTNAME: common-postgresql
  DB_DATABASE_NAME: immich
  DB_USERNAME:
    valueFrom:
      secretKeyRef:
        name: immich-postgres
        key: username
  DB_PASSWORD:
    valueFrom:
      secretKeyRef:
        name: immich-postgres
        key: password
  IMMICH_MACHINE_LEARNING_URL: '{{ printf "http://%s-machine-learning:3003" .Release.Name }}'
  IMMICH_LOG_LEVEL: 'verbose'

image:
  tag: v1.117.0

immich:
  metrics:
    # Enabling this will create the service monitors needed to monitor immich with the prometheus operator
    enabled: false
  persistence:
    library:
      existingClaim: immich-pvc
      size: 3Ti
  configuration:
    #trash:
    #  enabled: false
    #  days: 30
    #storageTemplate:
    #  enabled: true
    #  template: "{{y}}/{{y}}-{{MM}}-{{dd}}/{{filename}}"

# Dependencies

postgresql:
  enabled: false

redis:
  enabled: false

# Immich components

server:
  enabled: true
  image:
    repository: ghcr.io/immich-app/immich-server
    pullPolicy: IfNotPresent
  ingress:
    main:
      enabled: false

machine-learning:
  enabled: true
  image:
    repository: ghcr.io/immich-app/immich-machine-learning
    pullPolicy: IfNotPresent
  env:
    TRANSFORMERS_CACHE: /cache
  persistence:
    cache:
      enabled: true
      size: 10Gi
      # Optional: Set this to pvc to avoid downloading the ML models every start.
      type: emptyDir
      accessMode: ReadWriteMany
      # storageClass: your-class

Reproduction steps

Upgrade the Immich chart from v1.116.2 to v1.117.0 and run helm upgrade.

Logs (set to verbose):


# Immich Server

DEBUG: cgroup v2 detected.
DEBUG: No CPU limits set.
Detected CPU Cores: 4
Starting api worker
Starting microservices worker
[Nest] 7  - 10/09/2024, 2:48:42 PM     LOG [Microservices:EventRepository] Initialized websocket server
[Nest] 7  - 10/09/2024, 2:48:42 PM     LOG [Microservices:MapRepository] Initializing metadata repository
[Nest] 17  - 10/09/2024, 2:48:42 PM     LOG [Api:EventRepository] Initialized websocket server

# ...then it stops and the pod restarts

# Immich Machine Learning

[10/09/24 14:48:37] INFO     Starting gunicorn 23.0.0                           
[10/09/24 14:48:37] INFO     Listening at: http://[::]:3003 (9)                 
[10/09/24 14:48:37] INFO     Using worker: app.config.CustomUvicornWorker       
[10/09/24 14:48:37] INFO     Booting worker with pid: 10                        
[10/09/24 14:48:38] DEBUG    Could not load ANN shared libraries, using ONNX:   
                             libmali.so: cannot open shared object file: No such
                             file or directory                                  
[10/09/24 14:48:42] INFO     Started server process [10]                        
[10/09/24 14:48:42] INFO     Waiting for application startup.                   
[10/09/24 14:48:42] INFO     Created in-memory cache with unloading after 300s  
                             of inactivity.                                     
[10/09/24 14:48:42] INFO     Initialized request thread pool with 4 threads.    
[10/09/24 14:48:42] DEBUG    Checking for inactivity...                         
[10/09/24 14:48:42] INFO     Application startup complete.                      
[10/09/24 14:48:52] DEBUG    Checking for inactivity...                         
[10/09/24 14:49:02] DEBUG    Checking for inactivity...                         
[10/09/24 14:49:12] DEBUG    Checking for inactivity...                
alextran1502 commented 3 weeks ago

The immich microservices container has not existed for a while, at least from the docker-compose standpoint. I am not sure about the helm chart.

PassionateBytes commented 3 weeks ago

The immich microservices container has not existed for a while, at least from the docker-compose standpoint. I am not sure about the helm chart.

My apologies, and good catch! I mistyped that. The pod isn't the Microservices one; I meant to write Machine-Learning. Let me fix that in my original post.

Also, yes, the microservices section in the values.yaml has had no effect since that change a few versions back. At the time I updated the chart's template files to eliminate the microservices pod when it became obsolete. I must have just forgotten to remove the section from the values.yaml file, but without the corresponding templates this section doesn't do anything anyway. I'll remove it from my post to avoid future confusion!

PassionateBytes commented 3 weeks ago

I found the issue myself. @alextran1502, your note pointed me in the right direction. The last time I changed the chart templates was back when the microservices pod became obsolete. I checked the changes in recent releases of the chart and realized that they raised the failure threshold on the startup probe, to give the server pod more time to start before Kubernetes kills and restarts it. (This is the change in question.) I added this startup probe configuration to my setup, which resolved the issue. The Server pod now has more time to start up and no longer gets killed too early.
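For anyone hitting the same restart loop, a startup probe override might look roughly like the sketch below. It is not the exact change from the chart: the key layout assumes the `probes` block of the bjw-s common library, the `/api/server/ping` path and the `http` port name are assumptions, and the numbers are only illustrative.

```yaml
# Hypothetical values.yaml override; the key layout, path, port name and
# all numbers are assumptions to adapt to your own chart version.
server:
  probes:
    startup:
      enabled: true
      custom: true
      spec:
        httpGet:
          path: /api/server/ping   # assumed health endpoint
          port: http               # assumed named container port
        periodSeconds: 10          # probe every 10 seconds
        timeoutSeconds: 1
        failureThreshold: 30       # tolerate 30 failed probes before a restart
```

With these values Kubernetes would wait up to failureThreshold × periodSeconds = 30 × 10 s = 300 s for the server to answer before restarting the container. Liveness and readiness probes are only evaluated once the startup probe has succeeded.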

Thanks!