immich-app / immich-charts

Helm chart implementation of Immich
https://immich.app

Immich not completing startup on 1.106.4 using chart 0.7.0 #103

Closed: btajuddin closed this issue 4 weeks ago

btajuddin commented 4 weeks ago

When I tried to upgrade to Immich 1.106.4, I couldn't get the server to start properly. When I logged into the container, the server wasn't even listening on port 3001, only on 8081 and 8082. I've dug around a bit in the Immich code, but I'm not seeing exactly what is broken.

These are all the logs I get:

Detected CPU Cores: 40
Starting api worker
Starting microservices worker
[Nest] 7 - 06/19/2024, 4:20:54 PM LOG [Microservices:EventRepository] Initialized websocket server
[Nest] 17 - 06/19/2024, 4:20:54 PM LOG [Api:EventRepository] Initialized websocket server

Values file:

env:
  DB_PASSWORD: [redacted]
  DB_HOSTNAME: postgres-rw
  REDIS_HOSTNAME: redis-master

machine-learning:
  resources:
    requests:
      cpu: "2"
      memory: 4Gi
  persistence:
    cache:
      type: pvc
      storageClass: longhorn
  probes:
    liveness:
      spec:
        periodSeconds: 60

immich:
  metrics:
    enabled: false
  persistence:
    library:
      existingClaim: immich

postgresql:
  enabled: false

redis:
  enabled: false

server:
  resources:
    requests:
      cpu: "2"
      memory: 4Gi
  service:
    main:
      annotations:
        tailscale.com/expose: "true"

bo0tzz commented 4 weeks ago

Does it just sit there doing nothing, or is the pod getting killed and such? Are you maybe running into #102?

btajuddin commented 4 weeks ago

I saw that one and set the probe delays to 1800 seconds so that I could log into the container to see what was happening. It just seemed to be sitting there. I also installed curl and netstat so that I could try to hit the healthcheck and see if the port was open. 3001 was not listening according to netstat.
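
For reference, the override I used looked roughly like this (only a sketch; it assumes the server block accepts the same probes layout as the machine-learning block in my values above):

server:
  probes:
    liveness:
      spec:
        initialDelaySeconds: 1800
    readiness:
      spec:
        initialDelaySeconds: 1800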

bo0tzz commented 4 weeks ago

Can you set IMMICH_LOG_LEVEL=verbose and see if anything more comes out?
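
That should just be another entry in the env block of your values file, e.g.:

env:
  IMMICH_LOG_LEVEL: verbose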

btajuddin commented 4 weeks ago

I had downgraded back to 1.105 earlier. When I re-upgraded and upped the logging with that variable, I found a database connection issue. With the increased logging, the microservices container was actually failing and exiting. I'm not sure how to reproduce this, but there might be an error case that's getting swallowed when the log level is lower. Unfortunately, I didn't think to grab the log when it was on my screen, and the old pod logs are gone now.

Restarting my DB pod fixed it, and everything is working now. Thanks for the help!

bo0tzz commented 4 weeks ago

I can't reproduce any log level issues, but what I am seeing is that logs seem to get buffered and only written out all at once when the process goes down.

dvystrcil commented 2 weeks ago

I am seeing this as well. When I edit the deployment in ArgoCD to use initialDelaySeconds: 120, the container gets the time it needs to come up to a healthy state.
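
Rather than editing the rendered Deployment by hand, something along these lines in the chart values should have the same effect (a sketch only; it assumes the server component exposes the probe spec the same way the machine-learning block does earlier in this thread):

server:
  probes:
    liveness:
      spec:
        initialDelaySeconds: 120
    readiness:
      spec:
        initialDelaySeconds: 120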