immich-app / immich-charts

Helm chart implementation of Immich
https://immich.app
GNU Affero General Public License v3.0

Readiness probe failed #130

Closed morotsgurka closed 4 weeks ago

morotsgurka commented 1 month ago

Hi!

I'm running k3s on 3 nodes together with Longhorn and FluxCD. immich-chart version: 0.8.1

I found an issue on the immich repo with a similar problem, but the cause there seemed to be that postgresql was not enabled, which I have.

This is my values file:

env:
  DB_PASSWORD:
    valueFrom:
      secretKeyRef:
        name: immich-postgresql-secret
        key: password
  DB_DATABASE_NAME:
    valueFrom:
      secretKeyRef:
        name: immich-postgresql-secret
        key: database
  DB_USERNAME:
    valueFrom:
      secretKeyRef:
        name: immich-postgresql-secret
        key: username

immich:
  metrics:
    # Enabling this will create the service monitors needed to monitor immich with the prometheus operator
    enabled: false
  persistence:
    # Main data store for all photos shared between different components.
    library:
      existingClaim: immich-library-pvc-claim

# Dependencies

postgresql:
  enabled: true
  global:
    postgresql:
      auth:
        existingSecret: immich-postgresql-secret
  primary:
    persistence:
      existingClaim: immich-postgresql-pvc-claim
    containerSecurityContext:
      readOnlyRootFilesystem: false
    initdb:
      scripts:
        create-extensions.sql: |
          CREATE EXTENSION cube;
          CREATE EXTENSION earthdistance;
          CREATE EXTENSION vectors;

redis:
  enabled: true
  architecture: standalone
  auth:
    enabled: false
  master:
    persistence:
      existingClaim: immich-redis-pvc-claim

# Immich components
server:
  enabled: true

machine-learning:
  enabled: true
  env:
    TRANSFORMERS_CACHE: /cache
  persistence:
    cache:
      enabled: true
      # Optional: Set this to pvc to avoid downloading the ML models every start.
      type: pvc
      existingClaim: immich-ml-cache-pvc-claim

I have created volumes in Longhorn with PVCs for each claim, and they all attach normally. I can see that all pods are running except the immich-server, which fails its readiness probe. If I check the pod logs I just see:

Detected CPU Cores: 4
Starting api worker
Starting microservices worker
[Nest] 7  - 10/04/2024, 10:30:08 PM     LOG [Microservices:EventRepository] Initialized websocket server
[Nest] 7  - 10/04/2024, 10:30:08 PM     LOG [Microservices:MapRepository] Initializing metadata repository
[Nest] 17  - 10/04/2024, 10:30:08 PM     LOG [Api:EventRepository] Initialized websocket server

For the immich-postgresql-secret I have just created a generic secret in the same namespace like:

kubectl create secret generic immich-postgresql-secret -n immich --from-literal=postgres-password="cool-secret-password" --from-literal=password="cool-secret-password" --from-literal=username="immich" --from-literal=database="immich"

bo0tzz commented 1 month ago

I think the pod is just taking a long time to start. You might want to try making the readiness timeout a bit longer.

morotsgurka commented 1 month ago

Hi!

I'm not sure if this is what you meant, but I increased the readiness and startup probe timeouts by modifying the server values:

# Immich components
server:
  enabled: true
  probes:
    readiness:
      custom: true
      spec:
        initialDelaySeconds: 30
        periodSeconds: 10
        timeoutSeconds: 5
        failureThreshold: 3
    startup:
      custom: true
      spec:
        initialDelaySeconds: 0
        timeoutSeconds: 5
        ## With periodSeconds 10 and failureThreshold 60, it has a maximum of 600 seconds to start up before it fails
        periodSeconds: 10
        failureThreshold: 60 

By inspecting the pod I can see that the changes are reflected on the pod:

Liveness
http-get http://:3001/api/server/ping delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness
http-get http://:3001/api/server/ping delay=30s timeout=5s period=10s #success=1 #failure=3
Startup
http-get http://:3001/api/server/ping delay=0s timeout=5s period=10s #success=1 #failure=60

But I still get:

Startup probe failed: Get "http://10.42.2.208:3001/api/server/ping": dial tcp 10.42.2.208:3001: connect: connection refused

All I see in the pod logs is still:

Detected CPU Cores: 4
Starting api worker
Starting microservices worker
[Nest] 7  - 10/05/2024, 10:09:17 AM     LOG [Microservices:EventRepository] Initialized websocket server
[Nest] 7  - 10/05/2024, 10:09:17 AM     LOG [Microservices:MapRepository] Initializing metadata repository
[Nest] 17  - 10/05/2024, 10:09:17 AM     LOG [Api:EventRepository] Initialized websocket server

madeofpendletonwool commented 1 month ago

I'm having a similar issue as well:

Warning  Unhealthy  <invalid> (x3 over <invalid>)  kubelet  Startup probe failed: Get "http://10.42.2.120:3001/api/server/ping": dial tcp 10.42.2.120:3001: connect: connection refused
Warning  Unhealthy  <invalid>                      kubelet  Startup probe failed: HTTP probe failed with statuscode: 404

Digging into the logs I see:

[Nest] 17  - 10/05/2024, 2:20:02 PM     LOG [Api:Bootstrap] Immich Server is listening on http://[::1]:3001 [v1.106.1] [PRODUCTION]
Error: connect ECONNREFUSED 10.43.247.217:6379
    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1606:16) {
  errno: -111,
  code: 'ECONNREFUSED',
  syscall: 'connect',
  address: '10.43.247.217',
  port: 6379
}
Error: connect ETIMEDOUT
    at Socket.<anonymous> (/usr/src/app/node_modules/ioredis/built/Redis.js:170:41)
    at Object.onceWrapper (node:events:633:28)
    at Socket.emit (node:events:519:28)
    at Socket._onTimeout (node:net:590:8)
    at listOnTimeout (node:internal/timers:573:17)
    at process.processTimers (node:internal/timers:514:7) {
  errorno: 'ETIMEDOUT',
  code: 'ETIMEDOUT',
  syscall: 'connect'
}

Seemingly issues with Redis. Granted, perhaps I'm setting up Redis wrong and this is different from your issue, but I don't think so. My Redis pods are starting fine, and I'm deploying it along with the rest of the helm chart, so it should be what's provided by default.

morotsgurka commented 1 month ago

I had similar, if not the same, errors about Redis. But I think that was just initially, before the Redis pod had started. Now if I evict the Immich server pod I don't see any Redis errors.

madeofpendletonwool commented 1 month ago

Good call @morotsgurka, I think you're right. They went away for me too. The app is still not starting of course, presumably due to the probes.

AlexBevan commented 1 month ago

Same here: reverting to tag v1.116.0 is fine, but the same error occurs on v1.117.0.

morotsgurka commented 1 month ago

Same here: reverting to tag v1.116.0 is fine, but the same error occurs on v1.117.0.

Can confirm. I set the image tag to 1.116 and now it starts.

joaopedrocg27 commented 1 month ago

Same issue here:

Detected CPU Cores: 3
Starting api worker
Starting microservices worker
[Nest] 7  - 10/06/2024, 8:59:42 PM     LOG [Microservices:EventRepository] Initialized websocket server
[Nest] 16  - 10/06/2024, 8:59:43 PM     LOG [Api:EventRepository] Initialized websocket server
[Nest] 7  - 10/06/2024, 8:59:51 PM     LOG [Microservices:MapRepository] Initializing metadata repository

Using Chart 0.8.1 and 1.107.0

zlq1503 commented 1 month ago

I have the same issue; for now I have rolled back to tag v1.116.0.

boredomwontgetus commented 1 month ago

Same issue here! Last working tag for me is v1.116.2

kabakaev commented 1 month ago

Readiness probe fails because the immich-api process does not listen on TCP port 3001. I don't know why.

I have two immich instances on one kubernetes node. The old instance is running v1.116.2, whereas the new one is trying to start with v1.117.0. The new instance does not listen on TCP/3001, even though the liveness probe timeout was increased from 30 to 690 seconds to prevent killing the pod too early.
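
(The liveness override uses the same probes structure shown earlier in this thread; a rough illustration of a 690-second window, assuming a 10-second period:)

server:
  probes:
    liveness:
      custom: true
      spec:
        periodSeconds: 10
        failureThreshold: 69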

ps auxw | grep immich | grep -vE 'postgres|grep'
# root       52171 11.7  2.0 22835356 325564 ?     Sl   17:48   0:19 immich
# root       52227  4.7  1.0 11648044 174408 ?     Sl   17:48   0:07 immich-api
# root     2082901  0.2  1.6 22981404 268440 ?     Sl   Okt02  31:55 immich
# root     2083038  0.0  0.9 11658532 148792 ?     Sl   Okt02   8:37 immich-api

nsenter --net --target 2082901
# We are in the linux network namespace of the immich-server v1.116.2 container.
ss -apn | grep LISTEN
# tcp   LISTEN 0      511                *:8082                *:*        users:(("immich",pid=2082901,fd=24))
# tcp   LISTEN 0      511                *:8081                *:*        users:(("immich-api",pid=2083038,fd=19))
# tcp   LISTEN 0      511                *:3001                *:*        users:(("immich-api",pid=2083038,fd=39))
# tcp   LISTEN 0      511                *:33673               *:*        users:(("immich",pid=2082901,fd=72))
exit # back to host NS.
nsenter --net --target 40560
# We are in the linux network namespace of the immich-server v1.117.0 container.
ss -apn | grep LISTEN
# tcp   LISTEN 0      511                *:8081                 *:*         users:(("immich-api",pid=52227,fd=19))
# tcp   LISTEN 0      511                *:8082                 *:*         users:(("immich",pid=52171,fd=24))

The v1.117.0 logs are pretty short. Both v1.117.0 and v1.116.2 logs are given below.

Setting the env.IMMICH_IGNORE_MOUNT_CHECK_ERRORS: true helm value did not help.

Edit: setting env.IMMICH_LOG_LEVEL: debug or env.IMMICH_LOG_LEVEL: verbose does not change the log output. The same 3 messages are printed, then nothing.

Edit 2: when the container is killed, it prints out a fourth log line: [Nest] 17 - 10/10/2024, 5:42:47 PM LOG [Api:MapRepository] Initializing metadata repository. It is probably caused by the event log caching.
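
(For reference, the log-level and mount-check options above are set through the chart's env block, roughly like this; the exact quoting is an assumption:)

env:
  # Quoting the boolean is an assumption; env values are passed to the container as strings.
  IMMICH_IGNORE_MOUNT_CHECK_ERRORS: "true"
  IMMICH_LOG_LEVEL: verbose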

# kubectl -n immich1 logs deploy/immich-server
Detected CPU Cores: 4
Starting api worker
Starting microservices worker
[Nest] 6  - 10/10/2024, 3:53:23 PM     LOG [Microservices:EventRepository] Initialized websocket server
[Nest] 6  - 10/10/2024, 3:53:23 PM     LOG [Microservices:MapRepository] Initializing metadata repository
[Nest] 16  - 10/10/2024, 3:53:23 PM     LOG [Api:EventRepository] Initialized websocket server

# kubectl -n immich2 logs deploy/immich-server | less -R
Detected CPU Cores: 4
Starting api worker
Starting microservices worker
[Nest] 8  - 10/02/2024, 10:51:46 AM     LOG [Microservices:EventRepository] Initialized websocket server
[Nest] 18  - 10/02/2024, 10:51:46 AM     LOG [Api:EventRepository] Initialized websocket server
[Nest] 8  - 10/02/2024, 10:51:47 AM     LOG [Microservices:SystemConfigService] LogLevel=log (set via system config)
[Nest] 18  - 10/02/2024, 10:51:47 AM     LOG [Api:SystemConfigService] LogLevel=log (set via system config)
[Nest] 18  - 10/02/2024, 10:51:47 AM     LOG [Api:ServerService] Feature Flags: {  "smartSearch": true, "facialRecognition": true, "duplicateDetection": true, "map": true, "reverseGeocoding": true, "importFaces": false, "sidecar": true, "search": true, "trash": true, "oauth": false, "oauthAutoLaunch": false, "passwordLogin": true, "configFile": false, "email": false }
[Nest] 8  - 10/02/2024, 10:51:47 AM     LOG [Microservices:MapRepository] Initializing metadata repository
[Nest] 18  - 10/02/2024, 10:51:47 AM     LOG [Api:StorageService] Verifying system mount folder checks (enabled=false)
[Nest] 18  - 10/02/2024, 10:51:47 AM     LOG [Api:StorageService] Writing initial mount file for the encoded-video folder
etc.

kabakaev commented 1 month ago

Hm, something big has changed there. Previously, the MapRepository was under Microservices, but in v1.107.0 it became Api:MapRepository, even though the MapRepository code itself has not changed in 3 months. Could that be the reason for the startup issue?

kabakaev commented 1 month ago

There are only 34 commits changing the server. Someone has to bisect, build and deploy them :)

git log v1.106.2..v1.107.0 | grep server -B4 | grep '\--' |wc -l
34
cconcannon commented 1 month ago

I also experienced the same issue when I tried to install immich for the first time. All pods start except immich-server, which is continuously unhealthy due to startup probe failures:

Startup probe failed: Get "http://10.42.1.21:3001/api/server/ping": dial tcp 10.42.1.21:3001: connect: connection refused

In the logs for the server pod, there's lots of this:

Error: connect ETIMEDOUT
    at Socket.<anonymous> (/usr/src/app/node_modules/ioredis/built/Redis.js:170:41)
    at Object.onceWrapper (node:events:633:28)
    at Socket.emit (node:events:519:28)
    at Socket._onTimeout (node:net:591:8)
    at listOnTimeout (node:internal/timers:581:17)
    at process.processTimers (node:internal/timers:519:7) {
  errorno: 'ETIMEDOUT',
  code: 'ETIMEDOUT',
  syscall: 'connect'
}

I've confirmed that the same errors happen for me on v1.117.0 and v1.116.2.

morotsgurka commented 1 month ago

@cconcannon have you enabled the Redis pods? This sounds similar to the earlier comments in this thread. I also had similar issues, but only on first boot before Redis was initialized. And the issue we seem to be having is not present on 1.116.

revolunet commented 1 month ago

Same issue here with the startup probe, helm chart 0.8.1, and immich v1.117.0. Setting image.tag to v1.116.0 fixes it.
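
(In values form that's roughly the following, assuming the chart's top-level image block:)

image:
  tag: v1.116.0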

cconcannon commented 1 month ago

@morotsgurka yes, I enabled Redis and it starts successfully. I see the Redis, Postgres, and Machine Learning pods all start successfully. The server continues to have errors, even after I try deleting it.

evanreichard commented 1 month ago

I've had continuous problems with the probes and completely disabled them; that has helped in the past. But on 1.117 it hangs indefinitely for me at LOG [Api:EventRepository] Initialized websocket server and never comes online. Once I drop the tag to 1.116.2, everything works fine.

rjbez17 commented 1 month ago

I spent quite a while trying to fix this, to the point of taking a DB backup, deploying a standalone postgresql, reinstalling 116, and then upgrading again to 117. What I ultimately determined (for me at least) was that immich leaves the postgres resources at the default preset (nano: https://github.com/bitnami/charts/blob/main/bitnami/postgresql/values.yaml#L465C3-L470C26), which is quite low. After upping postgresql to large I saw more than just the three log lines in immich-server and things started working (values snippet below). You may not need large, but I have plenty of compute to throw at it, so I did.

My suspicion is that the DB migration required for 117 (or something) takes way too long on the nano preset, and the probes cause a restart, which kicks off the DB migration again (or something).
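
The change in values form was roughly this (a sketch; whether you need large or a smaller preset depends on your cluster):

postgresql:
  primary:
    resourcesPreset: large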

zlq1503 commented 4 weeks ago

@rjbez17 I verified this solution. The postgres container resources were indeed too small; after setting postgresql's resourcesPreset to "large", immich-server starts normally.

boredomwontgetus commented 4 weeks ago

I ran into the same problem when using a PG DB outside a container. Edit: Forget about that. Idiot me forgot to update the chart.

bo0tzz commented 4 weeks ago

I was pretty stumped by this issue, but in hindsight the postgres resources are the obvious culprit 🤦 With that I'll close this issue. I'm a bit hesitant to set higher default resources, as it might make the pod unschedulable for some, though leaving them this low isn't really feasible either. I'll have a think about that.

While I have you all here: I'd love your feedback on #129.

kabakaev commented 4 weeks ago

Setting postgresql.primary.resourcesPreset: large did not help. The database received a stream of geodata INSERT SQL requests, which consumed ca. 15% of one CPU core and around 100MB of RAM. Unfortunately, the immich-api service did not start up within the default 5-minute timeout of the startupProbe.

If the startup probe never succeeds, the container is killed...

Finally, it helped to increase the startupProbe timeout to 1 hour:

server:
  probes:
    startup:
      spec:
        failureThreshold: 360
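        # 360 failures x 10s period = 3600s (1 hour) before the pod is killed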

So, this issue should have been called "Startup probe failed, probably due to slow import of geo-data".

astr0n8t commented 3 weeks ago

Ugh, glad I found this. I didn't realize that the bitnami chart was adding CPU limits to my postgres pod. I would highly recommend at least suggesting that people using that chart set it to a higher preset for better performance. I've been having awful performance lately and wasn't sure why; I thought my k8s cluster was just underpowered for immich.

Some light reading for those interested in why CPU limits are bad