Closed (morotsgurka, 4 weeks ago)
I think the pod is just taking a long time to start. You might want to try making the readiness timeout a bit longer.
Hi!
Not sure if this is what you meant, but I increased the readiness and startup probe timeouts by modifying the server values:
# Immich components
server:
  enabled: true
  probes:
    readiness:
      custom: true
      spec:
        initialDelaySeconds: 30
        periodSeconds: 10
        timeoutSeconds: 5
        failureThreshold: 3
    startup:
      custom: true
      spec:
        initialDelaySeconds: 0
        timeoutSeconds: 5
        ## With periodSeconds 10 and failureThreshold 60 this gives a maximum of 10*60=600 seconds to start up before it fails
        periodSeconds: 10
        failureThreshold: 60
By inspecting the pod I can see that the changes are reflected:
Liveness
http-get http://:3001/api/server/ping delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness
http-get http://:3001/api/server/ping delay=30s timeout=5s period=10s #success=1 #failure=3
Startup
http-get http://:3001/api/server/ping delay=0s timeout=5s period=10s #success=1 #failure=60
But I still get:
Startup probe failed: Get "http://10.42.2.208:3001/api/server/ping": dial tcp 10.42.2.208:3001: connect: connection refused
All I see in the pod logs is still:
Detected CPU Cores: 4
Starting api worker
Starting microservices worker
[Nest] 7 - 10/05/2024, 10:09:17 AM LOG [Microservices:EventRepository] Initialized websocket server
[Nest] 7 - 10/05/2024, 10:09:17 AM LOG [Microservices:MapRepository] Initializing metadata repository
[Nest] 17 - 10/05/2024, 10:09:17 AM LOG [Api:EventRepository] Initialized websocket server
I'm having a similar issue as well.
Warning Unhealthy <invalid> (x3 over <invalid>) kubelet Startup probe failed: Get "http://10.42.2.120:3001/api/server/ping": dial tcp 10.42.2.120:3001: connect: connection refused
Warning Unhealthy <invalid> kubelet Startup probe failed: HTTP probe failed with statuscode: 404
Digging into the logs, I see:
[Nest] 17 - 10/05/2024, 2:20:02 PM LOG [Api:Bootstrap] Immich Server is listening on http://[::1]:3001 [v1.106.1] [PRODUCTION]
Error: connect ECONNREFUSED 10.43.247.217:6379
at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1606:16) {
errno: -111,
code: 'ECONNREFUSED',
syscall: 'connect',
address: '10.43.247.217',
port: 6379
}
Error: connect ETIMEDOUT
at Socket.
Seemingly issues with Redis. Granted, perhaps I'm setting up Redis wrong and this is different from your issue, but I don't think so. My Redis pods are starting fine, and I'm deploying Redis along with the rest of the helm chart, so it should be what's provided by default.
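For reference, the bundled Redis discussed here is toggled in the chart values; a minimal sketch, assuming the chart wraps its Redis subchart under a top-level redis key:
redis:
  enabled: true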
I had similar, if not the same, errors about Redis. But I think that was just initially, before the Redis pod had started? Now if I evict the Immich server pod I don't see any Redis errors.
Good call @morotsgurka. I think you're right. They went away for me too. App still not starting of course, presumably due to the probes.
Same here; reverting to tag v1.116.0 is fine, but the same error occurs with v1.117.0.
Can confirm. Added an image tag of 1.116 and now it starts.
Same issue here:
Detected CPU Cores: 3
Starting api worker
Starting microservices worker
[Nest] 7 - 10/06/2024, 8:59:42 PM LOG [Microservices:EventRepository] Initialized websocket server
[Nest] 16 - 10/06/2024, 8:59:43 PM LOG [Api:EventRepository] Initialized websocket server
[Nest] 7 - 10/06/2024, 8:59:51 PM LOG [Microservices:MapRepository] Initializing metadata repository
Using Chart 0.8.1 and 1.107.0
I have the same issue; for now I have rolled back to tag v1.116.0.
Same issue here! The last working tag for me is v1.116.2.
The readiness probe fails because the immich-api process does not listen on TCP port 3001, and I don't know why.
I have two Immich instances on one Kubernetes node. The old instance is running v1.116.2, whereas the new one is trying to start with v1.117.0. The new instance does not listen on TCP/3001, although the liveness probe timeout was increased from 30 to 690 seconds to prevent the pod from being killed too early.
ps auxw | grep immich | grep -vE 'postgres|grep'
# root 52171 11.7 2.0 22835356 325564 ? Sl 17:48 0:19 immich
# root 52227 4.7 1.0 11648044 174408 ? Sl 17:48 0:07 immich-api
# root 2082901 0.2 1.6 22981404 268440 ? Sl Okt02 31:55 immich
# root 2083038 0.0 0.9 11658532 148792 ? Sl Okt02 8:37 immich-api
nsenter --net --target 2082901
# We are in the linux network namespace of the immich-server v1.116.2 container.
ss -apn | grep LISTEN
# tcp LISTEN 0 511 *:8082 *:* users:(("immich",pid=2082901,fd=24))
# tcp LISTEN 0 511 *:8081 *:* users:(("immich-api",pid=2083038,fd=19))
# tcp LISTEN 0 511 *:3001 *:* users:(("immich-api",pid=2083038,fd=39))
# tcp LISTEN 0 511 *:33673 *:* users:(("immich",pid=2082901,fd=72))
exit # back to host NS.
nsenter --net --target 40560
# We are in the linux network namespace of the immich-server v1.117.0 container.
ss -apn | grep LISTEN
# tcp LISTEN 0 511 *:8081 *:* users:(("immich-api",pid=52227,fd=19))
# tcp LISTEN 0 511 *:8082 *:* users:(("immich",pid=52171,fd=24))
The v1.117.0 logs are pretty short. Both v1.117.0 and v1.116.2 logs are given below.
Setting the env.IMMICH_IGNORE_MOUNT_CHECK_ERRORS: true helm value did not help.
Edit: setting env.IMMICH_LOG_LEVEL: debug or env.IMMICH_LOG_LEVEL: verbose does not change the log output. The same 3 messages are printed, then nothing.
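For reference, those env overrides go under the chart's env block; a minimal sketch using the key names from this thread:
env:
  IMMICH_LOG_LEVEL: verbose
  IMMICH_IGNORE_MOUNT_CHECK_ERRORS: true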
Edit 2: when the container is killed, it prints a fourth log line: [Nest] 17 - 10/10/2024, 5:42:47 PM LOG [Api:MapRepository] Initializing metadata repository. It is probably caused by the event log caching.
# kubectl -n immich1 logs deploy/immich-server
Detected CPU Cores: 4
Starting api worker
Starting microservices worker
[Nest] 6 - 10/10/2024, 3:53:23 PM LOG [Microservices:EventRepository] Initialized websocket server
[Nest] 6 - 10/10/2024, 3:53:23 PM LOG [Microservices:MapRepository] Initializing metadata repository
[Nest] 16 - 10/10/2024, 3:53:23 PM LOG [Api:EventRepository] Initialized websocket server
# kubectl -n immich2 logs deploy/immich-server | less -R
Detected CPU Cores: 4
Starting api worker
Starting microservices worker
[Nest] 8 - 10/02/2024, 10:51:46 AM LOG [Microservices:EventRepository] Initialized websocket server
[Nest] 18 - 10/02/2024, 10:51:46 AM LOG [Api:EventRepository] Initialized websocket server
[Nest] 8 - 10/02/2024, 10:51:47 AM LOG [Microservices:SystemConfigService] LogLevel=log (set via system config)
[Nest] 18 - 10/02/2024, 10:51:47 AM LOG [Api:SystemConfigService] LogLevel=log (set via system config)
[Nest] 18 - 10/02/2024, 10:51:47 AM LOG [Api:ServerService] Feature Flags: { "smartSearch": true, "facialRecognition": true, "duplicateDetection": true, "map": true, "reverseGeocoding": true, "importFaces": false, "sidecar": true, "search": true, "trash": true, "oauth": false, "oauthAutoLaunch": false, "passwordLogin": true, "configFile": false, "email": false }
[Nest] 8 - 10/02/2024, 10:51:47 AM LOG [Microservices:MapRepository] Initializing metadata repository
[Nest] 18 - 10/02/2024, 10:51:47 AM LOG [Api:StorageService] Verifying system mount folder checks (enabled=false)
[Nest] 18 - 10/02/2024, 10:51:47 AM LOG [Api:StorageService] Writing initial mount file for the encoded-video folder
etc.
Hm, something big has changed there. Previously, the MapRepository was under Microservices, but in v1.117.0 it became Api:MapRepository, even though the MapRepository code itself has not changed in 3 months.
Could that be the reason for the startup issue?
There are only 34 commits changing the server. Someone has to bisect, build and deploy them :)
git log v1.116.2..v1.117.0 | grep server -B4 | grep '\--' | wc -l
34
I also experienced the same issue when I tried to install Immich for the first time. All pods start except for immich-server, which is continuously unhealthy due to startup probe failure:
Startup probe failed: Get "http://10.42.1.21:3001/api/server/ping": dial tcp 10.42.1.21:3001: connect: connection refused
In the logs for the server pod, there's lots of this:
Error: connect ETIMEDOUT
at Socket.<anonymous> (/usr/src/app/node_modules/ioredis/built/Redis.js:170:41)
at Object.onceWrapper (node:events:633:28)
at Socket.emit (node:events:519:28)
at Socket._onTimeout (node:net:591:8)
at listOnTimeout (node:internal/timers:581:17)
at process.processTimers (node:internal/timers:519:7) {
errorno: 'ETIMEDOUT',
code: 'ETIMEDOUT',
syscall: 'connect'
}
I've confirmed that the same errors happen for me on v1.117.0 and v1.116.2.
@cconcannon have you enabled the Redis pods? This sounds similar to earlier comments in this thread. I also had similar issues, but only on first boot before Redis was initialized, and the issue we seem to be having is not present on 1.116.
Same issue here with the startup probe on helm chart 0.8.1 and Immich v1.117.0. Setting image.tag to v1.116.0 fixes it.
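For anyone applying the same workaround, the pin is roughly this in the chart values (top-level image.tag, as referenced above):
image:
  tag: v1.116.0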
@morotsgurka yes I enabled Redis and Redis starts successfully. I see Redis, Postgres, and Machine Learning pods all successfully start. Server continues to have errors, even after I try deleting it.
I've had continuous problems with the probes and have completely disabled them; that has helped in the past. But on 1.117 it hangs indefinitely for me on LOG [Api:EventRepository] Initialized websocket server and never comes online. Once I drop the tag to 1.116.2, everything works fine.
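For context, disabling the probes as described here would look roughly like the sketch below; the enabled flags are an assumption based on the common chart library, alongside the custom/spec fields shown earlier in this thread:
server:
  probes:
    liveness:
      enabled: false
    readiness:
      enabled: false
    startup:
      enabled: false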
I spent quite a while trying to fix this, to the point of taking a DB backup, deploying a standalone postgresql, reinstalling 116, and then upgrading again to 117. What I ultimately determined (for me at least) was that immich leaves the postgres resources at the default (nano: https://github.com/bitnami/charts/blob/main/bitnami/postgresql/values.yaml#L465C3-L470C26), which is quite low. After upping postgresql to large I saw more than just the three logs in immich-server, and things started working. You may not need large, but I have plenty of compute to throw at it so I did.
My suspicion is that the DB migration required for 117 (or something) takes way too long on the nano preset, and the probes cause a restart, which kicks the DB migration off again (or something).
@rjbez17 I verified this solution: because the postgres container resources are too small, I set the postgresql resourcesPreset to "large", and immich-server now starts normally.
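In values form, that change is roughly the following, using the postgresql.primary.resourcesPreset path mentioned later in this thread (presets are defined by the bitnami postgresql values linked above):
postgresql:
  primary:
    resourcesPreset: large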
I ran into the same problem when using a PG DB outside a container. Edit: forget about that; I simply forgot to update the chart.
I was pretty stumped by this issue, but in hindsight the postgres resources explanation is obvious 🤦 With that I'll close this issue. I'm a bit hesitant to set higher default resources, as that might make the pod unschedulable for some, though leaving it this low isn't really feasible either. I'll have a think about that.
While I have you all here: I'd love your feedback on #129.
Setting postgresql.primary.resourcesPreset: large did not help. The database received a flow of geodata INSERT SQL requests, which consumed ca. 15% of one CPU core and around 100 MB RAM. Unfortunately, the immich-api service did not start up within the default 5-minute timeout of the startupProbe.
If the startup probe never succeeds, the container is killed...
Finally, it helped to increase the startupProbe timeout to 1 hour (with a periodSeconds of 10, failureThreshold: 360 gives 360 * 10 = 3600 seconds):
server:
  probes:
    startup:
      spec:
        failureThreshold: 360
So, this issue should have been called "Startup probe failed", probably due to slow import of geo-data.
Ugh, glad I found this. I didn't realize that the bitnami chart was adding CPU limits to my postgres pod. I would highly recommend at least suggesting that people using that chart set it to a higher preset for better performance. I've been having awful performance lately and wasn't sure why; I thought my k8s cluster was just underpowered for Immich.
Some light reading for those interested in why CPU limits are bad.
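If you would rather skip the presets entirely, the bitnami subchart also accepts explicit resources, which per the bitnami values take precedence over resourcesPreset; a sketch, assuming the same postgresql.primary path as above and with sizing values picked arbitrarily:
postgresql:
  primary:
    resources:
      requests:
        cpu: 1000m
        memory: 2Gi
      limits:
        # deliberately no CPU limit, per the advice above
        memory: 2Gi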
Hi!
I'm running k3s on 3 nodes together with Longhorn and FluxCD. immich-chart version: 0.8.1
Found this issue on the Immich repo with a similar problem, but there the cause seemed to be that postgresql had not been enabled, which I have done.
This is my values file:
I have created volumes in Longhorn with PVCs for each claim, and they all attach normally. I can see that all pods are running except immich-server, which fails its readiness probe. If I check the pod logs I just see:
For the immich-postgresql-secret I have just created a generic secret in the same namespace like: