immich-app / immich-charts

Helm chart implementation of Immich
https://immich.app
GNU Affero General Public License v3.0

machine-learning needs less stress from the liveness probes in order to start properly #27

Closed. alexbarcelo closed this issue 12 months ago

alexbarcelo commented 1 year ago

Maybe I am doing something wrong, but I was looking at the errors on a clean install (latest version 0.1.1 with image tag v1.60.0). It seems that the liveness probe is triggering a restart before things are able to initialize themselves.

The logs for the machine-learning pod show that it is downloading the PyTorch model, and the pod is restarted before the download can finish.

I haven't seen anything in values.yaml related to the timeout or the livenessProbe. I assume that after the first start it will be faster, since I have persistence enabled, but I cannot reach that point because I am doing a clean install.

PixelJonas commented 1 year ago

Can you try increasing the timeout for the livenessProbe for the machine-learning container?

You can set it via:

machine-learning:
  probes:
    liveness:
      spec:
        initialDelaySeconds: 60

You can find the default values here https://github.com/immich-app/immich-charts/blob/main/charts/immich/templates/machine-learning.yaml
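If the spec block is passed through as a standard Kubernetes livenessProbe (which the example above assumes), the other probe fields can be relaxed the same way. A fuller override might look roughly like this, with the numbers adjusted to your cluster:

machine-learning:
  probes:
    liveness:
      spec:
        initialDelaySeconds: 60   # give the container time before the first check
        periodSeconds: 30         # check less frequently
        timeoutSeconds: 5         # allow a slower response
        failureThreshold: 5       # tolerate several failed checks before restarting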

alexbarcelo commented 12 months ago

Sorry, I read this when you commented, but I had just succeeded in getting everything started, so I wasn't ready to break my deployment for testing. Apologies.

Today I upgraded to v1.75.2 and experienced similar issues. I can confirm that relaxing the probes (adding initialDelaySeconds and increasing the periodSeconds) seemed to be beneficial.

The issues were in both immich-server and immich-machine-learning (which I assume are the most sluggish deployments in the chart, so it makes sense). The immich-server started as soon as I relaxed the livenessProbe. The machine-learning pod was downloading models again (because the upgrade lost the PVC while I was tweaking and cleaning things up), so I removed the probe entirely to let it initialize properly. Having to do so seems unfortunate, but I see in the other issue that I am not the only one.
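Removing the probe from values would presumably look something like the following, assuming the chart forwards the common library's enabled flag for probes (I haven't verified that key, so treat it as a sketch):

machine-learning:
  probes:
    liveness:
      enabled: false   # assumption: forwarded to the underlying common library's probe toggle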

Would it be possible to do the download in an initContainer, so the probes become useful again? (Although they may still need to be relaxed a bit to accommodate modest Kubernetes clusters like mine.) I am a bit rusty and maybe that doesn't make sense (or maybe it is not easy given Immich's architecture? dunno).
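A very rough sketch of what I have in mind, purely illustrative: an initContainer that shares the model cache volume and warms it before the main container (and its probes) start. The download command is just a placeholder, since I don't know whether the image ships a standalone download entrypoint, and the /cache path is an assumption about the default cache folder:

spec:
  initContainers:
    - name: download-models
      image: ghcr.io/immich-app/immich-machine-learning:v1.75.2
      # placeholder: pre-populate the model cache here so the main container starts fast
      command: ["/bin/sh", "-c", "echo 'download models into /cache'"]
      volumeMounts:
        - name: cache
          mountPath: /cache
  containers:
    - name: immich-machine-learning
      # ...existing container spec with the normal (relaxed) probes...
      volumeMounts:
        - name: cache
          mountPath: /cache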

bo0tzz commented 12 months ago

The ML system has been changed recently and should no longer download the models at startup (right @mertalev?). In theory that should alleviate this issue entirely.

mertalev commented 12 months ago

That's right, the ML service currently doesn't load or download models at startup. I might re-add downloading at startup at some point, but models won't be loaded either way so this shouldn't stress the server.