Open djjudas21 opened 6 months ago
I think the probes for the ML deployment don't quite work right anyways, and if a model takes too long to load (or if it's being downloaded at startup through the preload env var) I've had my non-openvino instances occasionally get killed as well. It could be that the openvino image just takes slightly longer to load things, thus exposing that issue?
I'll have a play with the probes and see if I can get it more stable. The default values look pretty aggressive IMO: a timeout of 1s and a period of 10s, with a failure threshold of 3. That means a container that stops answering probes gets killed after roughly 30 seconds, which is nowhere near enough headroom for a process that can block for minutes while a model loads.
livenessProbe:
  failureThreshold: 3
  httpGet:
    path: /ping
    port: http
    scheme: HTTP
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
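For comparison, a more forgiving spec along these lines (the numbers are illustrative, not values I've tested against the chart) would give a busy worker several minutes before Kubernetes gives up on it:

livenessProbe:
  failureThreshold: 6
  httpGet:
    path: /ping
    port: http
    scheme: HTTP
  initialDelaySeconds: 60
  periodSeconds: 30
  successThreshold: 1
  timeoutSeconds: 10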
Eww, they're hard-coded... https://github.com/immich-app/immich-charts/blob/main/charts/immich/templates/machine-learning.yaml#L28
OK so I manually configured the probes to have longer timeouts and a greater tolerance of failed probes. It didn't get killed by Kubernetes, but it crashed in a nasty way. I'm well out of my depth with this.
[jonathan@poseidon immich]$ kubectl logs -f immich-machine-learning-78f87fcc4b-5s84r
[04/19/24 15:18:49] INFO Starting gunicorn 21.2.0
[04/19/24 15:18:49] INFO Listening at: http://[::]:3003 (10)
[04/19/24 15:18:49] INFO Using worker: app.config.CustomUvicornWorker
[04/19/24 15:18:49] INFO Booting worker with pid: 14
[04/19/24 15:18:52] INFO Started server process [14]
[04/19/24 15:18:52] INFO Waiting for application startup.
[04/19/24 15:18:52] INFO Created in-memory cache with unloading after 300s
of inactivity.
[04/19/24 15:18:52] INFO Initialized request thread pool with 4 threads.
[04/19/24 15:18:52] INFO Application startup complete.
[04/19/24 15:18:54] INFO Setting 'ViT-B-32__openai' execution providers to
['OpenVINOExecutionProvider',
'CPUExecutionProvider'], in descending order of
preference
[04/19/24 15:18:54] INFO Loading clip model 'ViT-B-32__openai' to memory
[04/19/24 15:18:54] INFO Setting 'buffalo_l' execution providers to
['OpenVINOExecutionProvider',
'CPUExecutionProvider'], in descending order of
preference
[04/19/24 15:20:19] INFO Loading facial recognition model 'buffalo_l' to
memory
[04/19/24 15:21:11] ERROR Exception in ASGI application
╭─────── Traceback (most recent call last) ───────╮
│ /usr/src/app/main.py:118 in predict │
│ │
│ 115 │ │
│ 116 │ model = await load(await model_cache. │
│ ttl=settings.model_ttl, **kwargs)) │
│ 117 │ model.configure(**kwargs) │
│ ❱ 118 │ outputs = await run(model.predict, in │
│ 119 │ return ORJSONResponse(outputs) │
│ 120 │
│ 121 │
│ │
│ /usr/src/app/main.py:125 in run │
│ │
│ 122 async def run(func: Callable[..., Any], i │
│ 123 │ if thread_pool is None: │
│ 124 │ │ return func(inputs) │
│ ❱ 125 │ return await asyncio.get_running_loop │
│ 126 │
│ 127 │
│ 128 async def load(model: InferenceModel) -> │
│ │
│ /usr/lib/python3.10/concurrent/futures/thread.p │
│ y:58 in run │
│ │
│ /usr/src/app/models/base.py:59 in predict │
│ │
│ 56 │ │ self.load() │
│ 57 │ │ if model_kwargs: │
│ 58 │ │ │ self.configure(**model_kwargs │
│ ❱ 59 │ │ return self._predict(inputs) │
│ 60 │ │
│ 61 │ @abstractmethod │
│ 62 │ def _predict(self, inputs: Any) -> An │
│ │
│ /usr/src/app/models/facial_recognition.py:49 in │
│ _predict │
│ │
│ 46 │ │ else: │
│ 47 │ │ │ decoded_image = image │
│ 48 │ │ assert is_ndarray(decoded_image, n │
│ ❱ 49 │ │ bboxes, kpss = self.det_model.dete │
│ 50 │ │ if bboxes.size == 0: │
│ 51 │ │ │ return [] │
│ 52 │ │ assert is_ndarray(kpss, np.float32 │
│ │
│ /opt/venv/lib/python3.10/site-packages/insightf │
│ ace/model_zoo/retinaface.py:224 in detect │
│ │
│ 221 │ │ det_img = np.zeros( (input_size[1 │
│ 222 │ │ det_img[:new_height, :new_width, │
│ 223 │ │ │
│ ❱ 224 │ │ scores_list, bboxes_list, kpss_li │
│ 225 │ │ │
│ 226 │ │ scores = np.vstack(scores_list) │
│ 227 │ │ scores_ravel = scores.ravel() │
│ │
│ /opt/venv/lib/python3.10/site-packages/insightf │
│ ace/model_zoo/retinaface.py:152 in forward │
│ │
│ 149 │ │ kpss_list = [] │
│ 150 │ │ input_size = tuple(img.shape[0:2] │
│ 151 │ │ blob = cv2.dnn.blobFromImage(img, │
│ (self.input_mean, self.input_mean, self.i │
│ ❱ 152 │ │ net_outs = self.session.run(self. │
│ 153 │ │ │
│ 154 │ │ input_height = blob.shape[2] │
│ 155 │ │ input_width = blob.shape[3] │
│ │
│ /opt/venv/lib/python3.10/site-packages/onnxrunt │
│ ime/capi/onnxruntime_inference_collection.py:22 │
│ 0 in run │
│ │
│ 217 │ │ if not output_names: │
│ 218 │ │ │ output_names = [output.name │
│ 219 │ │ try: │
│ ❱ 220 │ │ │ return self._sess.run(output │
│ 221 │ │ except C.EPFail as err: │
│ 222 │ │ │ if self._enable_fallback: │
│ 223 │ │ │ │ print(f"EP Error: {err!s │
╰─────────────────────────────────────────────────╯
RuntimeException: [ONNXRuntimeError] : 6 :
RUNTIME_EXCEPTION : Encountered unknown exception
in Run()
For the first issue, models take some time on their first load when using OpenVINO because they're compiled to that format. But everything is supposed to happen in a separate thread pool without blocking the main thread, so I'm not sure what's causing this to block.
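The hand-off itself is just the standard asyncio pattern, something like this (a simplified sketch, not the exact code in main.py):

import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable

thread_pool = ThreadPoolExecutor(max_workers=4)

async def run(func: Callable[..., Any], inputs: Any) -> Any:
    # Offload the blocking model call to the thread pool so the event
    # loop can keep serving /ping in the meantime.
    return await asyncio.get_running_loop().run_in_executor(thread_pool, func, inputs)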
For the second issue, OpenVINO is buggy and doesn't currently work with facial recognition, see #8226.
Thanks for clarifying about the OpenVINO bug, I'm following that one now.
The startup time for the machine learning container isn't straightforward though. If it loaded the models immediately when the container started, that would be fine, because you can define a startupProbe in Kubernetes to protect the container during that period. But instead the container starts up quickly, and a model is not loaded until an ML request comes in from Immich, which may be long after the container started. So there's no way to work around this in Kubernetes other than making the probes very generous.
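For the sake of completeness, if the models were loaded eagerly, a startupProbe roughly like this (numbers are illustrative) would cover the slow first start while leaving the steady-state liveness probe strict:

startupProbe:
  httpGet:
    path: /ping
    port: http
    scheme: HTTP
  periodSeconds: 10
  failureThreshold: 60   # up to ~10 minutes for the initial model compile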
I'm no software engineer, but it does sound odd that the main thread gets blocked.
In my case, I have no choice but to disable GPU acceleration for ML while OpenVINO doesn't support facial recognition, but I'll leave this issue open because it sounds like there is some investigation to be done with the thread pools, as @bo0tzz said that also affects the non-openvino container.
Thanks for your help, @mertalev. Have a nice weekend!
There are envs to preload certain models if it helps: setting MACHINE_LEARNING_PRELOAD__CLIP=ViT-B-32__openai and MACHINE_LEARNING_PRELOAD__FACIAL_RECOGNITION=buffalo_l will eagerly load those models at startup without waiting for a request.
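In the Helm chart these land under the machine-learning env block, e.g.:

env:
  MACHINE_LEARNING_PRELOAD__CLIP: ViT-B-32__openai
  MACHINE_LEARNING_PRELOAD__FACIAL_RECOGNITION: buffalo_l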
I think this is because of the Python GIL. When it compiles to OpenVINO, it holds onto the GIL and prevents other threads from executing. That means it can't respond to probes during that time. Not sure what I can do about that short of putting it in a subprocess or something.
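A hypothetical sketch of that subprocess idea (not current Immich code; the names are made up) would hand both the load and the inference to a ProcessPoolExecutor, so only the child process's GIL is held during the OpenVINO compile:

import asyncio
from concurrent.futures import ProcessPoolExecutor

# One worker process owns the model; a long compile can hold its GIL
# without stalling the web server's event loop, so /ping stays responsive.
process_pool = ProcessPoolExecutor(max_workers=1)

def predict_in_subprocess(model_name: str, inputs: bytes) -> list:
    # Hypothetical helper: load (and cache) the model inside the child
    # process, then run inference there, since the session can't be
    # pickled back to the parent.
    ...

async def predict(model_name: str, inputs: bytes) -> list:
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(process_pool, predict_in_subprocess, model_name, inputs)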
I can confirm this is still a problem. With #8226 now resolved, I had another crack at enabling hardware acceleration for my ML. I'm using the immich-machine-learning:main-openvino image. It starts up properly and appears stable at idle, but once the first ML job is started, the container stops responding and gets killed.
[jonathan@latitude immich]$ kubectl logs -f immich-machine-learning-66cbb9cbb8-n4wqt
[08/06/24 16:23:27] INFO Starting gunicorn 22.0.0
[08/06/24 16:23:27] INFO Listening at: http://[::]:3003 (9)
[08/06/24 16:23:27] INFO Using worker: app.config.CustomUvicornWorker
[08/06/24 16:23:27] INFO Booting worker with pid: 10
[08/06/24 16:23:32] INFO Started server process [10]
[08/06/24 16:23:32] INFO Waiting for application startup.
[08/06/24 16:23:32] INFO Created in-memory cache with unloading after 300s
of inactivity.
[08/06/24 16:23:32] INFO Initialized request thread pool with 4 threads.
[08/06/24 16:23:32] INFO Application startup complete.
[08/06/24 16:58:04] INFO Loading visual model 'ViT-B-32__openai' to memory
[08/06/24 16:58:04] INFO Setting execution providers to
['OpenVINOExecutionProvider',
'CPUExecutionProvider'], in descending order of
preference
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Created 31s (x2 over 35m) kubelet Created container immich-machine-learning
Warning Unhealthy 31s (x3 over 51s) kubelet Liveness probe failed: Get "http://10.1.199.105:3003/ping": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Normal Killing 31s kubelet Container immich-machine-learning failed liveness probe, will be restarted
Warning Unhealthy 31s kubelet Readiness probe failed: Get "http://10.1.199.105:3003/ping": read tcp 192.168.0.57:59722->10.1.199.105:3003: read: connection reset by peer
Normal Pulled 31s kubelet Container image "ghcr.io/immich-app/immich-machine-learning:main-openvino" already present on machine
Normal Started 30s (x2 over 35m) kubelet Started container immich-machine-learning
Warning Unhealthy 30s (x3 over 35m) kubelet Readiness probe failed: Get "http://10.1.199.105:3003/ping": dial tcp 10.1.199.105:3003: connect: connection refused
Warning Unhealthy 28s (x5 over 35m) kubelet Readiness probe failed: Get "http://10.1.199.105:3003/ping": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Starting up with MACHINE_LEARNING_PRELOAD__CLIP=ViT-B-32__openai and MACHINE_LEARNING_PRELOAD__FACIAL_RECOGNITION=buffalo_l does indeed load the models up front, but that just causes the container to fail its probes earlier.
The models only take that long the first time they're compiled. You can preload them and set a really high timeout for the probes to let them get compiled.
Good idea. Unfortunately that brings me to a second issue with the Helm chart, where the startupProbe is hardcoded to false, but I have added initialDelaySeconds to the liveness and readiness probes.
Here's my full, working ML block for a Kubernetes deployment via Helm chart, because examples/documentation seem a bit lacking for this :slightly_smiling_face:
machine-learning:
  enabled: true
  image:
    # Specify OpenVINO image variant
    tag: main-openvino
  env:
    TRANSFORMERS_CACHE: /cache
    # Load ML models up front
    MACHINE_LEARNING_PRELOAD__CLIP: ViT-B-32__openai
    MACHINE_LEARNING_PRELOAD__FACIAL_RECOGNITION: buffalo_l
  resources:
    requests:
      memory: 1500Mi
      cpu: 1
      gpu.intel.com/i915: 1
    limits:
      gpu.intel.com/i915: 1
  # Override probes to allow slow ML startup
  probes:
    liveness:
      spec:
        initialDelaySeconds: 120
    readiness:
      spec:
        initialDelaySeconds: 120
  persistence:
    cache:
      enabled: true
      size: 10Gi
      # Optional: Set this to pvc to avoid downloading the ML models every start.
      type: pvc
      accessMode: ReadWriteMany
      storageClass: truenas
The bug
I have been using Immich with Kubernetes without problems, via the Helm chart. This week I have been working on enabling GPU support.
I noticed that a deployment of the image immich-app/immich-machine-learning:v1.101.0 works completely fine, but when I switch to immich-app/immich-machine-learning:v1.101.0-openvino with no other changes, that container gets into CrashLoopBackOff after the model is loaded, because the liveness probes fail and the container gets repeatedly restarted. This issue is specific to Kubernetes, but it clearly demonstrates that there is some kind of difference in behaviour between the standard and openvino variants of the image, so I feel this issue belongs here.
The OS that Immich Server is running on
Kubernetes
Version of Immich Server
v1.101.0
Version of Immich Mobile App
1.101.0 build.147