Open djjudas21 opened 6 months ago
I think the probes for the ML deployment don't quite work right anyways, and if a model takes too long to load (or if it's being downloaded at startup through the preload env var) I've had my non-openvino instances occasionally get killed as well. It could be that the openvino image just takes slightly longer to load things, thus exposing that issue?
I'll have a play with the probes and see if I can get it more stable. The default values look pretty aggressive IMO: a timeout of 1s and a period of 10s, with a failure threshold of 3. That means a container that stops answering probes gets killed after roughly 30 seconds, which is nowhere near enough headroom for a process that can block for minutes while a model loads.
livenessProbe:
  failureThreshold: 3
  httpGet:
    path: /ping
    port: http
    scheme: HTTP
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
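For comparison, a more forgiving spec along these lines (the numbers are illustrative, not values I've tested against the chart) would give a busy worker several minutes before Kubernetes gives up on it:

livenessProbe:
  failureThreshold: 6
  httpGet:
    path: /ping
    port: http
    scheme: HTTP
  initialDelaySeconds: 60
  periodSeconds: 30
  successThreshold: 1
  timeoutSeconds: 10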
Eww, they're hard-coded... https://github.com/immich-app/immich-charts/blob/main/charts/immich/templates/machine-learning.yaml#L28
OK so I manually configured the probes to have longer timeouts and a greater tolerance of failed probes. It didn't get killed by Kubernetes, but it crashed in a nasty way. I'm well out of my depth with this.
[jonathan@poseidon immich]$ kubectl logs -f immich-machine-learning-78f87fcc4b-5s84r
[04/19/24 15:18:49] INFO Starting gunicorn 21.2.0
[04/19/24 15:18:49] INFO Listening at: http://[::]:3003 (10)
[04/19/24 15:18:49] INFO Using worker: app.config.CustomUvicornWorker
[04/19/24 15:18:49] INFO Booting worker with pid: 14
[04/19/24 15:18:52] INFO Started server process [14]
[04/19/24 15:18:52] INFO Waiting for application startup.
[04/19/24 15:18:52] INFO Created in-memory cache with unloading after 300s
of inactivity.
[04/19/24 15:18:52] INFO Initialized request thread pool with 4 threads.
[04/19/24 15:18:52] INFO Application startup complete.
[04/19/24 15:18:54] INFO Setting 'ViT-B-32__openai' execution providers to
['OpenVINOExecutionProvider',
'CPUExecutionProvider'], in descending order of
preference
[04/19/24 15:18:54] INFO Loading clip model 'ViT-B-32__openai' to memory
[04/19/24 15:18:54] INFO Setting 'buffalo_l' execution providers to
['OpenVINOExecutionProvider',
'CPUExecutionProvider'], in descending order of
preference
[04/19/24 15:20:19] INFO Loading facial recognition model 'buffalo_l' to
memory
[04/19/24 15:21:11] ERROR Exception in ASGI application
╭─────── Traceback (most recent call last) ───────╮
│ /usr/src/app/main.py:118 in predict │
│ │
│ 115 │ │
│ 116 │ model = await load(await model_cache. │
│ ttl=settings.model_ttl, **kwargs)) │
│ 117 │ model.configure(**kwargs) │
│ ❱ 118 │ outputs = await run(model.predict, in │
│ 119 │ return ORJSONResponse(outputs) │
│ 120 │
│ 121 │
│ │
│ /usr/src/app/main.py:125 in run │
│ │
│ 122 async def run(func: Callable[..., Any], i │
│ 123 │ if thread_pool is None: │
│ 124 │ │ return func(inputs) │
│ ❱ 125 │ return await asyncio.get_running_loop │
│ 126 │
│ 127 │
│ 128 async def load(model: InferenceModel) -> │
│ │
│ /usr/lib/python3.10/concurrent/futures/thread.p │
│ y:58 in run │
│ │
│ /usr/src/app/models/base.py:59 in predict │
│ │
│ 56 │ │ self.load() │
│ 57 │ │ if model_kwargs: │
│ 58 │ │ │ self.configure(**model_kwargs │
│ ❱ 59 │ │ return self._predict(inputs) │
│ 60 │ │
│ 61 │ @abstractmethod │
│ 62 │ def _predict(self, inputs: Any) -> An │
│ │
│ /usr/src/app/models/facial_recognition.py:49 in │
│ _predict │
│ │
│ 46 │ │ else: │
│ 47 │ │ │ decoded_image = image │
│ 48 │ │ assert is_ndarray(decoded_image, n │
│ ❱ 49 │ │ bboxes, kpss = self.det_model.dete │
│ 50 │ │ if bboxes.size == 0: │
│ 51 │ │ │ return [] │
│ 52 │ │ assert is_ndarray(kpss, np.float32 │
│ │
│ /opt/venv/lib/python3.10/site-packages/insightf │
│ ace/model_zoo/retinaface.py:224 in detect │
│ │
│ 221 │ │ det_img = np.zeros( (input_size[1 │
│ 222 │ │ det_img[:new_height, :new_width, │
│ 223 │ │ │
│ ❱ 224 │ │ scores_list, bboxes_list, kpss_li │
│ 225 │ │ │
│ 226 │ │ scores = np.vstack(scores_list) │
│ 227 │ │ scores_ravel = scores.ravel() │
│ │
│ /opt/venv/lib/python3.10/site-packages/insightf │
│ ace/model_zoo/retinaface.py:152 in forward │
│ │
│ 149 │ │ kpss_list = [] │
│ 150 │ │ input_size = tuple(img.shape[0:2] │
│ 151 │ │ blob = cv2.dnn.blobFromImage(img, │
│ (self.input_mean, self.input_mean, self.i │
│ ❱ 152 │ │ net_outs = self.session.run(self. │
│ 153 │ │ │
│ 154 │ │ input_height = blob.shape[2] │
│ 155 │ │ input_width = blob.shape[3] │
│ │
│ /opt/venv/lib/python3.10/site-packages/onnxrunt │
│ ime/capi/onnxruntime_inference_collection.py:22 │
│ 0 in run │
│ │
│ 217 │ │ if not output_names: │
│ 218 │ │ │ output_names = [output.name │
│ 219 │ │ try: │
│ ❱ 220 │ │ │ return self._sess.run(output │
│ 221 │ │ except C.EPFail as err: │
│ 222 │ │ │ if self._enable_fallback: │
│ 223 │ │ │ │ print(f"EP Error: {err!s │
╰─────────────────────────────────────────────────╯
RuntimeException: [ONNXRuntimeError] : 6 :
RUNTIME_EXCEPTION : Encountered unknown exception
in Run()
For the first issue, models take some time on their first load when using OpenVINO because they're compiled to that format. But everything is supposed to happen in a separate thread pool without blocking the main thread, so I'm not sure what's causing this to block.
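The hand-off itself is just the standard asyncio pattern, something like this (a simplified sketch, not the exact code in main.py):

import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable

thread_pool = ThreadPoolExecutor(max_workers=4)

async def run(func: Callable[..., Any], inputs: Any) -> Any:
    # Offload the blocking model call to the thread pool so the event
    # loop can keep serving /ping in the meantime.
    return await asyncio.get_running_loop().run_in_executor(thread_pool, func, inputs)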
For the second issue, OpenVINO is buggy and doesn't currently work with facial recognition, see #8226.
Thanks for clarifying about the OpenVINO bug, I'm following that one now.
The startup time for the machine learning container isn't straightforward though. If it loaded the models immediately when the container started, that would be fine, because you can define a startupProbe in Kubernetes to protect the container during that period. But instead the container starts up quickly, and a model is not loaded until an ML request comes in from Immich, which may be long after the container started. So there's no way to work around this in Kubernetes other than making the probes very generous.
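For the sake of completeness, if the models were loaded eagerly, a startupProbe roughly like this (numbers are illustrative) would cover the slow first start while leaving the steady-state liveness probe strict:

startupProbe:
  httpGet:
    path: /ping
    port: http
    scheme: HTTP
  periodSeconds: 10
  failureThreshold: 60   # up to ~10 minutes for the initial model compile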
I'm no software engineer, but it does sound odd that the main thread gets blocked.
In my case, I have no choice but to disable GPU acceleration for ML while OpenVINO doesn't support facial recognition, but I'll leave this issue open because it sounds like there is some investigation to be done with the thread pools, as @bo0tzz said that also affects the non-openvino container.
Thanks for your help, @mertalev. Have a nice weekend!
There are envs to preload certain models if it helps: setting MACHINE_LEARNING_PRELOAD__CLIP=ViT-B-32__openai and MACHINE_LEARNING_PRELOAD__FACIAL_RECOGNITION=buffalo_l will eagerly load those models at startup without waiting for a request.
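In the Helm chart these land under the machine-learning env block, e.g.:

env:
  MACHINE_LEARNING_PRELOAD__CLIP: ViT-B-32__openai
  MACHINE_LEARNING_PRELOAD__FACIAL_RECOGNITION: buffalo_l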
I think this is because of the Python GIL. When it compiles to OpenVINO, it holds onto the GIL and prevents other threads from executing. That means it can't respond to probes during that time. Not sure what I can do about that short of putting it in a subprocess or something.
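A hypothetical sketch of that subprocess idea (not current Immich code; the names are made up) would hand both the load and the inference to a ProcessPoolExecutor, so only the child process's GIL is held during the OpenVINO compile:

import asyncio
from concurrent.futures import ProcessPoolExecutor

# One worker process owns the model; a long compile can hold its GIL
# without stalling the web server's event loop, so /ping stays responsive.
process_pool = ProcessPoolExecutor(max_workers=1)

def predict_in_subprocess(model_name: str, inputs: bytes) -> list:
    # Hypothetical helper: load (and cache) the model inside the child
    # process, then run inference there, since the session can't be
    # pickled back to the parent.
    ...

async def predict(model_name: str, inputs: bytes) -> list:
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(process_pool, predict_in_subprocess, model_name, inputs)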
I can confirm this is still a problem. With #8226 now resolved, I had another crack at enabling hardware acceleration for my ML. I'm using the immich-machine-learning:main-openvino image. It starts up properly and appears stable at idle, but once the first ML job is started, the container stops responding and gets killed.
[jonathan@latitude immich]$ kubectl logs -f immich-machine-learning-66cbb9cbb8-n4wqt
[08/06/24 16:23:27] INFO Starting gunicorn 22.0.0
[08/06/24 16:23:27] INFO Listening at: http://[::]:3003 (9)
[08/06/24 16:23:27] INFO Using worker: app.config.CustomUvicornWorker
[08/06/24 16:23:27] INFO Booting worker with pid: 10
[08/06/24 16:23:32] INFO Started server process [10]
[08/06/24 16:23:32] INFO Waiting for application startup.
[08/06/24 16:23:32] INFO Created in-memory cache with unloading after 300s
of inactivity.
[08/06/24 16:23:32] INFO Initialized request thread pool with 4 threads.
[08/06/24 16:23:32] INFO Application startup complete.
[08/06/24 16:58:04] INFO Loading visual model 'ViT-B-32__openai' to memory
[08/06/24 16:58:04] INFO Setting execution providers to
['OpenVINOExecutionProvider',
'CPUExecutionProvider'], in descending order of
preference
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Created 31s (x2 over 35m) kubelet Created container immich-machine-learning
Warning Unhealthy 31s (x3 over 51s) kubelet Liveness probe failed: Get "http://10.1.199.105:3003/ping": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Normal Killing 31s kubelet Container immich-machine-learning failed liveness probe, will be restarted
Warning Unhealthy 31s kubelet Readiness probe failed: Get "http://10.1.199.105:3003/ping": read tcp 192.168.0.57:59722->10.1.199.105:3003: read: connection reset by peer
Normal Pulled 31s kubelet Container image "ghcr.io/immich-app/immich-machine-learning:main-openvino" already present on machine
Normal Started 30s (x2 over 35m) kubelet Started container immich-machine-learning
Warning Unhealthy 30s (x3 over 35m) kubelet Readiness probe failed: Get "http://10.1.199.105:3003/ping": dial tcp 10.1.199.105:3003: connect: connection refused
Warning Unhealthy 28s (x5 over 35m) kubelet Readiness probe failed: Get "http://10.1.199.105:3003/ping": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Starting up with MACHINE_LEARNING_PRELOAD__CLIP=ViT-B-32__openai and MACHINE_LEARNING_PRELOAD__FACIAL_RECOGNITION=buffalo_l does indeed load the models up front, but that just causes the container to fail its probes earlier.
The models only take that long the first time they're compiled. You can preload them and set a really high timeout for the probes to let them get compiled.
Good idea. Unfortunately that brings me to a second issue with the Helm chart, where the startupProbe is hardcoded to false, but I have added initialDelaySeconds to the liveness and readiness probes.
Here's my full, working ML block for a Kubernetes deployment via Helm chart, because examples/documentation seem a bit lacking for this :slightly_smiling_face:
machine-learning:
  enabled: true
  image:
    # Specify OpenVINO image variant
    tag: main-openvino
  env:
    TRANSFORMERS_CACHE: /cache
    # Load ML models up front
    MACHINE_LEARNING_PRELOAD__CLIP: ViT-B-32__openai
    MACHINE_LEARNING_PRELOAD__FACIAL_RECOGNITION: buffalo_l
  resources:
    requests:
      memory: 1500Mi
      cpu: 1
      gpu.intel.com/i915: 1
    limits:
      gpu.intel.com/i915: 1
  # Override probes to allow slow ML startup
  probes:
    liveness:
      spec:
        initialDelaySeconds: 120
    readiness:
      spec:
        initialDelaySeconds: 120
  persistence:
    cache:
      enabled: true
      size: 10Gi
      # Optional: Set this to pvc to avoid downloading the ML models every start.
      type: pvc
      accessMode: ReadWriteMany
      storageClass: truenas
The bug
I have been using Immich with Kubernetes without problems, via the Helm chart. This week I have been working on enabling GPU support.
I noticed that a deployment of the image immich-app/immich-machine-learning:v1.101.0 works completely fine, but when I switch to immich-app/immich-machine-learning:v1.101.0-openvino with no other changes, that container gets into CrashLoopBackOff after the model is loaded, because the liveness probes fail and the container gets repeatedly restarted. This issue is specific to Kubernetes, but it clearly demonstrates that there is some kind of difference in behaviour between the standard and openvino variants of the image, so I feel this issue belongs here.
The OS that Immich Server is running on
Kubernetes
Version of Immich Server
v1.101.0
Version of Immich Mobile App
1.101.0 build.147