immich-app / immich-charts

Helm chart implementation of Immich
https://immich.app
GNU Affero General Public License v3.0

Machine Learning CrashLoopBackoff #37

Closed · Y0ngg4n closed this 11 months ago

Y0ngg4n commented 1 year ago

I always get a CrashLoopBackOff with the machine-learning container:

INFO:     Started server process [7]
INFO:     Waiting for application startup.
Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.
bo0tzz commented 1 year ago

This might be a duplicate of #27, can you try the suggestion in there?
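
For reference, here is a minimal values.yaml sketch of what that probe tuning can look like, using the same probes schema that appears later in this thread. The delay and threshold values are illustrative assumptions, not the exact suggestion from #27, and they assume the chart passes spec straight through to the Kubernetes probe definition:

machine-learning:
  probes:
    startup:
      enabled: true
      spec:
        # give the container time to download models before the probe can fail the pod
        initialDelaySeconds: 120
        periodSeconds: 10
        failureThreshold: 30
    liveness:
      enabled: true
      spec:
        initialDelaySeconds: 120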

Y0ngg4n commented 1 year ago

@bo0tzz I already completely disabled the liveness probe.

bo0tzz commented 1 year ago

Do you have the cache volume enabled for ML?

Y0ngg4n commented 1 year ago

@bo0tzz no

machine-learning:
  resources:
    limits:
      cpu: 200m
      memory: 500Mi
  probes:
    liveness:
      enabled: false
      spec:
        initialDelaySeconds: 90
    readiness:
      enabled: false
      spec:
        initialDelaySeconds: 90
    startup:
      enabled: false
      spec:
        initialDelaySeconds: 90
bo0tzz commented 1 year ago

That memory limit is probably too low, which would cause it to be OOMKilled.

Y0ngg4n commented 1 year ago

@bo0tzz Ah, that makes sense. What is a good limit?

bo0tzz commented 1 year ago

I have mine set to 4G, and it looks like it's idling at 1.7G right now.
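
In values.yaml terms that would look roughly like this. Only the 4G figure comes from the comment above; treat it as a starting point rather than a recommended value:

machine-learning:
  resources:
    limits:
      # roughly the 4G mentioned above; raise it if the pod still gets OOMKilled
      memory: 4Gi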

Y0ngg4n commented 1 year ago

@bo0tzz I tried it with 4G too, but that did not work either. It seems like the container starts:

INFO:     Started server process [7]
INFO:     Waiting for application startup.
Downloading (…)lve/main/config.json: 100%|██████████| 69.6k/69.6k [00:00<00:00, 657kB/s]
Downloading pytorch_model.bin: 100%|██████████| 103M/103M [00:04<00:00, 21.4MB/s] 
Downloading (…)rocessor_config.json: 100%|██████████| 266/266 [00:00<00:00, 1.41MB/s]
Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.
Downloading (…)d52eb/.gitattributes: 100%|██████████| 690/690 [00:00<00:00, 4.54MB/s]
Downloading (…)LIPModel/config.json: 100%|██████████| 4.03k/4.03k [00:00<00:00, 32.4MB/s]
Downloading (…)CLIPModel/merges.txt: 100%|██████████| 525k/525k [00:00<00:00, 2.46MB/s]
Downloading (…)rocessor_config.json: 100%|██████████| 316/316 [00:00<00:00, 2.49MB/s]
Downloading pytorch_model.bin: 100%|██████████| 605M/605M [00:28<00:00, 21.5MB/s] 
Downloading (…)cial_tokens_map.json: 100%|██████████| 389/389 [00:00<00:00, 3.04MB/s]
Downloading (…)okenizer_config.json: 100%|██████████| 604/604 [00:00<00:00, 3.92MB/s]
Downloading (…)CLIPModel/vocab.json: 100%|██████████| 961k/961k [00:00<00:00, 7.22MB/s]
Downloading (…)859cad52eb/README.md: 100%|██████████| 1.88k/1.88k [00:00<00:00, 15.2MB/s]
Downloading (…)ce_transformers.json: 100%|██████████| 116/116 [00:00<00:00, 975kB/s]
Downloading (…)cad52eb/modules.json: 100%|██████████| 122/122 [00:00<00:00, 1.06MB/s]

Then the container restarts and I get this:

INFO:     Started server process [7]
INFO:     Waiting for application startup.
Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.
download_path: /cache/facial-recognition/buffalo_l/models/buffalo_l
Downloading /cache/facial-recognition/buffalo_l/models/buffalo_l.zip from https://github.com/deepinsight/insightface/releases/download/v0.7/buffalo_l.zip...
  4%|▍         | 11396/281857 [00:24<10:29, 429.31KB/s]

Then the container restarts again. Do I have the probes configured wrong?

bo0tzz commented 1 year ago

With all the probes off, I don't see why it would still be restarting, but at least it seems like it's making it further now. Can you try adding the cache volume?
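
For anyone landing here later, enabling the ML cache volume looks roughly like this. The persistence.cache keys below follow the common-library layout this chart is built on, but the type, accessMode, and size values are assumptions, so check the chart's default values.yaml for the exact options:

machine-learning:
  persistence:
    cache:
      enabled: true
      # keep downloaded models across restarts so they are not re-fetched every time
      type: pvc
      accessMode: ReadWriteOnce
      size: 10Gi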

tvories commented 1 year ago

I'm having this same issue. Going to try the probe changes.

Edit:

With the probes turned off, the machine-learning pod comes up healthy.

Wow, it's sitting at 6.2G of memory, though. That is pretty substantial.

bo0tzz commented 11 months ago

With recent changes in how ML starts up, I think this issue should be resolved. Please reopen it if you encounter this again.