immich-machine-learning restarting

mvivaldi commented 1 year ago

Hi, after the update to 1.72.1 the machine-learning container does not start:

Back-off restarting failed container immich-machine-learning in pod immich-machine-learning-5777ffff49-kdqcr_immich(d635ea9c-2008-4bb6-b434-2be763abdcda)

With version 1.71.0 everything was working fine.

Here are the logs from the container:

Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.
ERROR:    Traceback (most recent call last):
  File "/opt/venv/lib/python3.11/site-packages/starlette/routing.py", line 677, in lifespan
    async with self.lifespan_context(app) as maybe_state:
  File "/opt/venv/lib/python3.11/site-packages/starlette/routing.py", line 566, in __aenter__
    await self._router.startup()
  File "/opt/venv/lib/python3.11/site-packages/starlette/routing.py", line 654, in startup
    await handler()
  File "/usr/src/app/main.py", line 46, in startup_event
    await load_models()
  File "/usr/src/app/main.py", line 40, in load_models
    await app.state.model_cache.get(model_name, model_type, eager=settings.eager_startup)
  File "/usr/src/app/models/cache.py", line 53, in get
    model = InferenceModel.from_model_type(model_type, model_name, **model_kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/app/models/base.py", line 78, in from_model_type
    return subclasses[model_type](model_name, **model_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/app/models/facial_recognition.py", line 27, in __init__
    super().__init__(model_name, cache_dir, **model_kwargs)
  File "/usr/src/app/models/base.py", line 25, in __init__
    loader(**model_kwargs)
  File "/usr/src/app/models/base.py", line 35, in load
    self.download(**model_kwargs)
  File "/usr/src/app/models/base.py", line 32, in download
    self._download(**model_kwargs)
  File "/usr/src/app/models/facial_recognition.py", line 32, in _download
    with zipfile.ZipFile(zip_file, "r") as zip:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/zipfile.py", line 1302, in __init__
    self._RealGetContents()
  File "/usr/local/lib/python3.11/zipfile.py", line 1369, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

ERROR:    Application startup failed. Exiting.

and the values:

machine-learning:
  enabled: true
  probes:
    liveness:
      spec:
        initialDelaySeconds: 240
  image:
    repository: ghcr.io/immich-app/immich-machine-learning
    pullPolicy: IfNotPresent
  env:
    TRANSFORMERS_CACHE: /cache
  persistence:
    cache:
      enabled: true
      size: 10Gi
      # Optional: Set this to pvc to avoid downloading the ML models every start.
      type: emptyDir
      accessMode: ReadWriteOnce
      storageClass: local-path-immich

The k8s cluster is a single-node microk8s (and I use local-path for the storage).

Thank you

bo0tzz commented 1 year ago

Can you try deleting/clearing the cache volume that the ml pod uses? It looks like there might be some bad state in there.

mvivaldi commented 1 year ago

Hi, I tried that, but I still get the same problem.

bo0tzz commented 1 year ago

@mertalev do you have any insight into why this could happen?

mvivaldi commented 1 year ago

No, I upgraded like I do with every new version... Maybe there is a problem with the downloads? Here is the content of the PVC:

├── clip
│   └── clip-ViT-B-32
│       └── sentence-transformers_clip-ViT-B-32
│           ├── 0_CLIPModel
│           │   ├── config.json
│           │   ├── merges.txt
│           │   ├── preprocessor_config.json
│           │   ├── pytorch_model.bin
│           │   ├── special_tokens_map.json
│           │   ├── tokenizer_config.json
│           │   └── vocab.json
│           ├── README.md
│           ├── config_sentence_transformers.json
│           └── modules.json
├── facial-recognition
│   └── buffalo_l
│       └── buffalo_l.zip
├── image-classification
│   └── microsoft
│       └── resnet-50
│           └── models--microsoft--resnet-50
│               ├── blobs
│               │   ├── 30289c9792e668d73c991829c8842977e2b90539
│               │   ├── 9a46cca81138ce49069d63f688ae5750882df07e
│               │   └── ff8163a1323333126706d649ce73ecd76e45d241b42d623dea6c723690cafe07
│               ├── refs
│               │   └── main
│               └── snapshots
│                   └── 4067a2728b9c93fbd67b9d5a30b03495ac74a46e
│                       ├── config.json -> ../../blobs/30289c9792e668d73c991829c8842977e2b90539
│                       ├── preprocessor_config.json -> ../../blobs/9a46cca81138ce49069d63f688ae5750882df07e
│                       └── pytorch_model.bin -> ../../blobs/ff8163a1323333126706d649ce73ecd76e45d241b42d623dea6c723690cafe07
└── version.txt

mvivaldi commented 1 year ago

OK, found the problem:

-rw-r--r-- 1 root root 123338752 Aug  7 10:41 buffalo_l.zip
root@microk8s:/data/immich/pvc-f8a82b03-c5ad-4f4d-b601-579514320972_immich_immich-machine-learning-cache/facial-recognition/buffalo_l# unzip buffalo_l.zip 
Archive:  buffalo_l.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of buffalo_l.zip or
        buffalo_l.zip.zip, and cannot find buffalo_l.zip.ZIP, period.
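
The missing end-of-central-directory record that `unzip` reports is what you would expect from a truncated download. The same check can be sketched in Python with the standard `zipfile` module. This is only an illustration, not Immich code, and the function name `describe_zip` is hypothetical:

```python
import zipfile
from pathlib import Path


def describe_zip(path: Path) -> str:
    """Return a short diagnosis of whether `path` looks like a complete zip archive."""
    if not path.exists():
        return "missing"
    # is_zipfile() looks for the end-of-central-directory record near the end
    # of the file, which is the same record `unzip` complains about.
    if not zipfile.is_zipfile(path):
        return "not a zip (possibly truncated)"
    with zipfile.ZipFile(path) as archive:
        # testzip() reads every member and returns the first name with a
        # bad CRC or header, or None if all members are readable.
        bad_member = archive.testzip()
    return "ok" if bad_member is None else f"corrupt member: {bad_member}"
```

Calling `describe_zip(Path("buffalo_l.zip"))` from the cache directory would distinguish a truncated file from an archive that is complete but has a damaged member.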

mvivaldi commented 1 year ago

I re-downloaded the file manually from https://github.com/deepinsight/insightface/releases, put it in the PVC, and now it's working.

but I tried to clean the cache at least 5 or 6 times and the download was always corrupted...

mertalev commented 1 year ago

The base class clears the cache on an OSError, but since BadZipFile isn't an OSError it would download directly on the old zip file without deleting it first. Not sure if this would cause issues.

But it's weird that removing the model cache volume manually didn't work. The actual download handler for both versions is the same, so I wouldn't expect its behavior to be different.
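
The exception hierarchy mentioned above is easy to confirm: in the standard library, `zipfile.BadZipFile` derives from `Exception`, not `OSError`. Below is a minimal sketch of a handler that treats both as a reason to discard the cached archive. It is not the actual Immich implementation, and `extract_or_discard` is a hypothetical name:

```python
import zipfile
from pathlib import Path

# BadZipFile derives from Exception directly, so `except OSError` misses it.
assert not issubclass(zipfile.BadZipFile, OSError)


def extract_or_discard(zip_path: Path, target_dir: Path) -> bool:
    """Extract zip_path into target_dir; delete the archive if it cannot be read.

    Returns True on success, False if the archive was discarded.
    """
    try:
        with zipfile.ZipFile(zip_path, "r") as archive:
            archive.extractall(target_dir)
        return True
    except (OSError, zipfile.BadZipFile):
        # Remove the bad file so the next download starts from a clean slate.
        zip_path.unlink(missing_ok=True)
        return False
```

Catching both exception types means a corrupt archive is removed instead of being left in place for the next startup to trip over.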

bo0tzz commented 1 year ago

> it would download directly on the old zip file without deleting it first. Not sure if this would cause issues.

Depending on what sort of write mode it uses I can see that potentially mangling things, yeah.
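
One common way to rule out this kind of mangling is to write the download to a temporary file in the same directory, validate it, and only then move it over the final name. The sketch below assumes the download arrives as an iterable of byte chunks. `write_archive_atomically` is a hypothetical helper, not something the thread confirms Immich does:

```python
import os
import tempfile
import zipfile
from pathlib import Path
from typing import Iterable


def write_archive_atomically(chunks: Iterable[bytes], destination: Path) -> None:
    """Write chunks to a temp file, validate it, then move it over destination."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file next to the destination so both share a filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=destination.parent, suffix=".part")
    tmp_path = Path(tmp_name)
    try:
        with os.fdopen(fd, "wb") as tmp_file:
            for chunk in chunks:
                tmp_file.write(chunk)
        if not zipfile.is_zipfile(tmp_path):
            raise zipfile.BadZipFile(f"download for {destination.name} is not a valid zip")
        # os.replace is atomic when source and destination are on the same filesystem.
        os.replace(tmp_path, destination)
    finally:
        # Cleans up a failed download; a no-op once the replace has succeeded.
        tmp_path.unlink(missing_ok=True)
```

With this approach an interrupted or corrupt download never overwrites a previously good archive, regardless of the write mode used.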

mvivaldi commented 1 year ago

OK, I did a test: I cleared the container's entire cache and restarted it. The problem is back.

root@microk8s:/data/immich/pvc-f8a82b03-c5ad-4f4d-b601-579514320972_immich_immich-machine-learning-cache/facial-recognition/buffalo_l# md5sum buffalo_l.zip 
eeaedbe8a45ebc785e6c0d484ca983b5  buffalo_l.zip
root@microk8s:/data/immich/pvc-f8a82b03-c5ad-4f4d-b601-579514320972_immich_immich-machine-learning-cache/facial-recognition/buffalo_l# ls -la
total 124488
drwxr-xr-x 2 root root         3 Aug  8 08:07 .
drwxr-xr-x 3 root root         3 Aug  8 08:07 ..
-rw-r--r-- 1 root root 127533056 Aug  8 08:08 buffalo_l.zip

The file is the wrong size. After deleting only buffalo_l.zip and restarting the pod, the newly downloaded file is OK and everything is working fine.

I don't know what the problem is, but if I am the only one affected we can close this.
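
For anyone who wants to repeat the size and `md5sum` comparison against a copy that is known to be good (for example the manually downloaded one), here is a minimal Python sketch. The helper names `file_fingerprint` and `same_file` are hypothetical, and no reference checksum is assumed:

```python
import hashlib
from pathlib import Path


def file_fingerprint(path: Path, chunk_size: int = 1 << 20) -> tuple[int, str]:
    """Return (size in bytes, MD5 hex digest), reading the file in chunks."""
    digest = hashlib.md5()
    size = 0
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
            size += len(chunk)
    return size, digest.hexdigest()


def same_file(candidate: Path, reference: Path) -> bool:
    """True when both files have identical size and MD5 digest."""
    return file_fingerprint(candidate) == file_fingerprint(reference)
```

A differing size on its own is already enough to show that the cached download was incomplete or altered.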
