Closed dpantel closed 3 months ago
Can you test running as root instead of as a non-root user? It looks like the container has a permission issue when downloading the model and writing it to the filesystem.
The models for both face and object recognition download and work if I switch back to the root user. So it's definitely a permissions issue.
I wonder if this is part of a larger permissions issue throughout the app. As I commented today on a thread about geocoding as a non-root user, I have noticed error messages there too, even after following directions in the FAQ.
Hello,
I have the same issue as you @dpantel (using docker-compose). I followed the steps from https://immich.app/docs/install/docker-compose. I use the current release: v1.91.0.
I'm also using a non-root user for docker.
[Nest] 7 - 12/16/2023, 1:42:10 AM ERROR [JobService] Unable to run job handler (clipEncoding/clip-encode): Error: Request for clip failed with status 500: Internal Server Error
[Nest] 7 - 12/16/2023, 1:42:10 AM ERROR [JobService] Error: Request for clip failed with status 500: Internal Server Error
at MachineLearningRepository.post (/usr/src/app/dist/infra/repositories/machine-learning.repository.js:18:19)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async SmartInfoService.handleEncodeClip (/usr/src/app/dist/domain/smart-info/smart-info.service.js:102:31)
at async /usr/src/app/dist/domain/job/job.service.js:113:37
at async Worker.processJob (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:387:28)
at async Worker.retryIfFailed (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:574:24)
[Nest] 7 - 12/16/2023, 1:42:10 AM ERROR [JobService] Object:
{
"id": "077e2847-88fd-4067-bc6e-4dfeb4bd7a6d",
"source": "upload"
}
What am I supposed to do? Thanks for your help.
EDIT: I had a network issue with Docker. It seems that after the image is downloaded and the container is started, immich_machine_learning still needs to download some dependencies:
Downloading README.md: 100%|██████████| 582/582 [00:00<00:00, 3.70MB/s]
Downloading .gitattributes: 100%|██████████| 1.52k/1.52k [00:00<00:00, 8.52MB/s]
Downloading model.onnx: 100%|██████████| 16.9M/16.9M [00:03<00:00, 4.94MB/s]
Downloading model.onnx: 100%|██████████| 174M/174M [00:17<00:00, 10.0MB/s]
Fetching 4 files: 100%|██████████| 4/4 [00:17<00:00, 4.47s/it]
[12/17/23 01:05:35] INFO Loading facial recognition model 'buffalo_l'
[12/17/23 01:05:36] INFO Downloading clip model 'ViT-B-32__openai'.This may
take a while.
Downloading config.json: 100%|██████████| 196/196 [00:00<00:00, 1.15MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 472/472 [00:00<00:00, 2.10MB/s]
Downloading (…)okenizer_config.json: 100%|██████████| 704/704 [00:00<00:00, 4.32MB/s]
Downloading .gitattributes: 100%|██████████| 1.52k/1.52k [00:00<00:00, 8.60MB/s]
Downloading textual/merges.txt: 100%|██████████| 525k/525k [00:00<00:00, 8.10MB/s]
Downloading README.md: 100%|██████████| 422/422 [00:00<00:00, 1.93MB/s]
Downloading textual/tokenizer.json: 100%|██████████| 2.22M/2.22M [00:00<00:00, 6.75MB/s]
Downloading (…)/preprocess_cfg.json: 100%|██████████| 197/197 [00:00<00:00, 1.01MB/s]
Downloading textual/vocab.json: 100%|██████████| 862k/862k [00:00<00:00, 1.92MB/s]
Downloading model.onnx: 100%|██████████| 254M/254M [00:46<00:00, 5.46MB/s]
Downloading model.onnx: 100%|██████████| 351M/351M [00:53<00:00, 6.55MB/s]
Fetching 11 files: 100%|██████████| 11/11 [00:55<00:00, 5.01s/it]
[12/17/23 01:06:33] INFO Loading clip model 'ViT-B-32__openai'
Downloading model.onnx: 100%|██████████| 351M/351M [00:53<00:00, 10.6MB/s]
Seeing the same on a fresh install run as a non-root user:
[12/26/23 21:29:19] INFO Starting gunicorn 21.2.0
[12/26/23 21:29:19] INFO Listening at: http://0.0.0.0:3003 (9)
[12/26/23 21:29:19] INFO Using worker: app.config.CustomUvicornWorker
[12/26/23 21:29:19] INFO Booting worker with pid: 13
There was a problem when trying to write in your cache folder (/cache). You should set the environment variable TRANSFORMERS_CACHE to a writable directory.
[12/26/23 21:29:23] WARNING Matplotlib created a temporary cache directory at
/tmp/matplotlib-tmq8ybjj because the default path
(/.config/matplotlib) is not a writable directory;
it is highly recommended to set the MPLCONFIGDIR
environment variable to a writable directory, in
particular to speed up the import of Matplotlib and
to better support multiprocessing.
[12/26/23 21:29:28] INFO Created in-memory cache with unloading after 300s
of inactivity.
[12/26/23 21:29:28] INFO Initialized request thread pool with 4 threads.
[12/26/23 22:17:56] INFO Downloading clip model 'ViT-B-32__openai'.This may
take a while.
[12/26/23 22:17:56] INFO Downloading facial recognition model
'buffalo_l'.This may take a while.
[12/26/23 22:17:56] WARNING Failed to load clip model
'ViT-B-32__openai'.Clearing cache and retrying.
[12/26/23 22:17:56] WARNING Attempted to clear cache for model
'ViT-B-32__openai' but cache directory does not
exist.
Makes sense, as everything in the container (the mounted volume /cache, the start directory /usr/src/app, ...) is owned by root.
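For anyone who wants to confirm this from inside the container, here is a minimal Python sketch of a writability check. It is not Immich code: the helper names `is_writable_dir` and `describe_dir` are made up for illustration, and it assumes a Unix-like container where `os.getuid()` is available. It actually tries to create a file rather than only inspecting mode bits, since that is the operation that fails in the log above.

```python
import os
import tempfile
from pathlib import Path


def is_writable_dir(path: Path) -> bool:
    """Return True if `path` is an existing directory we can create files in."""
    if not path.is_dir():
        return False
    try:
        # Creating (and immediately discarding) a file exercises the same
        # operation a model download would need.
        with tempfile.TemporaryFile(dir=str(path)):
            pass
    except OSError:
        return False
    return True


def describe_dir(path: Path) -> str:
    """Summarize ownership, mode, and writability for a log line."""
    if not path.exists():
        return f"{path}: does not exist"
    st = path.stat()
    return (
        f"{path}: owner uid={st.st_uid} gid={st.st_gid} "
        f"mode={oct(st.st_mode & 0o777)} "
        f"process uid={os.getuid()} writable={is_writable_dir(path)}"
    )


if __name__ == "__main__":
    # /cache is the model cache path named in the log above.
    print(describe_dir(Path("/cache")))
```

If the owner uid printed is 0 while the process uid is not, and writable is False, that matches the situation described here.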
My only concern is that if we change the container to non-root, existing cache files would still be owned by root. The service should still be able to load these files (assuming the non-root user at least has read permission), but it won't be able to delete them or download new models until manual intervention.
We could make this more graceful by updating the cache folder's permissions in one version and changing to non-root in a later one. The goal would be to set the permissions such that the non-root user still has r/w access to the cache folder. The limitation of this is that it wouldn't help much in cases where there's a custom user set for the container, or where the admin leapfrogs straight to the non-root container.
Alternatively, we could also just bite the bullet and make this change, expecting users to run a command if they run into permission issues.
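To make the staged option concrete, here is a rough Python sketch of what the first step (opening up the cache folder's permissions while still running as root) could look like. This is only an illustration under assumptions: `open_cache_permissions` is a hypothetical name, and granting read/write to all users is the bluntest possible choice. A real change might instead chown the folder or use a shared group.

```python
import os
import stat
from pathlib import Path

# Read/write for user, group, and others.
RW_ALL = (
    stat.S_IRUSR | stat.S_IWUSR
    | stat.S_IRGRP | stat.S_IWGRP
    | stat.S_IROTH | stat.S_IWOTH
)
# Directories also need the execute bit so they can be traversed.
X_ALL = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def open_cache_permissions(cache_dir: Path) -> int:
    """Add rw (and x on directories) for everyone under cache_dir.

    Returns the number of entries whose mode was changed. Symlinks are skipped.
    """
    changed = 0
    for entry in [cache_dir, *cache_dir.rglob("*")]:
        if entry.is_symlink():
            continue
        current = stat.S_IMODE(entry.stat().st_mode)
        wanted = current | RW_ALL | (X_ALL if entry.is_dir() else 0)
        if wanted != current:
            os.chmod(entry, wanted)
            changed += 1
    return changed
```

Run once as root against the cache folder, this would leave existing model files deletable and replaceable by whatever user the container later runs as. It does nothing for the leapfrog case mentioned above, since the step never runs as root there.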
https://immich.app/docs/FAQ/#how-can-i-run-immich-as-a-non-root-user
Running ML as a non-root user is supported, as outlined in the FAQ above, so I am closing this issue for now.
The bug
When trying to run facial recognition job, it fails with the following error message:
The OS that Immich Server is running on
Debian GNU/Linux 11 (bullseye)
Version of Immich Server
1.85.0
Version of Immich Mobile App
n/a
Platform with the issue
Your docker-compose.yml content
Your .env content
Reproduction steps
Additional information
I am running Immich as a non-root user, if that makes a difference.