immich-app / immich

High performance self-hosted photo and video management solution.
https://immich.app
GNU Affero General Public License v3.0

Run Machine Learning without root #4903

Closed dpantel closed 3 months ago

dpantel commented 1 year ago

The bug

When I try to run the facial recognition job, it fails with the following error message:

immich_machine_learning  | [11/08/23 16:00:37] INFO     Downloading facial recognition model               
immich_machine_learning  |                              'buffalo_l'.This may take a while.                 
immich_machine_learning  | [11/08/23 16:00:37] WARNING  Failed to load facial-recognition model            
immich_machine_learning  |                              'buffalo_l'.Clearing cache and retrying.           
immich_machine_learning  | [11/08/23 16:00:37] WARNING  Attempted to clear cache for model 'buffalo_l' but 
immich_machine_learning  |                              cache directory does not exist.                    
immich_machine_learning  | [11/08/23 16:00:37] INFO     Downloading facial recognition model               
immich_machine_learning  |                              'buffalo_l'.This may take a while.                 
immich_machine_learning  | Exception in ASGI application
immich_machine_learning  | Traceback (most recent call last):
immich_machine_learning  |   File "/usr/src/app/main.py", line 101, in load
immich_machine_learning  |     await loop.run_in_executor(app.state.thread_pool, _load)
immich_machine_learning  |   File "/usr/local/lib/python3.11/concurrent/futures/thread.py", line 58, in run
immich_machine_learning  |     result = self.fn(*self.args, **self.kwargs)
immich_machine_learning  |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
immich_machine_learning  |   File "/usr/src/app/main.py", line 94, in _load
immich_machine_learning  |     model.load()
immich_machine_learning  |   File "/usr/src/app/models/base.py", line 63, in load
immich_machine_learning  |     self.download()
immich_machine_learning  |   File "/usr/src/app/models/base.py", line 58, in download
immich_machine_learning  |     self._download()
immich_machine_learning  |   File "/usr/src/app/models/facial_recognition.py", line 32, in _download
immich_machine_learning  |     download_file(f"{BASE_REPO_URL}/{self.model_name}.zip", zip_file)
immich_machine_learning  |   File "/opt/venv/lib/python3.11/site-packages/insightface/utils/download.py", line 68, in download_file
immich_machine_learning  |     os.makedirs(dirname)
immich_machine_learning  |   File "<frozen os>", line 215, in makedirs
immich_machine_learning  |   File "<frozen os>", line 225, in makedirs
immich_machine_learning  | PermissionError: [Errno 13] Permission denied: '/cache/facial-recognition'
immich_machine_learning  | 
immich_machine_learning  | During handling of the above exception, another exception occurred:
immich_machine_learning  | 
immich_machine_learning  | Traceback (most recent call last):
immich_machine_learning  |   File "/opt/venv/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 435, in run_asgi
immich_machine_learning  |     result = await app(  # type: ignore[func-returns-value]
immich_machine_learning  |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
immich_machine_learning  |   File "/opt/venv/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
immich_machine_learning  |     return await self.app(scope, receive, send)
immich_machine_learning  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
immich_machine_learning  |   File "/opt/venv/lib/python3.11/site-packages/fastapi/applications.py", line 276, in __call__
immich_machine_learning  |     await super().__call__(scope, receive, send)
immich_machine_learning  |   File "/opt/venv/lib/python3.11/site-packages/starlette/applications.py", line 122, in __call__
immich_machine_learning  |     await self.middleware_stack(scope, receive, send)
immich_machine_learning  |   File "/opt/venv/lib/python3.11/site-packages/starlette/middleware/errors.py", line 184, in __call__
immich_machine_learning  |     raise exc
immich_machine_learning  |   File "/opt/venv/lib/python3.11/site-packages/starlette/middleware/errors.py", line 162, in __call__
immich_machine_learning  |     await self.app(scope, receive, _send)
immich_machine_learning  |   File "/opt/venv/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
immich_machine_learning  |     raise exc
immich_machine_learning  |   File "/opt/venv/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
immich_machine_learning  |     await self.app(scope, receive, sender)
immich_machine_learning  |   File "/opt/venv/lib/python3.11/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
immich_machine_learning  |     raise e
immich_machine_learning  |   File "/opt/venv/lib/python3.11/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
immich_machine_learning  |     await self.app(scope, receive, send)
immich_machine_learning  |   File "/opt/venv/lib/python3.11/site-packages/starlette/routing.py", line 718, in __call__
immich_machine_learning  |     await route.handle(scope, receive, send)
immich_machine_learning  |   File "/opt/venv/lib/python3.11/site-packages/starlette/routing.py", line 276, in handle
immich_machine_learning  |     await self.app(scope, receive, send)
immich_machine_learning  |   File "/opt/venv/lib/python3.11/site-packages/starlette/routing.py", line 66, in app
immich_machine_learning  |     response = await func(request)
immich_machine_learning  |                ^^^^^^^^^^^^^^^^^^^
immich_machine_learning  |   File "/opt/venv/lib/python3.11/site-packages/fastapi/routing.py", line 237, in app
immich_machine_learning  |     raw_response = await run_endpoint_function(
immich_machine_learning  |                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
immich_machine_learning  |   File "/opt/venv/lib/python3.11/site-packages/fastapi/routing.py", line 163, in run_endpoint_function
immich_machine_learning  |     return await dependant.call(**values)
immich_machine_learning  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
immich_machine_learning  |   File "/usr/src/app/main.py", line 75, in predict
immich_machine_learning  |     model = await load(await app.state.model_cache.get(model_name, model_type, **kwargs))
immich_machine_learning  |             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
immich_machine_learning  |   File "/usr/src/app/main.py", line 114, in load
immich_machine_learning  |     await loop.run_in_executor(app.state.thread_pool, _load)
immich_machine_learning  |   File "/usr/local/lib/python3.11/concurrent/futures/thread.py", line 58, in run
immich_machine_learning  |     result = self.fn(*self.args, **self.kwargs)
immich_machine_learning  |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
immich_machine_learning  |   File "/usr/src/app/main.py", line 94, in _load
immich_machine_learning  |     model.load()
immich_machine_learning  |   File "/usr/src/app/models/base.py", line 63, in load
immich_machine_learning  |     self.download()
immich_machine_learning  |   File "/usr/src/app/models/base.py", line 58, in download
immich_machine_learning  |     self._download()
immich_machine_learning  |   File "/usr/src/app/models/facial_recognition.py", line 32, in _download
immich_machine_learning  |     download_file(f"{BASE_REPO_URL}/{self.model_name}.zip", zip_file)
immich_machine_learning  |   File "/opt/venv/lib/python3.11/site-packages/insightface/utils/download.py", line 68, in download_file
immich_machine_learning  |     os.makedirs(dirname)
immich_machine_learning  |   File "<frozen os>", line 215, in makedirs
immich_machine_learning  |   File "<frozen os>", line 225, in makedirs
immich_machine_learning  | PermissionError: [Errno 13] Permission denied: '/cache/facial-recognition'
immich_microservices     | [Nest] 7  - 11/08/2023, 4:00:37 PM   ERROR [JobService] Unable to run job handler (recognizeFaces/recognize-faces): Error: Request for facial recognition failed with status 500: Internal Server Error
immich_microservices     | [Nest] 7  - 11/08/2023, 4:00:37 PM   ERROR [JobService] Error: Request for facial recognition failed with status 500: Internal Server Error
immich_microservices     |     at MachineLearningRepository.post (/usr/src/app/dist/infra/repositories/machine-learning.repository.js:18:19)
immich_microservices     |     at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
immich_microservices     |     at async PersonService.handleRecognizeFaces (/usr/src/app/dist/domain/person/person.service.js:183:23)
immich_microservices     |     at async /usr/src/app/dist/domain/job/job.service.js:108:37
immich_microservices     |     at async Worker.processJob (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:350:28)
immich_microservices     |     at async Worker.retryIfFailed (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:535:24)
immich_microservices     | [Nest] 7  - 11/08/2023, 4:00:37 PM   ERROR [JobService] Object:
immich_microservices     | {
immich_microservices     |   "id": "14ae2622-f3f3-48d1-aca3-5ae90d0b26b0"
immich_microservices     | }
immich_microservices     | 
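
The log above shows two attempts: the first load fails, the service tries to clear the cache, then retries the download and hits the same `PermissionError`. A minimal Python sketch of that retry pattern follows. The names `load_with_retry`, `loader`, `clear_cache`, and `always_denied` are hypothetical and are not taken from the Immich source; the point is only that a retry cannot help when the cache directory itself is not writable.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def load_with_retry(loader: Callable[[], T], clear_cache: Callable[[], None]) -> T:
    """Try to load once; on failure clear the cache and try exactly once more."""
    try:
        return loader()
    except Exception:
        # The first failure is treated as a possibly corrupt cache.
        clear_cache()
        # A second failure is not caught here, so it propagates to the caller.
        return loader()


def always_denied() -> str:
    # Stand-in for a download into a directory the current user cannot write to.
    raise PermissionError(13, "Permission denied", "/cache/facial-recognition")


try:
    load_with_retry(always_denied, clear_cache=lambda: None)
except PermissionError as exc:
    print(f"retry did not help: {exc}")
```

If the loader fails only once (for example because of a corrupt download), the second call succeeds and its result is returned.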

The OS that Immich Server is running on

Debian GNU/Linux 11 (bullseye)

Version of Immich Server

1.85.0

Version of Immich Mobile App

n/a

Platform with the issue

Your docker-compose.yml content

# https://github.com/immich-app/immich/releases/latest/download/docker-compose.yml

version: "3.8"

services:
  immich-server:
    container_name: immich_server
    image: ghcr.io/immich-app/immich-server:${IMMICH_VERSION:-release}
    command: ["start.sh", "immich"]
    volumes:
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
      - /etc/localtime:/etc/localtime:ro
      - /mnt/media:/mnt/media:ro
    env_file:
      - .env
    user: "${PUID}:${PGID}"
    depends_on:
      - redis
      - database
      - typesense
    restart: always

  immich-microservices:
    container_name: immich_microservices
    image: ghcr.io/immich-app/immich-server:${IMMICH_VERSION:-release}
    # extends:
    #   file: hwaccel.yml
    #   service: hwaccel
    command: ["start.sh", "microservices"]
    volumes:
      # this line needed due to non-root user
      - /usr/src/app/.reverse-geocoding-dump
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
      - /etc/localtime:/etc/localtime:ro
      - /mnt/media:/mnt/media:ro
    env_file:
      - .env
    user: "${PUID}:${PGID}"
    depends_on:
      - redis
      - database
      - typesense
    restart: always

  immich-machine-learning:
    container_name: immich_machine_learning
    image: ghcr.io/immich-app/immich-machine-learning:${IMMICH_VERSION:-release}
    volumes:
      - model-cache:/cache
    env_file:
      - .env
    user: "${PUID}:${PGID}"
    restart: always

  immich-web:
    container_name: immich_web
    image: ghcr.io/immich-app/immich-web:${IMMICH_VERSION:-release}
    env_file:
      - .env
    restart: always

  typesense:
    container_name: immich_typesense
    image: typesense/typesense:0.24.1@sha256:9bcff2b829f12074426ca044b56160ca9d777a0c488303469143dd9f8259d4dd
    environment:
      - TYPESENSE_API_KEY=${TYPESENSE_API_KEY}
      - TYPESENSE_DATA_DIR=/data
      # remove this to get debug messages
      - GLOG_minloglevel=1
    volumes:
      - tsdata:/data
    restart: always

  redis:
    container_name: immich_redis
    image: redis:6.2-alpine@sha256:70a7a5b641117670beae0d80658430853896b5ef269ccf00d1827427e3263fa3
    restart: always

  database:
    container_name: immich_postgres
    image: postgres:14-alpine@sha256:28407a9961e76f2d285dc6991e8e48893503cc3836a4755bbc2d40bcc272a441
    env_file:
      - .env
    environment:
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_USER: ${DB_USERNAME}
      POSTGRES_DB: ${DB_DATABASE_NAME}
    volumes:
      - pgdata:/var/lib/postgresql/data
    restart: always

  immich-proxy:
    container_name: immich_proxy
    image: ghcr.io/immich-app/immich-proxy:${IMMICH_VERSION:-release}
#removed with last update
#    environment:
#      # Make sure these values get passed through from the env file
#      - IMMICH_SERVER_URL
#      - IMMICH_WEB_URL
    ports:
      - 2283:8080
    depends_on:
      - immich-server
      - immich-web
    restart: always

volumes:
  pgdata:
  model-cache:
  tsdata:

Your .env content

# https://github.com/immich-app/immich/releases/latest/download/example.env

# You can find documentation for all the supported env variables at https://immich.app/docs/install/environment-variables

# The location where your uploaded files are stored
UPLOAD_LOCATION=/opt/immich/upload

# User/group to run Immich
PUID=998
PGID=998

# The Immich version to use. You can pin this to a specific version like "v1.71.0"
IMMICH_VERSION=release

# Connection secrets for postgres and typesense. You should change these to random passwords
TYPESENSE_API_KEY=<KEY>
DB_PASSWORD=<PASS>

# The values below this line do not need to be changed
###################################################################################
DB_HOSTNAME=immich_postgres
DB_USERNAME=postgres
DB_DATABASE_NAME=immich

REDIS_HOSTNAME=immich_redis

Reproduction steps

1. Pull the latest image (1.85.0)
2. Restore the DB
3. Start the Recognize Faces job from the admin screen

Additional information

I am running Immich as a non-root user, if that makes a difference.
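
To illustrate why the user matters, here is a small Python sketch of the classic POSIX owner/group/other write check, applied to the `PUID`/`PGID` of 998 from the `.env` above. It assumes the cache directory is owned by root with mode 0755; the issue does not show the actual mode, so treat that value as an assumption. `can_write` is a hypothetical helper that looks only at the write bit and ignores ACLs, capabilities, and read-only mounts.

```python
import stat


def can_write(mode: int, owner_uid: int, owner_gid: int, uid: int, gids: set[int]) -> bool:
    """Apply the owner/group/other rule for the write bit only."""
    if uid == 0:
        # root bypasses the mode bits
        return True
    if uid == owner_uid:
        return bool(mode & stat.S_IWUSR)
    if owner_gid in gids:
        return bool(mode & stat.S_IWGRP)
    return bool(mode & stat.S_IWOTH)


# Assumed mode of a root-owned cache directory (not shown in the issue).
root_owned_dir = 0o755
print(can_write(root_owned_dir, owner_uid=0, owner_gid=0, uid=998, gids={998}))      # False
print(can_write(root_owned_dir, owner_uid=998, owner_gid=998, uid=998, gids={998}))  # True
```

Under those assumptions, user 998 falls into the "other" class, which has no write bit, and that would match the `Permission denied` on `/cache/facial-recognition`.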

alextran1502 commented 1 year ago

Can you test running as root instead of as a non-root user? It looks like the container lacks permission to download the model and write it to the filesystem.

dpantel commented 1 year ago

If I switch back to the root user, the models download and work for both face and object recognition, so it's definitely a permissions issue.

I wonder if this is part of a larger permissions issue throughout the app. As I commented today on a thread about geocoding as a non-root user, I have noticed error messages there too, even after following the directions in the FAQ.

hafx commented 11 months ago

Hello,

I have the same issue as you, @dpantel (using docker-compose). I followed the steps from https://immich.app/docs/install/docker-compose and I am on the current release, v1.91.0.

I'm also using a non-root user for docker.

[Nest] 7  - 12/16/2023, 1:42:10 AM   ERROR [JobService] Unable to run job handler (clipEncoding/clip-encode): Error: Request for clip failed with status 500: Internal Server Error
[Nest] 7  - 12/16/2023, 1:42:10 AM   ERROR [JobService] Error: Request for clip failed with status 500: Internal Server Error
    at MachineLearningRepository.post (/usr/src/app/dist/infra/repositories/machine-learning.repository.js:18:19)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async SmartInfoService.handleEncodeClip (/usr/src/app/dist/domain/smart-info/smart-info.service.js:102:31)
    at async /usr/src/app/dist/domain/job/job.service.js:113:37
    at async Worker.processJob (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:387:28)
    at async Worker.retryIfFailed (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:574:24)
[Nest] 7  - 12/16/2023, 1:42:10 AM   ERROR [JobService] Object:
{
  "id": "077e2847-88fd-4067-bc6e-4dfeb4bd7a6d",
  "source": "upload"
}

What am I supposed to do? Thanks for your help.

EDIT: I had a network issue with Docker. It seems that after the image is downloaded and the container is started, immich_machine_learning still needs to download some dependencies:

Downloading README.md: 100%|██████████| 582/582 [00:00<00:00, 3.70MB/s]
Downloading .gitattributes: 100%|██████████| 1.52k/1.52k [00:00<00:00, 8.52MB/s]
Downloading model.onnx: 100%|██████████| 16.9M/16.9M [00:03<00:00, 4.94MB/s]
Downloading model.onnx: 100%|██████████| 174M/174M [00:17<00:00, 10.0MB/s]
Fetching 4 files: 100%|██████████| 4/4 [00:17<00:00,  4.47s/it]
[12/17/23 01:05:35] INFO     Loading facial recognition model 'buffalo_l'       
[12/17/23 01:05:36] INFO     Downloading clip model 'ViT-B-32__openai'.This may 
                             take a while.                                      
Downloading config.json: 100%|██████████| 196/196 [00:00<00:00, 1.15MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 472/472 [00:00<00:00, 2.10MB/s]
Downloading (…)okenizer_config.json: 100%|██████████| 704/704 [00:00<00:00, 4.32MB/s]
Downloading .gitattributes: 100%|██████████| 1.52k/1.52k [00:00<00:00, 8.60MB/s]
Downloading textual/merges.txt: 100%|██████████| 525k/525k [00:00<00:00, 8.10MB/s]
Downloading README.md: 100%|██████████| 422/422 [00:00<00:00, 1.93MB/s]
Downloading textual/tokenizer.json: 100%|██████████| 2.22M/2.22M [00:00<00:00, 6.75MB/s]
Downloading (…)/preprocess_cfg.json: 100%|██████████| 197/197 [00:00<00:00, 1.01MB/s]
Downloading textual/vocab.json: 100%|██████████| 862k/862k [00:00<00:00, 1.92MB/s]
Downloading model.onnx: 100%|██████████| 254M/254M [00:46<00:00, 5.46MB/s]
Downloading model.onnx: 100%|██████████| 351M/351M [00:53<00:00, 6.55MB/s]
Fetching 11 files: 100%|██████████| 11/11 [00:55<00:00,  5.01s/it]
[12/17/23 01:06:33] INFO     Loading clip model 'ViT-B-32__openai'              
Downloading model.onnx: 100%|██████████| 351M/351M [00:53<00:00, 10.6MB/s]

jansohn commented 11 months ago

I'm seeing the same on a fresh install running as a non-root user:

[12/26/23 21:29:19] INFO     Starting gunicorn 21.2.0
[12/26/23 21:29:19] INFO     Listening at: http://0.0.0.0:3003 (9)
[12/26/23 21:29:19] INFO     Using worker: app.config.CustomUvicornWorker
[12/26/23 21:29:19] INFO     Booting worker with pid: 13
There was a problem when trying to write in your cache folder (/cache). You should set the environment variable TRANSFORMERS_CACHE to a writable directory.
[12/26/23 21:29:23] WARNING  Matplotlib created a temporary cache directory at
                             /tmp/matplotlib-tmq8ybjj because the default path
                             (/.config/matplotlib) is not a writable directory;
                             it is highly recommended to set the MPLCONFIGDIR
                             environment variable to a writable directory, in
                             particular to speed up the import of Matplotlib and
                             to better support multiprocessing.
[12/26/23 21:29:28] INFO     Created in-memory cache with unloading after 300s
                             of inactivity.
[12/26/23 21:29:28] INFO     Initialized request thread pool with 4 threads.
[12/26/23 22:17:56] INFO     Downloading clip model 'ViT-B-32__openai'.This may
                             take a while.
[12/26/23 22:17:56] INFO     Downloading facial recognition model
                             'buffalo_l'.This may take a while.
[12/26/23 22:17:56] WARNING  Failed to load clip model
                             'ViT-B-32__openai'.Clearing cache and retrying.
[12/26/23 22:17:56] WARNING  Attempted to clear cache for model
                             'ViT-B-32__openai' but cache directory does not
                             exist.

That makes sense, as everything in the container (the mounted volume /cache, the start directory /usr/src/app, ...) is owned by root.
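
The warnings in that log suggest pointing `TRANSFORMERS_CACHE` and `MPLCONFIGDIR` at a writable directory. As a rough sketch of that idea in Python (the helper names and the fallback location are hypothetical, and this is not how Immich itself resolves its cache), one could probe a list of candidate directories and export the first one that accepts a new file:

```python
import os
import tempfile
from pathlib import Path


def first_writable_dir(candidates: list[Path]) -> Path:
    """Return the first candidate that exists (or can be created) and accepts a new file."""
    for candidate in candidates:
        try:
            candidate.mkdir(parents=True, exist_ok=True)
            # Creating a real file is a more reliable probe than reading mode bits.
            with tempfile.TemporaryFile(dir=candidate):
                pass
            return candidate
        except OSError:
            # Covers PermissionError, read-only filesystems, and similar failures.
            continue
    raise PermissionError(f"no writable directory among {candidates}")


def configure_cache_env(candidates: list[Path]) -> Path:
    """Export the two variables named in the warnings; call before importing the libraries."""
    cache_dir = first_writable_dir(candidates)
    os.environ["TRANSFORMERS_CACHE"] = str(cache_dir)
    os.environ["MPLCONFIGDIR"] = str(cache_dir / "matplotlib")
    return cache_dir
```

A call such as `configure_cache_env([Path("/cache"), Path(tempfile.gettempdir()) / "ml-cache"])` would fall back to a temporary directory when `/cache` is not writable. Models stored there would not persist across container restarts, so this only hides the symptom; fixing ownership of the mounted volume remains the actual remedy.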

mertalev commented 10 months ago

My only concern is that if we change the container to non-root, existing cache files would still be owned by root. The service should still be able to load these files (assuming the non-root user at least has read permission), but it won't be able to delete them or download new models until someone intervenes manually.

We could make this more graceful by updating the cache folder's permissions in one version and changing to non-root in a later one. The goal would be to set the permissions such that the non-root user still has read/write access to the cache folder. The limitation is that it wouldn't help much in cases where there's a custom user set for the container, or where the admin leapfrogs straight to the non-root container.

Alternatively, we could also just bite the bullet and make this change, expecting users to run a command if they run into permission issues.
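
For the "run a command" route, it can help to see first which cache entries are still owned by another user. The following Python sketch is a read-only dry run: it lists the paths under a directory whose owner differs from a given uid and changes nothing. `not_owned_by` is a hypothetical helper for illustration, not an Immich tool.

```python
import os


def not_owned_by(root: str, uid: int) -> list[str]:
    """List root and every entry below it whose owner differs from uid."""
    paths = [root]
    # os.walk does not follow symlinked directories by default.
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, name) for name in dirnames + filenames)
    # lstat inspects the entry itself rather than a symlink target.
    return sorted(p for p in paths if os.lstat(p).st_uid != uid)
```

For example, `not_owned_by("/cache", 998)` run inside the container would return the entries that user 998 does not own. An empty list means ownership already matches.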

mmomjian commented 3 months ago

https://immich.app/docs/FAQ/#how-can-i-run-immich-as-a-non-root-user

Running ML as a non-root user is supported, as outlined in the FAQ above, so I am closing this issue for now.