immich-app / immich

High performance self-hosted photo and video management solution.
https://immich.app
GNU Affero General Public License v3.0

[BUG] Typesense resource usage escalation after face merge #3799

Closed raisinbear closed 1 year ago

raisinbear commented 1 year ago

The bug

Hi,

While importing more and more images from my photo library, I'm running into Typesense-related issues when merging faces into existing ones. For reference, I'm now at 4.5k photos and 56 visible, named faces (I don't know how many faces there are in total, but I think I read something about 2200 somewhere in the output). Sometimes faces aren't recognized as an existing person, so I merge them into that person. Shortly after, I see something like the lines in the compose output below. Then the line Request to Node 0 failed due to "ECONNABORTED timeout of 10000ms exceeded" repeats endlessly until, eventually, everything is under control again.

During that time, the typesense data folder, usually around 40-70 MB in size, grows to over 2.5 GB (edit: all that additional space is occupied by the raft log) and the respective container uses a lot of CPU. I have a feeling it gets worse the more photos and faces are added, to the point where today it didn't recover even after 30+ minutes. I then restarted the stack after resetting the typesense data volume and have been careful not to do many merge operations in quick succession. This only happens when merging faces; hiding or recognizing faces doesn't seem to cause any problems.

immich_server            | [Nest] 6  - 08/20/2023, 10:17:51 AM     LOG [PersonService] Merging 6bf110c6-3ca2-42eb-9a24-f685b11cc0f1 into Julia
immich_server            | [Nest] 6  - 08/20/2023, 10:17:51 AM     LOG [PersonService] Merged 6bf110c6-3ca2-42eb-9a24-f685b11cc0f1 into Julia
immich_typesense         | I20230820 10:17:52.815428   212 raft_server.cpp:546] Term: 4, last_index index: 1045, committed_index: 1045, known_applied_index: 1045, applying_index: 0, queued_writes: 1, pending_queue_size: 0, local_sequence: 213750
immich_typesense         | I20230820 10:17:52.815778   268 raft_server.h:60] Peer refresh succeeded!
immich_microservices     | [Nest] 7  - 08/20/2023, 12:18:01 PM     LOG [SearchService] Indexing 2528 faces
immich_typesense         | I20230820 10:18:02.817299   212 raft_server.cpp:546] Term: 4, last_index index: 1048, committed_index: 1048, known_applied_index: 1047, applying_index: 1048, queued_writes: 0, pending_queue_size: 0, local_sequence: 216315
immich_typesense         | I20230820 10:18:02.822155   268 raft_server.h:60] Peer refresh succeeded!
immich_machine_learning  | INFO:     172.18.0.8:38292 - "POST /facial-recognition/detect-faces HTTP/1.1" 200 OK
immich_typesense         | I20230820 10:18:05.161114   213 batched_indexer.cpp:284] Running GC for aborted requests, req map size: 1
immich_typesense         | E20230820 10:18:06.582484    91 collection.cpp:1472] Document fetch error. Could not locate the JSON document for sequence ID: 36223
immich_typesense         | E20230820 10:18:06.583942    91 collection.cpp:1472] Document fetch error. Could not locate the JSON document for sequence ID: 36219
immich_typesense         | E20230820 10:18:06.584142    91 collection.cpp:1472] Document fetch error. Could not locate the JSON document for sequence ID: 36209
immich_typesense         | E20230820 10:18:06.584275    91 collection.cpp:1472] Document fetch error. Could not locate the JSON document for sequence ID: 36218
immich_typesense         | I20230820 10:18:12.845258   212 raft_server.cpp:546] Term: 4, last_index index: 1054, committed_index: 1054, known_applied_index: 1054, applying_index: 0, queued_writes: 8, pending_queue_size: 0, local_sequence: 216786
immich_typesense         | I20230820 10:18:12.845414   268 raft_server.h:60] Peer refresh succeeded!
immich_microservices     | Request #1692526686735: Request to Node 0 failed due to "ECONNABORTED timeout of 10000ms exceeded"
immich_microservices     | Request #1692526686735: Sleeping for 4s and then retrying request...
immich_microservices     | Request #1692526686863: Request to Node 0 failed due to "ECONNABORTED timeout of 10000ms exceeded"
immich_microservices     | Request #1692526686863: Sleeping for 4s and then retrying request...
immich_typesense         | E20230820 10:18:20.177382    96 collection.cpp:1472] Document fetch error. Could not locate the JSON document for sequence ID: 36525
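
The raft status lines in the output above carry enough information to watch the backlog build up. Below is a minimal Python sketch for summarizing a saved compose log. The field meanings are inferred from their names, not taken from Typesense documentation, and `summarize` is a hypothetical helper that is not part of Immich or Typesense.

```python
import re

# Matches the counters in Typesense's periodic raft status line, as seen in the log above.
RAFT_STATUS = re.compile(
    r"committed_index: (?P<committed>\d+), "
    r"known_applied_index: (?P<applied>\d+), "
    r"applying_index: (?P<applying>\d+), "
    r"queued_writes: (?P<queued>\d+)"
)
# Marker for the client-side timeout reported by immich_microservices.
TIMEOUT = 'failed due to "ECONNABORTED'


def summarize(lines):
    """Return (peak queued_writes, peak committed-minus-applied lag, timeout count)."""
    max_queued = 0
    max_lag = 0
    timeouts = 0
    for line in lines:
        m = RAFT_STATUS.search(line)
        if m:
            max_queued = max(max_queued, int(m["queued"]))
            max_lag = max(max_lag, int(m["committed"]) - int(m["applied"]))
        elif TIMEOUT in line:
            timeouts += 1
    return max_queued, max_lag, timeouts


if __name__ == "__main__":
    import sys

    # Assumption: the log was saved beforehand, e.g. with `docker compose logs > compose.log`.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        print(summarize(fh))
```

Run against the excerpt above, this reports a peak of 8 queued writes, a peak lag of 1, and 2 client timeouts.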

The OS that Immich Server is running on

Debian

Version of Immich Server

v.1.74.0

Version of Immich Mobile App

v.1.74.0

Platform with the issue

Your docker-compose.yml content

version: "3.8"

services:
  immich-server:
    container_name: immich_server
    image: ghcr.io/immich-app/immich-server:${IMMICH_VERSION:-release}
    command: [ "start.sh", "immich" ]
    volumes:
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
    env_file:
      - .env
    depends_on:
      - redis
      - database
      - typesense
    restart: always

  immich-microservices:
    container_name: immich_microservices
    image: ghcr.io/immich-app/immich-server:${IMMICH_VERSION:-release}
    command: [ "start.sh", "microservices" ]
    volumes:
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
    env_file:
      - .env
    depends_on:
      - redis
      - database
      - typesense
    restart: always

  immich-machine-learning:
    container_name: immich_machine_learning
    image: ghcr.io/immich-app/immich-machine-learning:${IMMICH_VERSION:-release}
    volumes:
      - model-cache:/cache
    env_file:
      - .env
    restart: always

  immich-web:
    container_name: immich_web
    image: ghcr.io/immich-app/immich-web:${IMMICH_VERSION:-release}
    env_file:
      - .env
    restart: always

  typesense:
    container_name: immich_typesense
    image: typesense/typesense:0.24.1@sha256:9bcff2b829f12074426ca044b56160ca9d777a0c488303469143dd9f8259d4dd
    environment:
      - TYPESENSE_API_KEY=${TYPESENSE_API_KEY}
      - TYPESENSE_DATA_DIR=/data
    volumes:
      - tsdata:/data
    restart: always

  redis:
    container_name: immich_redis
    image: redis:6.2-alpine@sha256:70a7a5b641117670beae0d80658430853896b5ef269ccf00d1827427e3263fa3
    restart: always

  database:
    container_name: immich_postgres
    image: postgres:14-alpine@sha256:28407a9961e76f2d285dc6991e8e48893503cc3836a4755bbc2d40bcc272a441
    env_file:
      - .env
    environment:
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_USER: ${DB_USERNAME}
      POSTGRES_DB: ${DB_DATABASE_NAME}
    volumes:
      - pgdata:/var/lib/postgresql/data
    restart: always

  immich-proxy:
    container_name: immich_proxy
    image: ghcr.io/immich-app/immich-proxy:${IMMICH_VERSION:-release}
    environment:
      # Make sure these values get passed through from the env file
      - IMMICH_SERVER_URL
      - IMMICH_WEB_URL
    ports:
      - 2283:8080
    depends_on:
      - immich-server
      - immich-web
    restart: always

volumes:
  pgdata:
  model-cache:
  tsdata:

Your .env content

# You can find documentation for all the supported env variables at https://immich.app/docs/install/environment-variables

# The location where your uploaded files are stored
UPLOAD_LOCATION=/home/immich/data

# The Immich version to use. You can pin this to a specific version like "v1.71.0"
IMMICH_VERSION=release

# Connection secrets for postgres and typesense. You should change these to random passwords
TYPESENSE_API_KEY=some-random-text
DB_PASSWORD=postgres

# The values below this line do not need to be changed
###################################################################################
DB_HOSTNAME=immich_postgres
DB_USERNAME=postgres
DB_DATABASE_NAME=immich

REDIS_HOSTNAME=immich_redis

Reproduction steps

1. Have a moderate amount of photos in your library.
2. Add more and let face recognition run.
3. Eventually, not all faces will be allocated to the correct existing person.
4. Do a number of face merges and observe compose output and typesense container behavior.

Additional information

No response

alextran1502 commented 1 year ago

After faces are merged, Typesense has to reindex all faces to update the information. If your system's CPU is on the slower side, that might take a while. We are discussing different solutions to overcome this.
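
To illustrate why several merges in quick succession add up, here is a rough Python estimate. It assumes each merge triggers exactly one full reindex of every face document, which is a simplification of the comment above and not a measured figure. `documents_written` is a hypothetical helper.

```python
def documents_written(merges: int, faces: int) -> int:
    """Upper-bound estimate, assuming every merge reindexes every face document once."""
    return merges * faces


# 2528 is the face count from the "Indexing 2528 faces" log line in the report.
for merges in (1, 10, 50):
    print(f"{merges} merges -> {documents_written(merges, 2528)} document writes")
```

Under that assumption, ten merges on this library already mean roughly 25,000 document writes queued for Typesense.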

alextran1502 commented 1 year ago

I will close this issue as this is expected. Thank you for bringing this to our attention again.

eygraber commented 1 year ago

I'm seeing something similar except my tsdata/db directory is 42GB and I see this additional line in the logs occasionally:

immich_microservices     | [Nest] 7  - 08/20/2023, 6:48:44 PM   ERROR [TypesenseRepository] Unable to index documents

raisinbear commented 1 year ago

> I'm seeing something similar except my tsdata/db directory is 42GB and I see this additional line in the logs occasionally:
>
> immich_microservices     | [Nest] 7  - 08/20/2023, 6:48:44 PM   ERROR [TypesenseRepository] Unable to index documents

Saw that line, too, at some point, but was unable to capture it when creating this issue.

@alextran1502 thanks for the info. I’m not worried about speed, rather about all the errors I’m seeing. Surprisingly, completely clearing the typesense data and restarting the stack is very fast and so far I couldn’t find anything not working with the search.

eygraber commented 1 year ago

Mine eventually resolved itself (although it happened again after merging more faces). Search seems to be kind of wonky, so maybe there's something to the errors.

raisinbear commented 1 year ago

> Mine eventually resolved itself (although it happened again after merging more faces). Search seems to be kind of wonky, so maybe there's something to the errors.

Yeah, it eventually settles.

I experienced the same thing again, and I suspect there is a bug: I thought it was coincidence before, but when the errors appear, there is a very good chance I'm left with dozens of "new" faces that I'm sure could have been matched to existing ones, or at least to each other. Before, I just merged them. Now I had a look, and most of these faces don't even have photos attached to them. Conversely, the photos these unrecognized faces belong to are correctly categorized under existing (named) faces or under at least one of the flood of new ones.

eygraber commented 1 year ago

I left mine running overnight and now tsdata is up to 32GB and the website doesn't load for a couple of minutes (503: Not Ready or Lagging).
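
A quick way to tell whether Typesense has caught up is to query its `/health` endpoint, which answers with `{"ok": true}` when the node is ready. The Python sketch below is illustrative only: the URL `http://typesense:8108` is an assumption (the compose service name plus Typesense's default API port, reachable only from a container on the same network, since the compose file above does not publish that port), and `parse_health` and `check` are hypothetical helpers.

```python
import json
import urllib.error
import urllib.request


def parse_health(status: int, body: bytes) -> bool:
    """True only for HTTP 200 with a JSON body of {"ok": true}."""
    if status != 200:
        return False
    try:
        data = json.loads(body)
    except ValueError:
        return False
    return isinstance(data, dict) and data.get("ok") is True


def check(base_url: str, timeout: float = 5.0) -> bool:
    """Query <base_url>/health once; any network or HTTP error counts as unhealthy."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return parse_health(resp.status, resp.read())
    except (urllib.error.URLError, OSError):
        # A 503 such as "Not Ready or Lagging" raises HTTPError, a URLError subclass.
        return False


if __name__ == "__main__":
    # Assumed address; adjust to wherever the Typesense API is reachable.
    print("healthy" if check("http://typesense:8108") else "not ready")
```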

raisinbear commented 1 year ago

> I left mine running overnight and now tsdata is up to 32GB and the website doesn't load for a couple of minutes (503: Not Ready or Lagging).

Got this, too, now. Did you manage to resolve it or did it return to a working state again by itself?

eygraber commented 1 year ago

It kept growing no matter what I did. Then I scanned for missing faces in the admin jobs screen, and that knocked it back down to ~6 GB.

gcarrarom commented 1 year ago

Just putting a note out here as my data also grew from 200 MB to 10 GiB overnight and is still growing. I guess it should be fine once it's done, but I definitely didn't account for that. It would probably be good to add that somewhere in the documentation; what do you think?
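
For anyone who wants to keep an eye on this kind of growth, here is a minimal Python sketch that totals the size of a directory tree. The path passed on the command line is an assumption: it should point at wherever the `tsdata` volume is mounted on the host, which depends on the Docker setup. `dir_size_bytes` and `human` are hypothetical helper names.

```python
import os


def dir_size_bytes(path: str) -> int:
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                # Skip symlinks so nothing outside the tree is counted.
                if not os.path.islink(full):
                    total += os.path.getsize(full)
            except OSError:
                # Files can vanish mid-scan while Typesense is writing.
                continue
    return total


def human(n: int) -> str:
    """Format a byte count with binary units, capped at GiB."""
    size = float(n)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024 or unit == "GiB":
            return f"{size:.1f} {unit}"
        size /= 1024


if __name__ == "__main__":
    import sys

    # Assumption: argv[1] is the host path of the tsdata volume.
    print(human(dir_size_bytes(sys.argv[1])))
```

Running it from cron or a shell loop and logging the result gives a simple growth curve to compare against merge activity.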

raisinbear commented 1 year ago

At least for my specific case, I found a solution. Opened discussion #3861 to get some more opinions 😅