immich-app / immich

High performance self-hosted photo and video management solution.
https://immich.app

[BUG] Thumbnail and metadata extraction timeouts when uploading several images at once #1967

Closed: raisinbear closed this issue 11 months ago

raisinbear commented 1 year ago

The bug

Hi,

As far as I can tell, this is new in one of the more recent releases, as I hadn't encountered this issue before. In short: when uploading several images at once (tested with 10 .heic photos, for instance) via the mobile app, CLI, or web, I get quite a number of timeout errors from the microservices container à la:

immich-immich-microservices-1  | [Nest] 1  - 03/07/2023, 12:46:24 PM   ERROR [MediaService] Failed to generate jpeg thumbnail for asset: f0929d05-f79a-4917-ad82-2cd9d9c45998
immich-immich-microservices-1  | Error: Connection terminated due to connection timeout
immich-immich-microservices-1  |     at Connection.<anonymous> (/usr/src/app/node_modules/pg/lib/client.js:132:73)
immich-immich-microservices-1  |     at Object.onceWrapper (node:events:641:28)
immich-immich-microservices-1  |     at Connection.emit (node:events:527:28)
immich-immich-microservices-1  |     at Socket.<anonymous> (/usr/src/app/node_modules/pg/lib/connection.js:57:12)
immich-immich-microservices-1  |     at Socket.emit (node:events:527:28)
immich-immich-microservices-1  |     at TCP.<anonymous> (node:net:709:12)
immich-immich-microservices-1  | [Nest] 1  - 03/07/2023, 12:45:40 PM   ERROR [MetadataExtractionProcessor] Error extracting EXIF Error: Connection terminated due to connection timeout
immich-immich-microservices-1  | Error: Connection terminated due to connection timeout
immich-immich-microservices-1  |     at Connection.<anonymous> (/usr/src/app/node_modules/pg/lib/client.js:132:73)
immich-immich-microservices-1  |     at Object.onceWrapper (node:events:641:28)
immich-immich-microservices-1  |     at Connection.emit (node:events:527:28)
immich-immich-microservices-1  |     at Socket.<anonymous> (/usr/src/app/node_modules/pg/lib/connection.js:57:12)
immich-immich-microservices-1  |     at Socket.emit (node:events:527:28)
immich-immich-microservices-1  |     at TCP.<anonymous> (node:net:709:12)
immich_postgres                | 2023-03-07 12:45:44.963 UTC [1598] LOG:  could not receive data from client: Connection reset by peer
immich_postgres                | 2023-03-07 12:45:44.964 UTC [1599] LOG:  could not receive data from client: Connection reset by peer

Once CPU load goes down, a lot of thumbnails and metadata are missing. I assume this is partly down to my server being generally slow to process the images and/or utilized by other services at the time. At the same time, the timeout it is running into seems overly strict; the server isn't really that slow 😅. Also, even when a lot of thumbnails / metadata are still missing, I can trigger creation in the web UI and that always succeeds.

I'll try to formulate my questions / suggestions coherently:

Thank you guys so much!

The OS that Immich Server is running on

Raspbian Buster (32-bit)

Version of Immich Server

v1.50.1

Version of Immich Mobile App

v1.50.0

Platform with the issue

Your docker-compose.yml content

version: "3.8"

services:
  immich-server:
    container_name: immich_server
    image: altran1502/immich-server:release
    entrypoint: [ "/bin/sh", "./start-server.sh" ]
    volumes:
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
    env_file:
      - .env
    environment:
      - NODE_ENV=production
    depends_on:
      - redis
      - database
    restart: always

  immich-microservices:
    container_name: immich_microservices
    image: altran1502/immich-server:release
    entrypoint: [ "/bin/sh", "./start-microservices.sh" ]
    volumes:
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
    env_file:
      - .env
    environment:
      - NODE_ENV=production
    depends_on:
      - redis
      - database
    restart: always

  immich-web:
    container_name: immich_web
    image: altran1502/immich-web:release
    entrypoint: [ "/bin/sh", "./entrypoint.sh" ]
    env_file:
      - .env
    restart: always

  redis:
    container_name: immich_redis
    image: redis:6.2
    restart: always

  database:
    container_name: immich_postgres
    image: postgres:14
    env_file:
      - .env
    environment:
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_USER: ${DB_USERNAME}
      POSTGRES_DB: ${DB_DATABASE_NAME}
      PG_DATA: /var/lib/postgresql/data
    volumes:
      - pgdata:/var/lib/postgresql/data
    restart: always

  immich-proxy:
    container_name: immich_proxy
    image: altran1502/immich-proxy:release
    environment:
      # Make sure these values get passed through from the env file
      - IMMICH_SERVER_URL
      - IMMICH_WEB_URL
    ports:
      - 2283:8080
    logging:
      driver: none
    depends_on:
      - immich-server
    restart: always

volumes:
  pgdata:

Your .env content

###################################################################################
# Database
###################################################################################

DB_HOSTNAME=immich_postgres
DB_USERNAME=postgres
DB_PASSWORD=postgres
DB_DATABASE_NAME=immich

# Optional Database settings:
# DB_PORT=5432

###################################################################################
# Redis
###################################################################################

REDIS_HOSTNAME=immich_redis

# Optional Redis settings:
# REDIS_PORT=6379
# REDIS_DBINDEX=0
# REDIS_PASSWORD=
# REDIS_SOCKET=

###################################################################################
# Upload File Location
#
# This is the location where uploaded files are stored.
###################################################################################

UPLOAD_LOCATION=/home/immich/data

###################################################################################
# Reverse Geocoding
#
# Reverse geocoding is done locally which has a small impact on memory usage
# This memory usage can be altered by changing the REVERSE_GEOCODING_PRECISION variable
# This ranges from 0-3 with 3 being the most precise
# 3 - Cities > 500 population: ~200MB RAM
# 2 - Cities > 1000 population: ~150MB RAM
# 1 - Cities > 5000 population: ~80MB RAM
# 0 - Cities > 15000 population: ~40MB RAM
####################################################################################

# DISABLE_REVERSE_GEOCODING=false
# REVERSE_GEOCODING_PRECISION=3

####################################################################################
# WEB - Optional
#
# Custom message on the login page, should be written in HTML form.
# For example:
# PUBLIC_LOGIN_PAGE_MESSAGE="This is a demo instance of Immich.<br><br>Email: <i>demo@demo.de</i><br>Password: <i>demo</i>"
####################################################################################

PUBLIC_LOGIN_PAGE_MESSAGE=

####################################################################################
# Alternative Service Addresses - Optional
#
# This is an advanced feature for users who may be running their immich services on different hosts.
# It will not change which address or port that services bind to within their containers, but it will change where other services look for their peers.
# Note: immich-microservices is bound to 3002, but no references are made
####################################################################################

IMMICH_WEB_URL=http://immich-web:3000
IMMICH_SERVER_URL=http://immich-server:3001
IMMICH_MACHINE_LEARNING_URL=http://immich-machine-learning:3003

####################################################################################
# Alternative API's External Address - Optional
#
# This is an advanced feature used to control the public server endpoint returned to clients during Well-known discovery.
# You should only use this if you want mobile apps to access the immich API over a custom URL. Do not include trailing slash.
# NOTE: At this time, the web app will not be affected by this setting and will continue to use the relative path: /api
# Examples: http://localhost:3001, http://immich-api.example.com, etc
####################################################################################

#IMMICH_API_URL_EXTERNAL=http://localhost:3001

Reproduction steps

1. Set up Immich on a not-too-fast machine 😉
2. Upload many images at once (mobile, web, CLI, however you like).
3. Observe timeout errors in the log during metadata extraction / thumbnail creation.

Additional information

No response

alextran1502 commented 1 year ago

Is the issue reproducible multiple times?

alextran1502 commented 1 year ago

We didn't change anything related to the thumbnail generation mechanism. It might be related to communication with the database, judging from the message you shared:

immich_postgres                | 2023-03-07 12:45:44.963 UTC [1598] LOG:  could not receive data from client: Connection reset by peer
immich_postgres                | 2023-03-07 12:45:44.964 UTC [1599] LOG:  could not receive data from client: Connection reset by peer

jrasm91 commented 1 year ago

Yeah, it looks like the thumbnail itself is generated successfully, but saving the record to the database fails with a timeout error.

[screenshot of the relevant thumbnail-generation code]
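
For context, the flow shown in that screenshot is roughly the following; this is a hedged paraphrase, assuming sharp for resizing and a TypeORM repository save afterwards, with illustrative names, dimensions, and entity shape rather than the exact source:

```typescript
import sharp from 'sharp';
import { Repository } from 'typeorm';

// Illustrative shape of the asset entity and its dependencies.
interface AssetEntity { id: string; originalPath: string; resizePath?: string; }

async function generateJpegThumbnail(
  asset: AssetEntity,
  jpegThumbnailPath: string,
  assetRepository: Repository<AssetEntity>,
): Promise<void> {
  // The resize itself is guarded by its own try/catch, so a sharp failure
  // would surface as a logged warning, which raisinbear reports never seeing.
  try {
    await sharp(asset.originalPath)
      .resize(1440, 2560, { fit: 'inside' }) // illustrative dimensions
      .jpeg()
      .toFile(jpegThumbnailPath);
  } catch (error) {
    console.warn(`Failed to resize ${asset.id}: ${error}`);
  }

  // The "Connection terminated due to connection timeout" comes from here:
  // updating the asset row in postgres, after the thumbnail already exists on disk.
  asset.resizePath = jpegThumbnailPath;
  await assetRepository.save(asset);
}
```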

raisinbear commented 1 year ago

We didn't change anything related to generating the thumbnails mechanism. It might be related to communicating with the database from the message you shared

immich_postgres                | 2023-03-07 12:45:44.963 UTC [1598] LOG:  could not receive data from client: Connection reset by peer
immich_postgres                | 2023-03-07 12:45:44.964 UTC [1599] LOG:  could not receive data from client: Connection reset by peer

Yes, it's reproducible. At first I thought it was a hiccup and that my server was under load from elsewhere, but I tried multiple times and it always happens when uploading many images at once (I don't know where the limit is exactly). About the postgres message in the last two lines, I'll check again tomorrow and see if I can supply more info. But apart from this issue, everything is running smoothly and I didn't notice anything pointing to a problem with the database or database container.

raisinbear commented 1 year ago

Ok, I did some more tests. I can reliably reproduce this on a 24-core Threadripper machine in a freshly set up Debian VM as well, by running a simple `stress -c 24` during upload. When the CPUs are idle (on the Threadripper machine) during import, thumbnail generation and metadata extraction run as expected without issue.

I also tried to get more from the logs, but this is all I get:

More log lines:

```
immich-immich-microservices-1  | [Nest] 1  - 03/08/2023, 5:32:43 AM   ERROR [MetadataExtractionProcessor] Error extracting EXIF Error: Connection terminated due to connection timeout
immich-immich-microservices-1  | Error: Connection terminated due to connection timeout
immich-immich-microservices-1  |     at Connection.<anonymous> (/usr/src/app/node_modules/pg/lib/client.js:132:73)
immich-immich-microservices-1  |     at Object.onceWrapper (node:events:641:28)
immich-immich-microservices-1  |     at Connection.emit (node:events:527:28)
immich-immich-microservices-1  |     at Socket.<anonymous> (/usr/src/app/node_modules/pg/lib/connection.js:57:12)
immich-immich-microservices-1  |     at Socket.emit (node:events:527:28)
immich-immich-microservices-1  |     at TCP.<anonymous> (node:net:709:12)
immich-immich-microservices-1  | [Nest] 1  - 03/08/2023, 5:32:46 AM   ERROR [MetadataExtractionProcessor] Error extracting EXIF Error: Connection terminated due to connection timeout
immich-immich-microservices-1  | Error: Connection terminated due to connection timeout
immich-immich-microservices-1  |     at Connection.<anonymous> (/usr/src/app/node_modules/pg/lib/client.js:132:73)
immich-immich-microservices-1  |     at Object.onceWrapper (node:events:641:28)
immich-immich-microservices-1  |     at Connection.emit (node:events:527:28)
immich-immich-microservices-1  |     at Socket.<anonymous> (/usr/src/app/node_modules/pg/lib/connection.js:57:12)
immich-immich-microservices-1  |     at Socket.emit (node:events:527:28)
immich-immich-microservices-1  |     at TCP.<anonymous> (node:net:709:12)
immich_postgres                | 2023-03-08 05:32:51.830 UTC [109] LOG:  could not receive data from client: Connection reset by peer
immich-immich-microservices-1  | [Nest] 1  - 03/08/2023, 5:32:57 AM   ERROR [MetadataExtractionProcessor] Error extracting EXIF Error: Connection terminated due to connection timeout
immich-immich-microservices-1  | Error: Connection terminated due to connection timeout
immich-immich-microservices-1  |     at Connection.<anonymous> (/usr/src/app/node_modules/pg/lib/client.js:132:73)
immich-immich-microservices-1  |     at Object.onceWrapper (node:events:641:28)
immich-immich-microservices-1  |     at Connection.emit (node:events:527:28)
immich-immich-microservices-1  |     at Socket.<anonymous> (/usr/src/app/node_modules/pg/lib/connection.js:57:12)
immich-immich-microservices-1  |     at Socket.emit (node:events:527:28)
immich-immich-microservices-1  |     at TCP.<anonymous> (node:net:709:12)
immich_postgres                | 2023-03-08 05:32:57.690 UTC [110] LOG:  could not receive data from client: Connection reset by peer
immich_postgres                | 2023-03-08 05:32:57.693 UTC [112] LOG:  could not receive data from client: Connection reset by peer
immich_postgres                | 2023-03-08 05:32:57.698 UTC [111] LOG:  could not receive data from client: Connection reset by peer
immich_postgres                | 2023-03-08 05:32:57.716 UTC [113] LOG:  could not receive data from client: Connection reset by peer
immich-immich-microservices-1  | [Nest] 1  - 03/08/2023, 5:32:58 AM   ERROR [MediaService] Failed to generate jpeg thumbnail for asset: 5f49be63-dc67-49e0-a98a-8bddff6a49f4
immich-immich-microservices-1  | Error: Connection terminated due to connection timeout
immich-immich-microservices-1  |     at Connection.<anonymous> (/usr/src/app/node_modules/pg/lib/client.js:132:73)
immich-immich-microservices-1  |     at Object.onceWrapper (node:events:641:28)
immich-immich-microservices-1  |     at Connection.emit (node:events:527:28)
immich-immich-microservices-1  |     at Socket.<anonymous> (/usr/src/app/node_modules/pg/lib/connection.js:57:12)
immich-immich-microservices-1  |     at Socket.emit (node:events:527:28)
immich-immich-microservices-1  |     at TCP.<anonymous> (node:net:709:12)
immich-immich-microservices-1  | [Nest] 1  - 03/08/2023, 5:32:58 AM   ERROR [MetadataExtractionProcessor] Error extracting EXIF Error: Connection terminated due to connection timeout
immich-immich-microservices-1  | Error: Connection terminated due to connection timeout
immich-immich-microservices-1  |     at Connection.<anonymous> (/usr/src/app/node_modules/pg/lib/client.js:132:73)
immich-immich-microservices-1  |     at Object.onceWrapper (node:events:641:28)
immich-immich-microservices-1  |     at Connection.emit (node:events:527:28)
immich-immich-microservices-1  |     at Socket.<anonymous> (/usr/src/app/node_modules/pg/lib/connection.js:57:12)
immich-immich-microservices-1  |     at Socket.emit (node:events:527:28)
immich-immich-microservices-1  |     at TCP.<anonymous> (node:net:709:12)
immich-immich-microservices-1  | [Nest] 1  - 03/08/2023, 5:32:58 AM   ERROR [MediaService] Failed to generate jpeg thumbnail for asset: f4943d55-c87e-4b56-88fd-fe866ee2c534
immich-immich-microservices-1  | Error: Connection terminated due to connection timeout
immich-immich-microservices-1  |     at Connection.<anonymous> (/usr/src/app/node_modules/pg/lib/client.js:132:73)
immich-immich-microservices-1  |     at Object.onceWrapper (node:events:641:28)
immich-immich-microservices-1  |     at Connection.emit (node:events:527:28)
immich-immich-microservices-1  |     at Socket.<anonymous> (/usr/src/app/node_modules/pg/lib/connection.js:57:12)
immich-immich-microservices-1  |     at Socket.emit (node:events:527:28)
immich-immich-microservices-1  |     at TCP.<anonymous> (node:net:709:12)
immich-immich-microservices-1  | [Nest] 1  - 03/08/2023, 5:32:58 AM   ERROR [MetadataExtractionProcessor] Error extracting EXIF Error: Connection terminated due to connection timeout
immich-immich-microservices-1  | Error: Connection terminated due to connection timeout
immich-immich-microservices-1  |     at Connection.<anonymous> (/usr/src/app/node_modules/pg/lib/client.js:132:73)
immich-immich-microservices-1  |     at Object.onceWrapper (node:events:641:28)
immich-immich-microservices-1  |     at Connection.emit (node:events:527:28)
immich-immich-microservices-1  |     at Socket.<anonymous> (/usr/src/app/node_modules/pg/lib/connection.js:57:12)
immich-immich-microservices-1  |     at Socket.emit (node:events:527:28)
immich-immich-microservices-1  |     at TCP.<anonymous> (node:net:709:12)
```

@jrasm91, thanks for the inquiry. Right, as far as I understand those lines, any error during the sharp resizing process would be covered by the catch above, and I never see that warning. However, I can't seem to follow the code much further to understand what exactly is supposed to happen during the .save() call and where the timeout could originate from.

I don't know if it helps, but it seems that webp thumbnails are created successfully.

Thanks for the help!

raisinbear commented 1 year ago

Brief update: I was trying to understand how the processing works. I can't say I fully do, but I changed the concurrency setting in server/apps/microservices/src/processor.ts for the JobName.GENERATE_JPEG_THUMBNAIL and JobName.GENERATE_WEBP_THUMBNAIL processes from 3 to 1 (lines 116 and 121 in current main):

[screenshot of the processor.ts change]

Also, I introduced a probably redundant concurrency: 1 in line 151 of [...]/src/processors/metadata-extraction.processor.ts:

[screenshot of the metadata-extraction.processor.ts change]
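
In TypeScript terms, the edit described above amounts to something like the following sketch of the v1.50-era processor, reconstructed from the compiled output quoted later in this thread; the import path, job payload types, and the calls into MediaService are assumptions:

```typescript
import { Process, Processor } from '@nestjs/bull';
import { Job } from 'bull';
// Assumed import path for immich's domain layer; the compiled output
// below references these as domain_1.JobName, domain_1.QueueName, etc.
import { JobName, QueueName, MediaService } from '@app/domain';

@Processor(QueueName.THUMBNAIL_GENERATION)
export class ThumbnailGeneratorProcessor {
  constructor(private mediaService: MediaService) {}

  // Was concurrency: 3; lowered to 1 for the experiment described above.
  @Process({ name: JobName.GENERATE_JPEG_THUMBNAIL, concurrency: 1 })
  async handleGenerateJpegThumbnail(job: Job) {
    await this.mediaService.handleGenerateJpegThumbnail(job.data);
  }

  // Likewise lowered from 3 to 1.
  @Process({ name: JobName.GENERATE_WEBP_THUMBNAIL, concurrency: 1 })
  async handleGenerateWepbThumbnail(job: Job) {
    await this.mediaService.handleGenerateWepbThumbnail(job.data);
  }
}
```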

I transferred these changes directly into the corresponding .js files in the microservices container on my Raspberry Pi and uploaded 16+ images at once, the same sequence of images that always failed before, several times (deleting them in between). No timeouts 😀. I have no idea whether this is just treating the symptom; it doesn't seem like a viable root-cause fix. But thumbnail generation and metadata extraction now even succeed while running a `stress -c 4` plus forced heavy traffic from another dockerized service. The latter is anecdotal, as I only tried it once, but before, roughly but reliably a quarter of the jobs never completed even with all other services shut down.

Does that make any sense to you?

raisinbear commented 1 year ago

Dove a little deeper: as per the bull documentation, concurrencies stack up:

```js
/**
 * For each named processor, concurrency stacks up, so any of these three process functions
 * can run with a concurrency of 125. To avoid this behaviour you need to create an own queue
 * for each process function.
 */
const loadBalancerQueue = new Queue('loadbalancer');
loadBalancerQueue.process('requestProfile', 100, requestProfile);
loadBalancerQueue.process('sendEmail', 25, sendEmail);
loadBalancerQueue.process('sendInvitation', 0, sendInvite);
```

That means before my change it was doing 6 thumbnail generations in parallel, plus 4 metadata extractions if I calculated correctly (a missing concurrency specifier defaults to 1, per the docs), plus 2 video transcodings if there are any (there weren't in the tests above). I checked via an added logger.warn() line in media.service.ts, and indeed, with my double concurrency: 1 modification, two thumbnail generations run in parallel. If I set concurrency to 0 in line 121, thumbnail generations happen one after the other. Together with concurrency: 1 on videos, this actually gave me an overall speedup of 30% over the modification above with video concurrency: 2 (this time 2 videos + 16 images). I still don't know where the timeout originates from. I could speculate that jobs begin to stall because too many run in parallel on a machine without enough resources, but this is just guesswork.
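
To make the stacking rule concrete, here is a minimal standalone bull sketch (hypothetical queue and job names, not immich's actual ones): per-name concurrencies on one queue are summed, so registering one processor with 1 and another with 0 keeps the queue's total at 1, which is the trick described above.

```typescript
import Bull from 'bull';

// Hypothetical queue: bull sums the concurrency of all named processors
// registered on the same queue instance.
const thumbQueue = new Bull('thumbnail-generation', 'redis://127.0.0.1:6379');

// Effective concurrency for this queue is 1 + 0 = 1, so jpeg and webp
// jobs run strictly one at a time across both names.
thumbQueue.process('jpeg', 1, async (job: Bull.Job) => {
  console.log(`jpeg thumbnail for ${job.data.assetId}`);
});
thumbQueue.process('webp', 0, async (job: Bull.Job) => {
  console.log(`webp thumbnail for ${job.data.assetId}`);
});
```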

penguinsam commented 1 year ago

Timeouts happen in my instance too. It would be good if these values could be set via env.

EnochPrime commented 1 year ago

I'm running into these timeouts as well very consistently. Can confirm that bull stacks the concurrency. On v1.52.0 it says 7 thumbnail tasks are running.

EnochPrime commented 1 year ago

According to Discord user mudone, these errors may be a result of the database connection timeout.

https://github.com/immich-app/immich/blob/c584791b65c88bfc327cfbc55407502362897f14/server/libs/infra/src/database.config.ts#L21
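
For anyone who wants to experiment, raising that timeout looks roughly like the following sketch; this is not immich's exact file, the option name is TypeORM's postgres `connectTimeoutMS`, and the 60s value mirrors what the Discord user reportedly had success with:

```typescript
import { DataSourceOptions } from 'typeorm';

// Sketch of a TypeORM postgres config with a raised connection timeout.
export const databaseConfig: DataSourceOptions = {
  type: 'postgres',
  host: process.env.DB_HOSTNAME,
  port: Number(process.env.DB_PORT ?? 5432),
  username: process.env.DB_USERNAME,
  password: process.env.DB_PASSWORD,
  database: process.env.DB_DATABASE_NAME,
  // How long the pg driver waits when establishing a connection before
  // failing with "Connection terminated due to connection timeout".
  connectTimeoutMS: 60_000,
};
```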

raisinbear commented 1 year ago

According to Discord user mudone, these errors may be a result of the database connection timeout.

https://github.com/immich-app/immich/blob/c584791b65c88bfc327cfbc55407502362897f14/server/libs/infra/src/database.config.ts#L21

I tried changing that setting too, but raising the timeout didn't do anything for me. The timeout error might be symptomatic? I don't understand the exact reason, but with lowered concurrency the errors don't occur on my instance.

EnochPrime commented 1 year ago

The user on Discord had success with a 60s timeout, but I agree that it is probably more of a symptom. If things are running smoothly, 10s should be plenty of time.

jrasm91 commented 1 year ago

Maybe it's related to the CPU being swamped by the microservices container, and throttling its usage would help prevent the issue.

raisinbear commented 1 year ago

Maybe it's related to the CPU being swamped by the microservices container, and throttling its usage would help prevent the issue.

Right. How would you go about this other than lowering concurrency? At least for me there are no other services running anymore, but apparently 7 thumbnail creations plus the "small" stuff like metadata extraction in parallel is enough to exhaust the CPU :/ even without videos coming in, which are processed in pairs by default, too.

jrasm91 commented 1 year ago

https://docs.docker.com/compose/compose-file/compose-file-v3/#resources
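
In compose v3 terms (matching the file quoted at the top of this issue), that could look something like the sketch below. The `cpus`/`memory` values are placeholders to tune for your host, and note that classic docker-compose only applies `deploy.resources` limits outside swarm mode when run with the `--compatibility` flag:

```yaml
  immich-microservices:
    container_name: immich_microservices
    image: altran1502/immich-server:release
    entrypoint: [ "/bin/sh", "./start-microservices.sh" ]
    deploy:
      resources:
        limits:
          cpus: "2.0"    # placeholder: leave headroom for postgres/redis
          memory: 2048M  # placeholder
```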

EnochPrime commented 1 year ago

My microservices container has been running restricted, but I lessened these errors by expanding the resources available. I was not running into this before v1.50.

EnochPrime commented 1 year ago

That being said, I should probably run a test with nothing else running to make sure it is not a case of other services competing for CPU cycles.

raisinbear commented 1 year ago

https://docs.docker.com/compose/compose-file/compose-file-v3/#resources

Wow, I didn't even think of that 🙈. I will try it, but as @EnochPrime reports, it doesn't seem to resolve the issue and might actually make it worse. Could it have to do with stalling of the jobs? Sadly, I have no experience with bull, so I'm merely guessing from what I find 😐

EnochPrime commented 1 year ago

I updated to v1.53.0 and also deployed to a node with more available resources. I am still seeing these errors, but the microservices container has not shut down and it appears to be making progress.

rhullah commented 1 year ago

I recently upgraded from v1.51.2 to v1.53.0 and ran the Generate Thumbs job due to the recent change in folder structure, and I'm seeing these errors too. I also now have a bunch of missing thumbnails and full-size images due to these errors. Is there anything I can do to ensure the jobs don't time out and instead succeed? I'm also on a Raspberry Pi, so resources might be limited, but I didn't see much stress on the system while the job was running. I'm wondering if my issue is more of a slow-to-write storage path than a resource (CPU/RAM) issue.

rhullah commented 1 year ago

I'm wondering if my issue is more of a slow-to-write storage path than a resource (CPU/RAM) issue.

I'm no longer sure it's a slow storage location issue. I've volume-mapped a much faster location for the thumbs/... path and I'm still receiving the "Connection terminated due to connection timeout" error response, which comes with the "Failed to generate thumbnail for asset" error message.

Gatherix commented 1 year ago

I resolved this issue by deploying on my desktop, which compared to the previous machine has the same memory but many more CPU resources available. All files remained on the previous machine and were accessed/written via a network share. So this seems CPU-bound instead of storage-related. Generating ~10k thumbnails took several hours of moderate CPU usage. Prior to using my desktop, I saw the same behavior as others: failed thumbnails, connection timeouts, and a persistently crashing microservices container.

rhullah commented 1 year ago

My CPU sits there with hardly any usage while still getting these errors. It's as if Postgres just fell asleep or something, because the timeouts are coming from the pg client:

Error: Connection terminated due to connection timeout
  at Connection.<anonymous> (/usr/src/app/node_modules/pg/lib/client.js:132:73)
  at Object.onceWrapper (node:events:641:28)
  at Connection.emit (node:events:527:28)
  at Socket.<anonymous> (/usr/src/app/node_modules/pg/lib/connection.js:57:12)
  at Socket.emit (node:events:527:28)
  at TCP.<anonymous> (node:net:709:12)

The other thing that confuses me is that attempting to Generate Thumbnails for only those that are missing seems to do nothing. It's as if the ones that error are still getting marked as completed, because nothing seems to run when I click the "Missing" button for the Generate Thumbnails job.

raisinbear commented 1 year ago

@rhullah, as I think I wrote further up, I could only keep this in check with manual changes to the .js files in the microservices container to lower the overpowering level of concurrency. Having done that, I never hit the issue again, even on a Raspberry Pi 2. However, this is only a temporary fix and the opposite of set-and-forget, as recreating or updating the container will undo the modifications. A stronger machine definitely helps, but I also experienced it on a Raspberry Pi 4 a couple of times with the stock settings.

Gatherix commented 1 year ago

Do you get any successful thumbnails before the failures start @rhullah? I similarly saw little CPU usage when getting the errors and a seemingly useless "Missing" button.

rhullah commented 1 year ago

Do you get any successful thumbnails before the failures start @rhullah? I similarly saw little CPU usage when getting the errors and a seemingly useless "Missing" button.

It seemed to generate a few thumbs successfully, then it would consistently hit the timeout and throw error logs. Then, after a longer time, Postgres would seem to wake up and it would start successfully creating thumbs again. As a result, some images in Immich are missing thumbs (on the main library page) and missing the detailed image (when clicking on a specific item).

alextran1502 commented 1 year ago

This is an issue that appeared after we added Typesense and rewrote the machine learning service in Python, with the combined CPU usage of machine learning + video transcoding + thumbnail generation. If your CPU is not powerful enough, the running processes hog it and cannot complete in time (hence the timeout notification). I am trying to think about how to manage the queue better, to help alleviate this issue and let slower/less powerful devices run all the jobs successfully, even with a slower completion time.

rhullah commented 1 year ago

This is an issue that appeared after we added Typesense and rewrote the machine learning service in Python, with the combined CPU usage of machine learning + video transcoding + thumbnail generation.

Would this be the case even if I have machine learning disabled? Because I do. I was getting restarts of the Machine Learning container (before I ran the template path job), so I disabled that container in the compose file and set it to false in the .env file.

And, does video transcoding occur in the "Generate Thumbnails" job? I'm not uploading new assets, only trying to "fix" the template paths so that they are in the new location.

rhullah commented 1 year ago

@rhullah, as I think I wrote further up, I could only keep this in check with manual changes to the .js files in the microservices container to lower the overpowering level of concurrency. Having done that, I never hit the issue again, even on a Raspberry Pi 2. However, this is only a temporary fix and the opposite of set-and-forget, as recreating or updating the container will undo the modifications. A stronger machine definitely helps, but I also experienced it on a Raspberry Pi 4 a couple of times with the stock settings.

Yeah, I did notice that. I wasn't sure which file(s) to update where, but I was trying to look into it. I wouldn't mind changing it, even temporarily, just to get past this update of the new template paths.

raisinbear commented 1 year ago

@rhullah, as I think I wrote further up, I could only keep this in check with manual changes to the .js files in the microservices container to lower the overpowering level of concurrency. Having done that, I never hit the issue again, even on a Raspberry Pi 2. However, this is only a temporary fix and the opposite of set-and-forget, as recreating or updating the container will undo the modifications. A stronger machine definitely helps, but I also experienced it on a Raspberry Pi 4 a couple of times with the stock settings.

Yeah, I did notice that. I wasn't sure which file(s) to update where, but I was trying to look into it. I wouldn't mind changing it, even temporarily, just to get past this update of the new template paths.

If you’re interested in tinkering, some of the parallelism settings are in here: immich_microservices:/usr/src/app/dist/apps/microservices/apps/microservices/src/processors.js

The lower part of this file looks as follows for me:

```js
__decorate([
    (0, bull_1.Process)({ name: domain_1.JobName.QUEUE_GENERATE_THUMBNAILS, concurrency: 1 }),
    __metadata("design:type", Function),
    __metadata("design:paramtypes", [Object]),
    __metadata("design:returntype", Promise)
], ThumbnailGeneratorProcessor.prototype, "handleQueueGenerateThumbnails", null);
__decorate([
    (0, bull_1.Process)({ name: domain_1.JobName.GENERATE_JPEG_THUMBNAIL, concurrency: 0 }),
    __metadata("design:type", Function),
    __metadata("design:paramtypes", [Object]),
    __metadata("design:returntype", Promise)
], ThumbnailGeneratorProcessor.prototype, "handleGenerateJpegThumbnail", null);
__decorate([
    (0, bull_1.Process)({ name: domain_1.JobName.GENERATE_JPEG_THUMBNAIL_DC, concurrency: 0 }),
    __metadata("design:type", Function),
    __metadata("design:paramtypes", [Object]),
    __metadata("design:returntype", Promise)
], ThumbnailGeneratorProcessor.prototype, "handleGenerateJpegThumbnail_dc", null);
__decorate([
    (0, bull_1.Process)({ name: domain_1.JobName.GENERATE_WEBP_THUMBNAIL, concurrency: 0 }),
    __metadata("design:type", Function),
    __metadata("design:paramtypes", [Object]),
    __metadata("design:returntype", Promise)
], ThumbnailGeneratorProcessor.prototype, "handleGenerateWepbThumbnail", null);
__decorate([
    (0, bull_1.Process)({ name: domain_1.JobName.GENERATE_WEBP_THUMBNAIL_DC, concurrency: 0 }),
    __metadata("design:type", Function),
    __metadata("design:paramtypes", [Object]),
    __metadata("design:returntype", Promise)
], ThumbnailGeneratorProcessor.prototype, "handleGenerateWepbThumbnail_dc", null);
ThumbnailGeneratorProcessor = __decorate([
    (0, bull_1.Processor)(domain_1.QueueName.THUMBNAIL_GENERATION),
    __metadata("design:paramtypes", [domain_1.MediaService])
], ThumbnailGeneratorProcessor);
exports.ThumbnailGeneratorProcessor = ThumbnailGeneratorProcessor;
//# sourceMappingURL=processors.js.map
```

That is because bull processes stack up: one is specified with concurrency 1, the others with 0, giving a total of 1 instead of the 7 introduced previously. There is much more than that, also in the "processors" subdirectory; some processors don't have concurrency specified, so they stack up by sheer number (default concurrency is 1). A couple of notes, though:

rhullah commented 1 year ago

Thanks, I changed both GENERATE_JPEG_THUMBNAIL and GENERATE_WEBP_THUMBNAIL concurrency to 1 and then ran the job again. This time it was able to go through all the images/videos and generate thumbnails with no errors. I have since restarted the container, so the values reset back. I'll just keep an eye on the logs during sync and see if there are errors in the future with new uploads.

wittymap commented 1 year ago

Just wanted to report that I am also seeing this timeout issue (exact same errors as the OP) when uploading and processing more than ~50 files at a time, running v1.58.0 on Docker on a reasonably fast Windows 10 machine (7th-gen i7 @ 2.8 GHz, 32 GB RAM).

Changing all of the concurrencies to 1 in server/libs/domain/src/job/job.constants.ts within the microservices app kept the CPU usage down and resolved the timeout issue. Limiting the CPU usage allowable for the microservices app in docker did not help.

It'd be really great if these concurrencies could be configured in the .env file instead of having to edit the source.
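
A hypothetical sketch of what that request could look like; the THUMBNAIL_CONCURRENCY variable and the handler/job names are invented for illustration, and immich ultimately exposed this via the admin settings instead (see below):

```typescript
import { Process, Processor } from '@nestjs/bull';
import { Job } from 'bull';

// Invented env var: decorator arguments are evaluated once at class
// definition time, so reading the environment here works.
const THUMBNAIL_CONCURRENCY = Number(process.env.THUMBNAIL_CONCURRENCY ?? 1);

@Processor('thumbnail-generation')
export class ThumbnailGeneratorProcessor {
  @Process({ name: 'generate-jpeg-thumbnail', concurrency: THUMBNAIL_CONCURRENCY })
  async handleJpeg(job: Job) {
    // ... generate the jpeg thumbnail ...
  }

  // Concurrency 0, so the queue's summed total stays at THUMBNAIL_CONCURRENCY.
  @Process({ name: 'generate-webp-thumbnail', concurrency: 0 })
  async handleWebp(job: Job) {
    // ... generate the webp thumbnail ...
  }
}
```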

jrasm91 commented 1 year ago

I just updated how jobs, handlers, queues, and concurrencies are configured in the server code. Maybe I can see if they can be dynamically re-configured at runtime now, which would mean they could be added to the administration > settings page.

EnochPrime commented 1 year ago

I just updated how jobs, handlers, queues, and concurrencies are configured in the server code. Maybe I can see if they can be dynamically re-configured at runtime now, which would mean they could be added to the administration > settings page.

Thanks for putting this in via #2622. I will need to investigate how this helps for my deployment.

jrasm91 commented 1 year ago

Ideally you could configure fewer jobs to run at a time, which seems to be a cause of the timeouts.

mertalev commented 11 months ago

I'm closing this as there doesn't seem to be any activity on this issue, and it seems to be more or less resolved by the ability to change concurrency dynamically.