Closed: raisinbear closed this issue 11 months ago
Is the issue reproducible multiple times?
We didn't change anything related to the thumbnail generation mechanism. Judging from the message you shared, it might be related to communicating with the database:
immich_postgres | 2023-03-07 12:45:44.963 UTC [1598] LOG: could not receive data from client: Connection reset by peer
immich_postgres | 2023-03-07 12:45:44.964 UTC [1599] LOG: could not receive data from client: Connection reset by peer
Yeah, looks like the actual thumbnail is generated successfully, but saving the file to the database is failing with a timeout error.
Yes, it’s reproducible. At first I thought it was a hiccup and that my server was under load from elsewhere, but I tried multiple times and it always happens when uploading many images at once (I don’t know where the limit is exactly). About the postgres message in the last two lines, I’ll check again tomorrow and see if I can supply more info. But apart from this issue, everything is running smoothly and I didn’t notice anything pointing to an issue with the database or database container.
Ok, I did some more tests. I can reliably reproduce this on a 24-core Threadripper machine in a Debian VM with a fresh setup as well, when running a simple stress -c 24 during upload. When the CPUs are idle (on the Threadripper machine) during import, thumbnail generation and metadata extraction run as expected without issue.
I also tried to get more from the logs, but this is all I get:
@jrasm91, thanks for the inquiry. Right, as far as I understand these lines, any error during the sharp resizing process would be covered by the catch above, and I never see this warning. However, I can't seem to follow the code much further to understand what exactly is supposed to happen during the .save() call and where the timeout could originate from.
I don't know if it helps, but it seems that webp thumbnails are created successfully.
Thanks for the help!
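For orientation, the flow being discussed is roughly "resize with sharp, then persist the updated asset record". A minimal sketch of that shape, where only sharp's API is real and the repository/field names are placeholders rather than Immich's actual code:

```ts
import sharp from 'sharp';

// Placeholder repository type; the real persistence layer in Immich differs.
interface AssetRepository {
  save(asset: { id: string; resizePath: string }): Promise<unknown>;
}

export async function generateJpegThumbnail(
  originalPath: string,
  jpegPath: string,
  assetId: string,
  repository: AssetRepository,
) {
  // 1) The actual resize: this step appears to succeed.
  await sharp(originalPath)
    .resize(1440, 2560, { fit: 'outside' })
    .jpeg()
    .toFile(jpegPath);

  // 2) Persisting the new path: this is the database write that seems to
  //    run into "Connection terminated due to connection timeout".
  await repository.save({ id: assetId, resizePath: jpegPath });
}
```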
Brief update: I was trying to understand how the processing works. I can't say I fully do, but I changed the concurrency setting in server/apps/microservices/src/processor.ts for the JobName.GENERATE_JPEG_THUMBNAIL and JobName.GENERATE_WEBP_THUMBNAIL processes from 3 to 1 (lines 116 and 121 in current main).
Also, I introduced a probably redundant concurrency: 1 in line 151 of [...]/src/processors/metadata-extraction.processor.ts. A rough sketch of both changes follows below.
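Illustrative only: the decorator options shown here are @nestjs/bull's, and the job/queue/handler names match the ones that appear later in this thread, but the import path and surrounding code are assumptions, not a copy of processor.ts:

```ts
import { Process, Processor } from '@nestjs/bull';
import { Job } from 'bull';
// JobName/QueueName come from Immich's domain code; this import path is assumed.
import { JobName, QueueName } from '@app/domain';

@Processor(QueueName.THUMBNAIL_GENERATION)
export class ThumbnailGeneratorProcessor {
  // concurrency lowered from 3 to 1
  @Process({ name: JobName.GENERATE_JPEG_THUMBNAIL, concurrency: 1 })
  async handleGenerateJpegThumbnail(job: Job) {
    /* ... handler body unchanged ... */
  }

  // concurrency lowered from 3 to 1
  @Process({ name: JobName.GENERATE_WEBP_THUMBNAIL, concurrency: 1 })
  async handleGenerateWepbThumbnail(job: Job) {
    /* ... handler body unchanged ... */
  }
}
```

The same pattern would apply to adding concurrency: 1 on the metadata-extraction processor.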
I transferred these changes directly into the corresponding .js files in the microservices container on my Raspberry Pi and uploaded 16+ images at once (the same sequence of images that always failed before) several times, deleting them in between. No timeouts 😀. I have no idea if that is just symptomatic treatment; it doesn't seem like a viable root cause. But thumbnail generation and metadata extraction now even succeed while running stress -c 4 plus forced heavy traffic from another dockerized service. The latter is anecdotal, as I only tried it once, but before, roughly but reliably a quarter of the jobs never completed even with all other services shut down.
Does that make any sense to you?
Dove a little deeper: as per the bull documentation, concurrencies stack up:
```js
/*
 * For each named processor, concurrency stacks up, so any of these three process functions
 * can run with a concurrency of 125. To avoid this behaviour you need to create an own queue
 * for each process function.
 */
const loadBalancerQueue = new Queue('loadbalancer');
loadBalancerQueue.process('requestProfile', 100, requestProfile);
loadBalancerQueue.process('sendEmail', 25, sendEmail);
loadBalancerQueue.process('sendInvitation', 0, sendInvite);
```
That means before my change it was doing 6 thumbnail generations in parallel, plus 4 metadata extractions if I calculated correctly (a missing concurrency specifier defaults to 1, according to the docs), plus 2 video transcodings if there are any (there weren't in the tests above). I checked via an added logger.warn() line in media.service.ts (roughly along the lines of the sketch below), and indeed, with my double concurrency: 1 modification, two thumbnail generations are done in parallel. If I set concurrency to 0 in line 121, thumbnail generations happen one after the other. Together with concurrency: 1 on videos, this actually gave me an overall speedup of 30% over the modification above combined with video concurrency: 2 (this time 2 videos + 16 images).
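The logger.warn() check was along these lines: a hypothetical helper (not the actual media.service.ts code) that just counts how many thumbnail jobs overlap at any moment:

```ts
import { Logger } from '@nestjs/common';

const logger = new Logger('MediaService');
let activeThumbnailJobs = 0;

// Wrap a thumbnail handler to log how many of them run concurrently.
export async function withConcurrencyLog<T>(work: () => Promise<T>): Promise<T> {
  activeThumbnailJobs += 1;
  logger.warn(`thumbnail generations running in parallel: ${activeThumbnailJobs}`);
  try {
    return await work();
  } finally {
    activeThumbnailJobs -= 1;
  }
}
```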
I still don't know where the timeout originated from. I could speculate that jobs begin to stall because too many are running in parallel on a machine without enough resources, but this is just guesswork.
Timeout happens in my instance too. It would be good if these values could be put in the env.
I'm running into these timeouts as well very consistently. Can confirm that bull stacks the concurrency. On v1.52.0 it says 7 thumbnail tasks are running.
According to discord user mudone these errors may be a result of the database timeout.
I tried changing that setting, too, but raising the timeout didn’t do anything for me. The timeout error might be symptomatic? I don’t understand the reason exactly, but with lowered concurrency the errors don’t occur in my instance.
The user on discord had success with a 60s timeout, but I do agree that it is probably more of a symptom. If things are running smoothly 10s should be plenty of time.
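For context, the timeout being raised here is presumably node-postgres' client-side connection timeout; in plain pg that setting looks roughly like the snippet below. Immich configures the database through its own config/TypeORM layer, so the exact location differs, and the host name here is just an example:

```ts
import { Pool } from 'pg';

const pool = new Pool({
  host: 'immich_postgres',          // example host name
  connectionTimeoutMillis: 60_000,  // 60s instead of the shorter value discussed above
});
```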
Maybe it's related to the CPU being swamped by the microservices container, and throttling its usage would help prevent the issue.
Right. How would you go about this other than lowering concurrency? At least for me there are no other services running anymore, but apparently 7 thumbnail creations plus the "small" stuff like metadata extraction etc. in parallel is enough to exhaust the CPU :/ even without videos coming in, which are by default processed in pairs, too.
My microservices container has been running with restricted resources, but I lessened these errors by expanding the resources available. I was not running into this before v1.50.
That being said, I should probably run a test with nothing else running to make sure it is not a case of other services competing for CPU cycles.
https://docs.docker.com/compose/compose-file/compose-file-v3/#resources
Wow, didn’t even think of that 🙈. Will try, but as @EnochPrime reports it doesn’t seem to resolve the issue but might actually make it worse. Could it have to do with stalling of the jobs? Sadly, I’ve no experience with bull, merely guessing from what I find 😐
I updated to v1.53.0 and also deployed to a node with more available resources. I am still seeing these errors, but the microservices container has not shut down and it appears to be making progress.
I recently upgraded from v1.51.2 to v1.53.0 and ran the Generate Thumbs job due to the recent change in folder structure, and I'm seeing these errors too. I also now have a bunch of missing thumbnails and full-size images due to these errors. Is there anything I can do to ensure the jobs don't time out and instead succeed? I'm also on a Raspberry Pi, so resources might be limited, but I didn't see much stress on the system while the job was running. I'm wondering if my issue is more of a slow-to-write storage path than a resource (CPU/RAM) issue.
I'm no longer sure it's a slow storage location issue. I've volume-mapped a much faster location for the thumbs/... path and I'm still receiving the "Connection terminated due to connection timeout" error that comes with the "Failed to generate thumbnail for asset" error message.
I resolved this issue by deploying on my desktop, which compared to the previous machine has the same memory but many more CPU resources available. All files remained on the previous machine and were accessed/written via a network share. So this seems CPU-bound instead of storage-related. Generating ~10k thumbnails took several hours of moderate CPU usage. Prior to using my desktop, I saw the same behavior as others: failed thumbnails, connection timeouts, and a persistently crashing microservices container.
My CPU sits there with hardly any usage while still getting these errors. It's as if Postgres just fell asleep or something because the timeouts are coming from the PG client:
Error: Connection terminated due to connection timeout
at Connection.<anonymous> (/usr/src/app/node_modules/pg/lib/client.js:132:73)
at Object.onceWrapper (node:events:641:28)
at Connection.emit (node:events:527:28)
at Socket.<anonymous> (/usr/src/app/node_modules/pg/lib/connection.js:57:12)
at Socket.emit (node:events:527:28)
at TCP.<anonymous> (node:net:709:12)
The other thing that confuses me is that attempting to Generate Thumbnails for only those that are missing seems to do nothing. It's as if the ones that are erroring are still getting marked as completed because nothing seems to be running when I click the "Missing" button for the Generate Thumbnails job.
@rhullah, as I think I wrote further up, I could only keep this in check with manual changes to the .js files in the microservices container to lower the overpowering level of concurrency. Doing that, I never saw the issue again, even on a Raspberry Pi 2. However, this is only a temporary fix and the opposite of set-and-forget, as recreating or updating the container will undo the modifications. A stronger machine definitely helps, but I also experienced it on a Raspberry Pi 4 a couple of times with the stock settings.
Do you get any successful thumbnails before the failures start @rhullah? I similarly saw little CPU usage when getting the errors and a seemingly useless "Missing" button.
I seemed to generate a few successful thumbs then it would consistently have the timeout and throw error logs. Then after a longer time, it would seem that Postgres would wake up and it would start successfully creating thumbs again. As a result, some images in Immich would be missing thumbs (on the main library page) and missing the detailed image (when clicking on a specific item).
This is an issue since we added Typesense and rewrote the machine learning in Python, with the combined CPU usage of machine learning + video transcoding + thumbnail generation. If your CPU is not powerful enough, the running processes hog it and cannot be completed in time (hence the timeout notification). I am trying to think about how to manage the queue better to help alleviate this issue and let slower/less powerful devices run all the jobs successfully, even with a slower completion time.
Would this be the case even if I have Machine Learning disabled? Because I do. I was getting restarts of the Machine Learning container (before I ran the template path job), so I disabled that container in the compose file and set it to false in the .env file.
And, does video transcoding occur in the "Generate Thumbnails" job? I'm not uploading new assets, only trying to "fix" the template paths so that they are in the new location.
@rhullah, as I think I wrote further up, I could only keep this in check with manual changes to the .js files in the microservices container to lower the overpowering level of concurrency. Doing that, I never saw the issue again, even on a Raspberry Pi 2. However, this is only a temporary fix and the opposite of set-and-forget, as recreating or updating the container will undo the modifications. A stronger machine definitely helps, but I also experienced it on a Raspberry Pi 4 a couple of times with the stock settings.
Yeah, I did notice that. I wasn't sure which file(s) were updated where, but I was trying to look into it. I wouldn't mind changing it, even temporarily, just to get past this update to the new template paths.
If you’re interested in tinkering, some of the parallelism settings are in here: immich_microservices:/usr/src/app/dist/apps/microservices/apps/microservices/src/processors.js
The lower part of this file looks as follows for me:
```js
__decorate([
    (0, bull_1.Process)({ name: domain_1.JobName.QUEUE_GENERATE_THUMBNAILS, concurrency: 1 }),
    __metadata("design:type", Function),
    __metadata("design:paramtypes", [Object]),
    __metadata("design:returntype", Promise)
], ThumbnailGeneratorProcessor.prototype, "handleQueueGenerateThumbnails", null);
__decorate([
    (0, bull_1.Process)({ name: domain_1.JobName.GENERATE_JPEG_THUMBNAIL, concurrency: 0 }),
    __metadata("design:type", Function),
    __metadata("design:paramtypes", [Object]),
    __metadata("design:returntype", Promise)
], ThumbnailGeneratorProcessor.prototype, "handleGenerateJpegThumbnail", null);
__decorate([
    (0, bull_1.Process)({ name: domain_1.JobName.GENERATE_JPEG_THUMBNAIL_DC, concurrency: 0 }),
    __metadata("design:type", Function),
    __metadata("design:paramtypes", [Object]),
    __metadata("design:returntype", Promise)
], ThumbnailGeneratorProcessor.prototype, "handleGenerateJpegThumbnail_dc", null);
__decorate([
    (0, bull_1.Process)({ name: domain_1.JobName.GENERATE_WEBP_THUMBNAIL, concurrency: 0 }),
    __metadata("design:type", Function),
    __metadata("design:paramtypes", [Object]),
    __metadata("design:returntype", Promise)
], ThumbnailGeneratorProcessor.prototype, "handleGenerateWepbThumbnail", null);
__decorate([
    (0, bull_1.Process)({ name: domain_1.JobName.GENERATE_WEBP_THUMBNAIL_DC, concurrency: 0 }),
    __metadata("design:type", Function),
    __metadata("design:paramtypes", [Object]),
    __metadata("design:returntype", Promise)
], ThumbnailGeneratorProcessor.prototype, "handleGenerateWepbThumbnail_dc", null);
ThumbnailGeneratorProcessor = __decorate([
    (0, bull_1.Processor)(domain_1.QueueName.THUMBNAIL_GENERATION),
    __metadata("design:paramtypes", [domain_1.MediaService])
], ThumbnailGeneratorProcessor);
exports.ThumbnailGeneratorProcessor = ThumbnailGeneratorProcessor;
//# sourceMappingURL=processors.js.map
```
That is because bull processes stack up: one is specified with concurrency 1, the others with 0, giving a total of 1 instead of the 7 introduced previously. There is much more than that, also in the "processors" subdirectory. Some processors don't have concurrency specified, so they stack up by sheer number (the default concurrency is 1). A couple of notes, though:
Thanks, I changed both GENERATE_JPEG_THUMBNAIL and GENERATE_WEBP_THUMBNAIL concurrency to 1 and then ran the job again. This time it was able to go through all the images/videos and generate thumbnails with no errors. I have since restarted the container so that it reset the values back. I'll just keep an eye on the logs during sync and see if there are errors in the future with new uploads.
Just wanted to report that I am also seeing this timeout issue (exact same errors as OP) when uploading and processing more than ~50 files at a time. Running v1.58.0 on Docker on a reasonably fast Windows 10 machine (7th-gen i7 @ 2.8 GHz, 32 GB RAM).
Changing all of the concurrencies to 1 in server/libs/domain/src/job/job.constants.ts within the microservices app kept the CPU usage down and resolved the timeout issue. Limiting the CPU usage allowed for the microservices app in Docker did not help.
It'd be really great if these concurrencies could be configured in the .env file instead of having to edit the source.
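A sketch of what that could look like: per-job concurrency read from environment variables with defaults. The variable names and the shape of job.constants.ts here are assumptions, not the actual implementation:

```ts
// Hypothetical: per-queue concurrency pulled from the environment.
const fromEnv = (name: string, fallback: number): number => {
  const value = Number(process.env[name]);
  return Number.isFinite(value) && value >= 0 ? value : fallback;
};

export const JOB_CONCURRENCY = {
  jpegThumbnail: fromEnv('JPEG_THUMBNAIL_CONCURRENCY', 1),
  webpThumbnail: fromEnv('WEBP_THUMBNAIL_CONCURRENCY', 1),
  metadataExtraction: fromEnv('METADATA_EXTRACTION_CONCURRENCY', 1),
  videoTranscode: fromEnv('VIDEO_TRANSCODE_CONCURRENCY', 1),
};
```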
I just updated how jobs, handlers, queues, and concurrencies are configured in the server code. Maybe I can see if they can be dynamically re-configured at runtime now, which would mean they could be added to the administration > settings page.
Thanks for putting this in via #2622. I will need to investigate how this helps for my deployment.
Ideally you could configure fewer jobs to run at a time, which seems to be a cause of the timeouts.
I'm closing this as there doesn't seem to be any activity on this issue, and it seems to be more or less resolved by the ability to change concurrency dynamically.
The bug
Hi,
As far as I'm aware this is new to one of the more recent releases, as I hadn't encountered this issue before. In short: when uploading several images (I have tested with 10 .heic photos, for instance) via the mobile app / CLI / web, I get quite a number of timeout errors from the microservices container à la:
Once the CPU load goes down, a lot of thumbnails and metadata are missing. I assume this is partly down to my server being generally slow to process the images and/or being utilized by other services at the time. At the same time, the timeout it runs into seems overly strict; the server isn't really that slow 😅. Also, even if a lot of thumbnails / metadata are still missing, I can trigger creation in the web UI and that always succeeds.
I'll try to formulate my questions / suggestions in a structurally coherent way:
Thank you guys so much!
The OS that Immich Server is running on
Raspbian Buster (32-bit)
Version of Immich Server
v.1.50.1
Version of Immich Mobile App
v.1.50.0
Platform with the issue
Your docker-compose.yml content
Your .env content
Reproduction steps
Additional information
No response