Closed haldi4803 closed 11 months ago
There is obviously some type of memory issue. The "fetch failed" error also usually means the machine-learning URL is misconfigured or the container is unreachable. Can you confirm that the container is up and that machine learning works for a single photo when you regenerate its thumbnail?
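For example, something along these lines should show whether the machine-learning container is up and reachable. This is just a sketch; it assumes the default compose container/service names, the default port 3003, and that the service answers on a /ping health endpoint, so adjust it to your setup:

# Check that the ML container is actually running (name may differ in your compose file)
docker ps --filter "name=immich_machine_learning"

# Hit the health endpoint from the host, assuming port 3003 is published there
curl -v http://localhost:3003/ping

# Or test from inside the server container using the compose service name
# (only works if curl is available in that image)
docker exec -t immich_server curl -v http://immich-machine-learning:3003/ping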
Uhm... okay...
I've paused everything in the Job Status. Everything is still normal. Then I resume "Generate Thumbnails" and boom, RAM goes full.
Server:
[Nest] 6 - 11/23/2023, 4:19:08 PM LOG [CommunicationRepository] Websocket Disconnect: wXLrI6T_AHBx-iU1AAAD
[Nest] 6 - 11/23/2023, 5:50:58 PM LOG [CommunicationRepository] Websocket Connect: qhRn9z0Uzv7ljOtWAAAF
[Nest] 6 - 11/23/2023, 5:51:58 PM LOG [CommunicationRepository] Websocket Disconnect: qhRn9z0Uzv7ljOtWAAAF
[Nest] 6 - 11/23/2023, 5:51:58 PM LOG [CommunicationRepository] Websocket Connect: rPCHd_E5BrfkQJxyAAAH
[Nest] 6 - 11/23/2023, 5:52:56 PM LOG [CommunicationRepository] Websocket Disconnect: rPCHd_E5BrfkQJxyAAAH
[Nest] 6 - 11/23/2023, 5:52:56 PM LOG [CommunicationRepository] Websocket Connect: meMbMJLW3yn23yz7AAAJ
Microservices:
[Nest] 7 - 11/23/2023, 5:52:35 PM LOG [NestFactory] Starting Nest application...
[Nest] 7 - 11/23/2023, 5:52:35 PM LOG [InstanceLoader] TypeOrmModule dependencies initialized +46ms
[Nest] 7 - 11/23/2023, 5:52:35 PM LOG [InstanceLoader] BullModule dependencies initialized +0ms
[Nest] 7 - 11/23/2023, 5:52:35 PM LOG [InstanceLoader] ConfigHostModule dependencies initialized +1ms
[Nest] 7 - 11/23/2023, 5:52:35 PM LOG [InstanceLoader] DiscoveryModule dependencies initialized +0ms
[Nest] 7 - 11/23/2023, 5:52:35 PM LOG [InstanceLoader] ScheduleModule dependencies initialized +0ms
[Nest] 7 - 11/23/2023, 5:52:35 PM LOG [InstanceLoader] ConfigModule dependencies initialized +7ms
[Nest] 7 - 11/23/2023, 5:52:35 PM LOG [InstanceLoader] BullModule dependencies initialized +0ms
[Nest] 7 - 11/23/2023, 5:52:35 PM LOG [InstanceLoader] BullModule dependencies initialized +0ms
[Nest] 7 - 11/23/2023, 5:52:35 PM LOG [InstanceLoader] TypeOrmCoreModule dependencies initialized +334ms
[Nest] 7 - 11/23/2023, 5:52:35 PM LOG [InstanceLoader] TypeOrmModule dependencies initialized +0ms
[Nest] 7 - 11/23/2023, 5:52:35 PM LOG [InstanceLoader] InfraModule dependencies initialized +3ms
[Nest] 7 - 11/23/2023, 5:52:35 PM LOG [InstanceLoader] DomainModule dependencies initialized +22ms
[Nest] 7 - 11/23/2023, 5:52:35 PM LOG [InstanceLoader] MicroservicesModule dependencies initialized +0ms
[Nest] 7 - 11/23/2023, 5:52:48 PM LOG [MetadataService] Initialized local reverse geocoder with cities500
[Nest] 7 - 11/23/2023, 5:52:48 PM LOG [SearchService] Running bootstrap
[Nest] 7 - 11/23/2023, 5:52:48 PM LOG [TypesenseRepository] Schema up to date: assets/assets-v10
[Nest] 7 - 11/23/2023, 5:52:48 PM LOG [TypesenseRepository] Schema up to date: albums/albums-v2
[Nest] 7 - 11/23/2023, 5:52:48 PM LOG [TypesenseRepository] Schema up to date: faces/faces-v1
[Nest] 7 - 11/23/2023, 5:52:48 PM LOG [TypesenseRepository] Alias mapping: [{"collection_name":"faces-v1","name":"faces"},{"collection_name":"albums-v2","name":"albums"},{"collection_name":"assets-v10","name":"assets"}]
[Nest] 7 - 11/23/2023, 5:52:48 PM LOG [TypesenseRepository] Collections needing migration: {"assets":false,"albums":false,"faces":false}
[Nest] 7 - 11/23/2023, 5:52:48 PM LOG [NestApplication] Nest application successfully started +65ms
[Nest] 7 - 11/23/2023, 5:52:48 PM LOG [ImmichMicroservice] Immich Microservices is listening on http://[::1]:3002 [v1.88.2] [PRODUCTION]
[Nest] 7 - 11/23/2023, 5:55:46 PM LOG [MediaService] Successfully generated JPEG image thumbnail for asset 7f1266f4-98c6-4b61-8eaa-7d53765fc806
[Nest] 7 - 11/23/2023, 5:55:47 PM LOG [MediaService] Successfully generated JPEG image thumbnail for asset fc3e14ea-5230-45d9-a4e1-0577075231c6
[Nest] 7 - 11/23/2023, 5:55:56 PM LOG [MediaService] Successfully generated JPEG image thumbnail for asset bdfd3b89-79a5-41b4-a2ca-aaf10bb055ef
[Nest] 7 - 11/23/2023, 5:55:56 PM LOG [MediaService] Successfully generated JPEG image thumbnail for asset 4fd3beca-27b4-4114-b4bd-cbd7958e0c20
[Nest] 7 - 11/23/2023, 5:56:10 PM LOG [MediaService] Successfully generated JPEG image thumbnail for asset 78a51376-2434-49f5-8439-0105c66d7a1f
[Nest] 7 - 11/23/2023, 5:56:11 PM LOG [MediaService] Successfully generated JPEG image thumbnail for asset a5086c9c-2ebc-48e7-89eb-8b4680368fb2
[Nest] 7 - 11/23/2023, 5:56:55 PM LOG [MediaService] Successfully generated JPEG image thumbnail for asset 912da659-c434-4aa8-b4ec-92933b15cb23
[Nest] 7 - 11/23/2023, 5:56:59 PM LOG [MediaService] Successfully generated JPEG image thumbnail for asset aa59021e-c7a7-4329-ac79-3abfe42a8134
[Nest] 7 - 11/23/2023, 5:56:59 PM LOG [MediaService] Successfully generated JPEG image thumbnail for asset 5f8a2c1d-de13-43f9-b049-16f789179ef0
[Nest] 7 - 11/23/2023, 5:57:07 PM LOG [MediaService] Successfully generated JPEG image thumbnail for asset d5950662-7efb-49b2-b842-267d43707471
[Nest] 7 - 11/23/2023, 5:57:22 PM LOG [MediaService] Successfully generated JPEG image thumbnail for asset 9bdd0b9d-2455-479f-9934-796438877054
Machine-Learning:
[11/23/23 15:32:57] INFO Starting gunicorn 21.2.0
[11/23/23 15:32:57] INFO Listening at: http://0.0.0.0:3003 (9)
[11/23/23 15:32:57] INFO Using worker: uvicorn.workers.UvicornWorker
[11/23/23 15:32:58] INFO Booting worker with pid: 17
[11/23/23 15:34:58] CRITICAL WORKER TIMEOUT (pid:17)
[11/23/23 15:34:59] ERROR Worker (pid:17) was sent SIGKILL! Perhaps out of memory?
[11/23/23 15:34:59] INFO Booting worker with pid: 26
[11/23/23 15:36:40] INFO Created in-memory cache with unloading after 300s of inactivity.
[11/23/23 15:36:40] INFO Initialized request thread pool with 8 threads.
[11/23/23 15:45:47] INFO Loading facial recognition model 'buffalo_l'
[11/23/23 15:45:52] INFO Loading image classification model 'microsoft/resnet-50'
Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.
Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.
/opt/venv/lib/python3.11/site-packages/transformers/models/convnext/feature_extraction_convnext.py:28: FutureWarning: The class ConvNextFeatureExtractor is deprecated and will be removed in version 5 of Transformers. Please use ConvNextImageProcessor instead.
warnings.warn(
[11/23/23 15:45:54] INFO Loading clip model 'ViT-B-32__openai'
[11/23/23 15:54:14] INFO Shutting down due to inactivity.
[11/23/23 15:54:22] ERROR Worker (pid:26) exited with code 1
[11/23/23 15:54:22] ERROR Worker (pid:26) exited with code 1.
[11/23/23 15:54:22] INFO Booting worker with pid: 70
[11/23/23 15:54:47] INFO Created in-memory cache with unloading after 300s of inactivity.
[11/23/23 15:54:47] INFO Initialized request thread pool with 8 threads.
Anything useful in there?
I'm also a user, and my library is around 600 GB.
From your logs, it looks like ML is still only doing image classification; in my experience the load gets huge during face recognition and merging.
Thumbnail generation shouldn't make memory run out. What I have run into before is repeated Typesense restarts and massive memory usage.
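If it helps to narrow things down, watching per-container usage while a job runs shows which service the memory actually goes to. A sketch, assuming the default container names from the Immich compose file:

# One-off snapshot of memory/CPU per container
docker stats --no-stream immich_microservices immich_machine_learning immich_typesense

# Or append a snapshot every 10 seconds to a file while the job runs
while true; do
  date >> stats.log
  docker stats --no-stream >> stats.log
  sleep 10
done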
Suggestion:
I think that might have happened because, while Immich was creating thumbnails, I deleted the @eaDir folders from my NAS. I had 114'000 images in the library instead of the 18'000 that actually exist.
Could that bug happen when Immich tries to access a file that does not exist anymore?
I did a full wipe and clean start, and it does not seem to happen anymore.
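For reference, the Synology @eaDir thumbnail folders can be removed before importing, roughly like this. This is only a sketch with a placeholder library path; double-check what the dry run matches before deleting anything:

# Dry run: list @eaDir folders under the library path (placeholder path)
find /volume1/photos -type d -name "@eaDir" -prune -print

# Remove them once the list looks correct
find /volume1/photos -type d -name "@eaDir" -prune -exec rm -rf {} +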
Today, I encountered the same issue. After bulk importing around 10,000 photos, I suddenly couldn't remotely connect to my NAS host. Initially, I suspected that the ML service was causing high memory usage during facial recognition, so I forcibly restarted the NAS and reconfigured the concurrency settings, setting all jobs to run on a single thread. I didn't start all the jobs at once; instead, I ran FACE DETECTION and then GENERATE THUMBNAILS sequentially. There were no abnormalities during the FACE DETECTION run, which suggests the issue wasn't caused by the ML operations. After all face detection had completed, I started the GENERATE THUMBNAILS job, which ran slowly due to the large image sizes. I wasn't monitoring it continuously; when I came back to check, the NAS was unreachable remotely and the system log showed the following error messages:
[17452.279485] Out of memory: Killed process 24344 (immich_microser) total-vm:14375660kB, anon-rss:8827768kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:23516kB oom_score_adj:0
[17653.570542] Out of memory: Killed process 24972 (immich_microser) total-vm:13119600kB, anon-rss:8825796kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:20652kB oom_score_adj:0
[17848.345248] Out of memory: Killed process 26162 (immich_microser) total-vm:19086052kB, anon-rss:8709080kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:31688kB oom_score_adj:0
[17932.668353] Out of memory: Killed process 27348 (immich_microser) total-vm:13774836kB, anon-rss:8759228kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:22648kB oom_score_adj:0
[18001.034451] Out of memory: Killed process 27902 (immich_microser) total-vm:14749840kB, anon-rss:9111036kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:25104kB oom_score_adj:0
This suggests that there might be a memory leak in the immich_microservices service. I would greatly appreciate any advice on how to address this problem.
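To confirm it really is the kernel OOM killer doing the killing (and to see which process it picks), the kernel log can be checked roughly like this; journalctl only exists on systemd-based hosts:

# Recent kernel OOM-killer events with human-readable timestamps
dmesg -T | grep -i "out of memory"

# Same information from the journal on systemd-based systems
journalctl -k --no-pager | grep -i "out of memory"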
Hello, I've been having this same issue for a couple of weeks. I learned you could tweak facial recognition, so on a fresh install I let all the jobs finish, then tweaked the settings and re-ran the jobs one at a time.
Now I run into this exact issue every couple of days: my RAM and swap trickle up to 100% when a few photos are uploaded, and then the Immich container stops functioning until it is restarted. I've tried assigning more swap and RAM, but that only prolongs the issue.
My fix for now is just restarting the container when it happens. I don't have anything else in that container, so it's an isolated issue; the rest of my self-hosted stuff keeps operating normally.
Happy to share logs or my config files; they're mostly default. I'm not sure how to access the logs.
Cheers
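If it helps: the logs can usually be pulled straight from Docker. The container names below are the defaults from the Immich compose file and may differ in your setup:

# Tail and follow the last 200 lines of each Immich container
docker logs --tail 200 -f immich_server
docker logs --tail 200 -f immich_microservices
docker logs --tail 200 -f immich_machine_learning

# Restart only the microservices container when it locks up
docker restart immich_microservices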
The bug
Hello,
I've set up an Immich container with an existing library of about 150 GB, so I was expecting heavy load. But after 1-2 days I realised that my whole system was overloaded and not a single Docker container could be reached. The system was so busy that an SSH connection took minutes to respond! After a reboot I realised that immich-microservices had filled the RAM and the swap. Whenever this happened again, I simply killed the immich-microservices container and after a few minutes the system came back alive.
I tried to limit the RAM usage of the Docker container. That did not help... the RAM of the Docker container itself does NOT go over the limit, but the system still uses shittons of RAM anyway. So I upgraded from 8 to 16 GB of RAM... thought that would help. But nope, it just took a little longer until it happened.
Here is a screenshot from shortly before the system becomes unresponsive, showing docker stats (within limits), htop (full swap and CPU usage), and the Synology DSM RAM usage.
Here is a screenshot from one minute prior.
Here is a snippet from the log after I used docker kill to shut down the container.
The memory limitation on the container itself seems to be working. But is it the inter-container transport of such huge amounts of data that causes the issues?
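One possible explanation for "docker stats stays within limits but the host still fills RAM and swap": if only a memory limit is set, Docker by default still allows the container to use roughly the same amount of swap on top of it. A sketch of capping both on the running container, assuming the default container name:

# Cap RAM at 4 GB and set the combined RAM+swap limit to the same value,
# which effectively forbids extra swap usage for this container
docker update --memory 4g --memory-swap 4g immich_microservices

# Verify which limits are actually in effect (values are in bytes)
docker inspect immich_microservices \
  --format 'mem={{.HostConfig.Memory}} swap={{.HostConfig.MemorySwap}}'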
Log from Machine-Learning.
Log from Server.
Any other logs I should provide?
The OS that Immich Server is running on
Synology DSM 7.2.1
Version of Immich Server
1.88.2
Version of Immich Mobile App
v0
Platform with the issue
Your docker-compose.yml content
Your .env content
Reproduction steps
Additional information
No response