Open sarunas-zilinskas opened 1 day ago
Thanks for the report! I think this is basically the same as #11981, right?
I did some further investigation. I started face detection again with GPU-enabled ML, and here's what I found:
From these metrics you can see that memory usage keeps climbing. That happens every time the GPU runs out of resources, and it looks like a memory leak to me. The CPU graph drops to 0%: the container sits idle once the GPU is out of resources. My guess is that it tries to restart but fails because the allocated memory is not released. In the I/O graph there is a sudden increase in reads, which is when the model is loaded into memory, but that happens only once. I would expect it to switch to CPU when it tries to reload the model, but for some reason it does not. Then, after being stuck in limbo for a while, network usage shows RX increasing again as the container receives assets for processing, but it throws back 500s and the graph climbs much more quickly than before. This implies that assets are being marked as "processed" when in fact they are not, and when I eventually checked my library, lots of faces were missing.
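To illustrate the fallback behaviour I expected, here is a minimal sketch. It is not Immich's or OpenVINO's actual code: `OutOfResources`, `infer_with_fallback`, `gpu_backend` and `cpu_backend` are hypothetical names, and the backends are plain functions standing in for real inference sessions.

```python
from typing import Callable, Optional, Sequence


class OutOfResources(RuntimeError):
    """Hypothetical error standing in for the GPU out-of-resources failure."""


# A backend is a (name, inference function) pair; both are illustrative only.
Backend = tuple[str, Callable[[bytes], list[str]]]


def infer_with_fallback(backends: Sequence[Backend], image: bytes) -> tuple[str, list[str]]:
    """Try each backend in order and return the first successful result."""
    last_error: Optional[Exception] = None
    for name, run in backends:
        try:
            return name, run(image)
        except OutOfResources as exc:
            # Remember the failure and move on to the next backend.
            last_error = exc
    raise RuntimeError("all backends failed") from last_error


def gpu_backend(image: bytes) -> list[str]:
    # Simulates the GPU path failing the way it does in this report.
    raise OutOfResources("[GPU] out of GPU resources")


def cpu_backend(image: bytes) -> list[str]:
    # Simulates a successful CPU inference returning one detected face.
    return ["face-1"]


used, faces = infer_with_fallback([("gpu", gpu_backend), ("cpu", cpu_backend)], b"image-bytes")
print(used, faces)
```

The point is only that a GPU failure is caught and the same request is retried on the CPU path, instead of every later request failing.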
Just to add some technical info: I am using an integrated GPU, an Intel UHD 605, and the host has 32 GB of RAM, so there is plenty of memory available. There are a bunch of other services/containers running, so that may skew the results somewhat, but I checked Grafana for host monitoring and usage does not get even close to 50%:
The annotations are for reference: the first marks when I restarted both containers (server and ML) and started running face detection, and the second marks when I got the GPU out-of-resources error.
The bug
Hi, first of all, Immich is an amazing platform for storing images, period! Thanks to everyone involved!
I seem to have found an issue with Immich job queuing. If the machine learning container fails for some reason, but not catastrophically, Immich does not seem to handle the 500 responses coming back from it. Here's what happens step by step:
Immich server starts running a face detection job
The machine learning container fails (in my example, I presume there is a memory leak, because a while after restarting the container I get the error "[GPU] out of GPU resources", and this happens after processing several hundred photos. However, I am not 100% certain whether this is a memory leak or whether some photo is heavy to process and the GPU runs out of resources. Either way, that is out of scope for the issue I am describing here and is a whole separate issue)
The ML container fails, but not completely. On every request from the server, the ML container responds with HTTP 500, which is logged on the server:
Side note: OpenVINO should fall back to CPU processing, but that does not seem to happen. Again, this is out of scope for this issue, but I am including it for context.
In the UI, the Immich server shows the queued assets as being processed: the count decreases far too quickly until it hits 0. Side note: I definitely pressed the "all" button on face detection. The speed of the countdown indicates that face detection has not actually run on these assets.
Immich server "flags" the assets as processed, but in fact they have not been processed, because the ML container fell over and returned 500s for the rest of the requests. The queue is down to 0, yet no faces are available in the UI:
I have reproduced this twice with the same result. The first time, some faces were processed (I guess from the photos handled before the GPU out-of-resources error occurred); the second time, no faces were available at all.
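The behaviour I would expect from the steps above can be sketched as follows. This is a simplified simulation, not Immich's actual job code: `run_face_detection`, `QueueResult` and `make_flaky_ml` are hypothetical names, and the ML container is replaced by a function that returns an HTTP status code.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class QueueResult:
    processed: list[str] = field(default_factory=list)
    failed: list[str] = field(default_factory=list)


def run_face_detection(asset_ids: Iterable[str], call_ml: Callable[[str], int]) -> QueueResult:
    """Drain the queue, marking an asset processed only on an HTTP 2xx reply."""
    result = QueueResult()
    for asset_id in asset_ids:
        status = call_ml(asset_id)
        if 200 <= status < 300:
            result.processed.append(asset_id)
        else:
            # A 5xx reply means no faces were detected for this asset,
            # so keep it out of the processed set and leave it retryable.
            result.failed.append(asset_id)
    return result


def make_flaky_ml(fail_after: int) -> Callable[[str], int]:
    """Hypothetical ML stand-in: returns 200 for the first `fail_after` calls, then 500."""
    calls = 0

    def call_ml(asset_id: str) -> int:
        nonlocal calls
        calls += 1
        return 200 if calls <= fail_after else 500

    return call_ml


result = run_face_detection(["a", "b", "c", "d"], make_flaky_ml(fail_after=2))
print("processed:", result.processed)
print("failed:", result.failed)
```

With this shape, assets "c" and "d" end up in `failed` rather than silently counting as done, so the queue would not reach 0 with faces missing.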
The OS that Immich Server is running on
Debian 11 - docker
Version of Immich Server
v1.115
Version of Immich Mobile App
not relevant
Platform with the issue
Your docker-compose.yml content
Your .env content
Reproduction steps
...
Relevant log output
No response
Additional information
I guess this affects smart search as well, but I did not test it. This discussion also seems relevant: https://github.com/immich-app/immich/discussions/6347