matiasdelellis / facerecognition

Nextcloud app that implements a basic facial recognition system.
GNU Affero General Public License v3.0

Clustering fails for large number of faces #725

Open WayneBooth opened 7 months ago

WayneBooth commented 7 months ago

Expected behaviour

Consumed resources should be related to the number of new faces to process, not the number of faces already processed. Adding one new face should take the same processing power, time, and memory whether it is the first face ever processed or a new face in a database of 99 million.

Actual behaviour

Clustering takes exponentially longer the more images are added to the system, so that once around 100k images have been processed, clustering can take in excess of 4-5 hours (or fail) to add a single new face.
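The growth reported above is what you would expect from a full re-clustering pass: building the pairwise-distance graph used by chinese-whispers-style clustering (the approach dlib provides) touches every pair of faces, so the work scales roughly quadratically in the *total* face count, not the number of new faces. A minimal sketch of that arithmetic (illustrative only, not the app's actual code):

```python
# Hypothetical sketch: why re-clustering cost grows with total faces,
# not with new faces. Building a pairwise-distance graph compares each
# face against every other face once, i.e. n*(n-1)/2 comparisons.

def count_pairwise_comparisons(n_faces: int) -> int:
    """Number of distance computations for a full pairwise graph."""
    return n_faces * (n_faces - 1) // 2

# Going from 10k to 100k faces means ~100x more comparisons, not 10x.
print(count_pairwise_comparisons(10_000))   # 49995000
print(count_pairwise_comparisons(100_000))  # 4999950000
```

So even though only one face is new, the run at 100k faces does roughly a hundred times the work of the run at 10k faces, which matches the reported blow-up in time and memory.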

Steps to reproduce

1. Perform face recognition on images in 10k batches.
2. See that on the next run, the clustering steps take exponentially longer.
3. After 100k images, the clustering will start to fail.

Server configuration

Logs

None logged

Background task log with debug.

sudo -u apache php occ -vvv face:background_job

```
1/8 - Executing task CheckRequirementsTask (Check all requirements)
2/8 - Executing task CheckCronTask (Check that service is started from either cron or from command)
3/8 - Executing task DisabledUserRemovalTask (Purge all the information of a user when disable the analysis.)
4/8 - Executing task StaleImagesRemovalTask (Crawl for stale images (either missing in filesystem or under .nomedia) and remove them from DB)
5/8 - Executing task CreateClustersTask (Create new persons or update existing persons)
Face clustering will be recreated with new information or changes
65831 faces found for clustering
Killed
```
matiasdelellis commented 5 months ago

Hi @WayneBooth. Unfortunately, everything you say is true. 😞 I am looking into how to optimize these cases, but today the clustering is not progressive, and the time and memory consumption can be excessive.

vwbusguy commented 1 month ago

This was fixed for me by #712 in the new release. The sweet spot for me was setting the batch size to 20k. It runs much faster with far fewer resources, and the clusters are close to what you get without batching.
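A rough sketch of why batching helps (illustrative arithmetic only, assuming the same pairwise-comparison model as above; the batch size of 20k comes from the comment, the function names are hypothetical): clustering fixed-size batches independently bounds each run at O(batch²) comparisons instead of O(n²) over the whole library.

```python
# Hypothetical comparison of unbatched vs. batched clustering cost,
# counting pairwise distance computations.

def comparisons(n: int) -> int:
    """Full pairwise graph over n faces: n*(n-1)/2 comparisons."""
    return n * (n - 1) // 2

def batched_comparisons(total: int, batch: int) -> int:
    """Cluster each fixed-size batch independently, plus one
    smaller batch for the remainder."""
    full_batches, remainder = divmod(total, batch)
    return full_batches * comparisons(batch) + comparisons(remainder)

total = 100_000
print(comparisons(total))                  # 4999950000 (unbatched)
print(batched_comparisons(total, 20_000))  # 999950000 (five 20k batches)
```

With 100k faces and a 20k batch, this toy model does about one fifth of the unbatched work, and the gap widens as the library grows, which is consistent with the speed-up reported. The trade-off, as noted, is that batch-local clusters are only an approximation of a global clustering pass.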