Closed · stalker314314 closed this issue 1 year ago
Hi @stalker314314
I was playing with this (using foreach(), count() outside of for(), array_push(), array_merge(), references), but I did not find any great improvements.. :disappointed:
I only found one small detail...
$j must be initialized to $i+1, to avoid adding self-edges like `$edges[] = array($i, $i);`
In your case, it saves 12,000 comparisons... Maybe you gain one second.. :sweat_smile:
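For reference, the $j = $i+1 point expressed in Python terms (the actual loop is PHP in CreateClustersTask; this sketch is just illustrative):

```python
def pair_indices(n):
    """Enumerate unordered pairs (i, j) with i < j."""
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):  # starting at i+1 skips the (i, i) self-pairs
            pairs.append((i, j))
    return pairs

# Starting j at i instead would add exactly n useless (i, i) edges,
# i.e. 12,000 extra comparisons for 12,000 faces.
print(pair_indices(3))  # [(0, 1), (0, 2), (1, 2)]
```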
Nope, unfortunately, even that is not possible: https://github.com/davisking/dlib/blob/f7f6f6761817f2e6e5cf10ae4235fc5742779808/tools/python/src/face_recognition.cpp#L245 (maybe it is, but I don't want to "experiment":D)
I saw it, but since it uses pre-increment (++i) there, I thought it worked like this... which is incorrect.. :disappointed: I seem to remember that in the past, the behavior of for() changed depending on whether you used pre- or post-increment.. :confused:
Obviously, I'm wrong.. :sweat_smile: Sorry. Ignore my comments. :wink:
Well... 73% of the time is spent calculating the Euclidean distance.
To analyze it yourself:
Could we add another table to keep the distances between all the faces? It could be filled progressively ... and reused continuously.
userId | Face 1 | Face 2 | Distance |
---|---|---|---|
matias | 1 | 2 | 0.4346 |
matias | 1 | 3 | 0.6654 |
matias | 1 | ... | ... |
matias | 1 | 6775 | 0.157 |
matias | 1 | ... | ... |
matias | n | 2 | 0.747 |
P.S.: this reminds me of my first database, and it tempts me to add some API to force two faces to be equal by setting this 'distance' to 0, to improve #114 :wink: :see_no_evil:
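The caching idea above, sketched as a progressively filled in-memory map (a Python dict standing in for the proposed DB table; the class and names are illustrative, not actual project code):

```python
import math

class DistanceCache:
    """Progressively filled cache of pairwise face distances,
    mimicking the proposed (userId, face1, face2, distance) table."""

    def __init__(self, descriptors):
        self.descriptors = descriptors
        self.table = {}  # (i, j) with i < j -> distance

    def distance(self, i, j):
        if i == j:
            return 0.0
        key = (min(i, j), max(i, j))   # store each unordered pair only once
        if key not in self.table:      # fill progressively, reuse forever
            a, b = self.descriptors[key[0]], self.descriptors[key[1]]
            self.table[key] = math.dist(a, b)
        return self.table[key]

cache = DistanceCache([(0.0, 0.0), (3.0, 4.0)])
print(cache.distance(0, 1))  # 5.0, computed and stored
print(cache.distance(1, 0))  # 5.0, served from the cache
```

Forcing two faces to be "equal" (the #114 idea) would then just mean overwriting one cached entry with 0.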
I think, once we move the distance calculation to pdlib, it will be a lot faster. This is what python dlib is doing anyway: https://github.com/davisking/dlib/blob/ae406bf4c119c3f6bfc8992a6bbf54d4f579fad5/tools/python/src/face_recognition.cpp#L252
Now, if that doesn't put us below 5 mins ballpark for ~20000 images...let's "cache":)
This DB would have 400M rows for 20.000 images?:D
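The expected speedup comes from doing the per-pair arithmetic in native code instead of an interpreted loop; in Python terms, it is the difference between summing squared differences by hand and calling the C-implemented `math.dist` (standing in here for pdlib's native distance; the function names are illustrative):

```python
import math

def dist_interpreted(a, b):
    """Interpreted arithmetic, element by element (the slow PHP-style path)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dist_native(a, b):
    """One call into native code per pair (the pdlib idea)."""
    return math.dist(a, b)

# Same result either way for a 128-d descriptor; only the speed differs.
a, b = [0.0] * 128, [0.1] * 128
assert math.isclose(dist_interpreted(a, b), dist_native(a, b))
print(dist_native([0.0, 0.0], [3.0, 4.0]))  # 5.0
```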
Wow, Soooo easy.. :sweat_smile:
> Now, if that doesn't put us below 5 mins ballpark for ~20000 images...let's "cache":) This DB would have 400M rows for 20.000 images?:D
Mmm.. rows = n(n-1)/2 for distinct pairs, so n = 20,000 faces => ~200M rows.. :disappointed_relieved: That is still crazy. I did some calculations, and this leaves us a limit of about 200,000 faces before we hit the integer limit in InnoDB. :open_mouth: I don't know if I did it right, but it's still crazy.. :sweat_smile: At most we should keep only the edges that meet the threshold.
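A quick check of that arithmetic (distinct pairs give n(n-1)/2, which still rounds to ~200M; the InnoDB crossover computed below assumes the row id is a signed 32-bit INT, which is my assumption, not something from the schema):

```python
INT_MAX = 2_147_483_647  # signed 32-bit integer limit (assumed key type)

def pair_count(n):
    """Number of unordered distinct face pairs, i.e. cached distance rows."""
    return n * (n - 1) // 2

print(pair_count(20_000))  # 199990000 -> ~200M rows for 20,000 faces

# Largest n whose pair table still fits under a signed INT row id:
n = 0
while pair_count(n + 1) <= INT_MAX:
    n += 1
print(n)  # 65536
```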
Hi, now that pdlib has officially incorporated the Euclidean distance calculation (https://github.com/goodspb/pdlib/releases/tag/v1.0.2), and I added real photos for more realistic tests, these are the results..
PHP Euclidean class:

```
[matias@nube nextcloud]$ time sudo -u apache php occ face:background_job -u matias -vvv
6/10 - Executing task CreateClustersTask (Create new persons or update existing persons)
Found 0 faces without associated persons for user matias and model 4
Clusters already exist, but there was some change that requires recreating the clusters
9820 faces found for clustering
5439 persons found after clustering

real 19m52,195s
user 19m45,111s
sys 0m0,582s
```
Pdlib `dlib_vector_length()` function:

```
[matias@nube nextcloud]$ time sudo -u apache php occ face:background_job -u matias -vvv
6/10 - Executing task CreateClustersTask (Create new persons or update existing persons)
Found 0 faces without associated persons for user matias and model 4
Clusters already exist, but there was some change that requires recreating the clusters
9820 faces found for clustering
5439 persons found after clustering

real 2m34,903s
user 2m30,510s
sys 0m0,497s
```
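For what it's worth, converting those two wall-clock times to seconds gives the speedup (the commas in the `time` output are locale decimal separators):

```python
php_secs = 19 * 60 + 52.195    # real 19m52,195s (PHP Euclidean class)
pdlib_secs = 2 * 60 + 34.903   # real 2m34,903s  (pdlib dlib_vector_length())
print(round(php_secs / pdlib_secs, 1))  # 7.7 -> roughly a 7.7x speedup
```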
Wow.. :smile:
awesome! this was my bottleneck (with 25k photos)
I guess this can be considered closed, since the bottleneck today is elsewhere. 😅
Thanks for everything! 😄
Edge calculation is done prior to giving edges to `dlib_chinese_whispers`, in this for loop: https://github.com/matiasdelellis/facerecognition/blob/71c8238ba534e572fa9c26942078cab0ba9b2660/lib/BackgroundJob/Tasks/CreateClustersTask.php#L208-L219

However, I have, for example, 12000 faces, and the calculation takes 20-30 minutes, which is far more than the 15 minutes that Nextcloud advises for cron jobs (and at some point in the future, we want to be able to finish any operation in 15 minutes). This is too long. I am not sure what the solution is; maybe we will need to copy the array of all descriptors down to pdlib and do the edge calculation there. We should try it out and see whether that speeds it up.
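The loop in question builds the edge list that is then handed to chinese-whispers clustering; a minimal Python sketch of that logic (the threshold value and names here are illustrative; the real code is the linked PHP):

```python
import math

THRESHOLD = 0.5  # illustrative; the real threshold lives in the PHP task

def build_edges(descriptors, threshold=THRESHOLD):
    """Connect every pair of faces whose descriptor distance is below threshold.

    This O(n^2) pass over all pairs is the part that is slow in PHP for
    ~12000 faces and is the candidate for pushing down into pdlib."""
    edges = []
    n = len(descriptors)
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(descriptors[i], descriptors[j]) < threshold:
                edges.append((i, j))
    return edges

# Faces 0 and 1 are close, face 2 is far away -> one edge.
faces = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
print(build_edges(faces))  # [(0, 1)]
```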
I guess this is not for the first version; maybe in the future.