immich-app / immich

High performance self-hosted photo and video management solution.
https://immich.app
GNU Affero General Public License v3.0

Unassigned faces start showing up later #10277

Closed diyoyo closed 4 months ago

diyoyo commented 4 months ago

The bug

First, I'm sorry because I don't really understand what "Unassigned faces" is all about. I couldn't find anything in the documentation, but I just realized it is related to the following problem:

I feel like this is a bug: I can never predict the amount of work it's going to take me to tag everyone, since it keeps adding up. Why is a face unassigned in the first place? Why are some faces detected afterwards that were not detected the first time? Is there some threshold that I only meet after I've reduced the number of unknown people?

I played a little bit with the settings to understand things better, but even using a min-cluster-size of 1 and asking Immich to display all available people in the people tab (by editing repositories/person.repository.js), I still have new people showing up after I have spent time tagging people. I also tried with a smaller library (18GB instead of 500GB), but the behavior is the same, to a smaller extent of course, but still.

Thanks for the help as I am very confused.

The OS that Immich Server is running on

Debian 12

Version of Immich Server

v1.106.3

Version of Immich Mobile App

v1.106.3

Platform with the issue

Your docker-compose.yml content

Not relevant

Your .env content

Not relevant; no modification of the settings.

Reproduction steps

- Face Detection and Face Clustering is performed
- I start tagging faces, merging people, hiding others, etc.
- There are some pics where not everyone is available for tagging: either because the face is labelled as an "Unassigned face" or just because it was not detected (I guess)
- When I'm done with tagging, I run Face Detection and Face Clustering again
- Then new people appear that are available for tagging: either from the unassigned faces or from the pool that was missed before.

Relevant log output

No response

Additional information

No response

bo0tzz commented 4 months ago

Are the background jobs still running?

diyoyo commented 4 months ago

Are the background jobs still running?

I haven't checked from within the containers whether a process is hanging, all I can say is that I always wait for all jobs to be at 0 to start my "tagging sessions".

mertalev commented 4 months ago

Why is a face unassigned in the first place?

If it's unassigned, it means it a) has fewer similar faces than the min recognized faces setting and b) none of the faces it matched are associated with a person either.

Why are some faces detected afterwards that were not detected the first time?

Facial recognition happens in two phases. The first phase is meant to be fast and the second phase runs at night to make it more complete. Running "missing" essentially ran the second phase. I can see why this would be confusing after you just set things up; there's room for improvement here.

Edit: this doesn't apply as much if you run All for facial recognition. It should be close to finalized in this case, with maybe a few more faces that could get recognized in a Missing job.

Are there some threshold that I meet only after I reduced the number of unknown people?

No, there's nothing like this.

diyoyo commented 4 months ago

If it's unassigned, it means it a) has fewer similar faces than the min recognized faces setting and b) none of the faces it matched are associated with a person either.

Thanks for the explanation @mertalev . Then I don't get why these faces had been unassigned, since my min recognized faces is 1.

Let's take the example of yearly classroom pictures of my mom's childhood. The lighting is good and everyone shows up with kinda "equal opportunity" to get their face detected, since they're facing the camera and are all still. Well, I started hiding all the other classmates, since my mom had not been detected. Basically I was hoping that she would get detected during the "Missing" run. I don't know for how long she actually had been "unassigned", but her cluster is pretty well populated so I don't see any reason for the recognition to miss her...

I was starting to build the hypothesis that if too many people are hidden on a picture, then the leftovers would be "unassigned" as well, but your reply seems to discard this hypothesis.

mertalev commented 4 months ago

Clustering depends on a ton of variables. In this case it'd depend on how many photos of her child self are in the library, how many of those were associated with her, what the lighting, angles, colors and resolutions were for those faces, etc. All of that can affect how similar it thinks a face is, and if it's further than the distance threshold it won't be a match.
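For intuition, that similarity check boils down to a cosine distance between two face embeddings. A minimal sketch of what such a comparison looks like in SQL, assuming the embeddings live in an "embedding" column on asset_faces and that <=> is the cosine distance operator exposed by the vector extension (both are assumptions, not confirmed in this thread):

-- Hypothetical: compare two specific faces by cosine distance.
-- A face only counts as a match if this distance is below the recognition threshold.
SELECT f1."id" AS face_a,
       f2."id" AS face_b,
       f1."embedding" <=> f2."embedding" AS cosine_distance
FROM asset_faces f1, asset_faces f2
WHERE f1."id" = '<face-uuid-1>'
  AND f2."id" = '<face-uuid-2>';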

diyoyo commented 4 months ago

Regardless of this, given your previous reply, how can a face be unassigned if my min recognized faces is 1? It should just be considered yet another unnamed person, shouldn't it?

mertalev commented 4 months ago

Was it set to 1 from the start? Or did you change it to 1 afterwards? It should make a person for her as long as facial recognition ran with the new setting.

mertalev commented 4 months ago

Looking at the code, I have a hunch for what might have happened in this case.

First, can you share the output of docker exec immich_postgres psql -U postgres -c "SELECT * FROM pg_vector_index_stat;" immich?

diyoyo commented 4 months ago

Was it set to 1 from the start? Or did you change it to 1 afterwards? It should make a person for her as long as facial recognition ran with the new setting.

That might be an explanation. There have been so many changes in Immich over the past 10 months, and I have made so many more to the organisation of my library...

I am pretty sure min=1 was never a default option, so you're right, the likelihood that I changed it after some faces had already been unassigned is high. The likelihood that the web UI for unassigned faces had not yet been implemented back then is also high, which would explain why I missed this icon in the info pane in the first place.

I'll share the output of the SQL query in a minute; my RPi is hanging while pulling the latest Docker image.

diyoyo commented 4 months ago

docker exec immich_postgres psql -U postgres -c "SELECT * FROM pg_vector_index_stat;" immich

 tablerelid | indexrelid |  tablename   | indexname  | idx_status | idx_indexing | idx_tuples | idx_sealed | idx_growing | idx_write | idx_size  |                                                                                                                                             idx_options                                                                                                                                              
------------+------------+--------------+------------+------------+--------------+------------+------------+-------------+-----------+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     143885 |    1133821 | smart_search | clip_index | NORMAL     | f            |     200555 | {200555}   | {}          |         0 | 439014872 | {"vector":{"dimensions":512,"distance":"Cos","kind":"F32"},"segment":{"max_growing_segment_size":20000,"max_sealed_segment_size":1000000},"optimizing":{"sealing_secs":60,"sealing_size":1,"optimizing_threads":2},"indexing":{"hnsw":{"m":16,"ef_construction":300,"quantization":{"trivial":{}}}}}
      16913 |    1420991 | asset_faces  | face_index | NORMAL     | t            |     353293 | {353270}   | {1,22}      |         0 | 771715520 | {"vector":{"dimensions":512,"distance":"Cos","kind":"F32"},"segment":{"max_growing_segment_size":20000,"max_sealed_segment_size":1000000},"optimizing":{"sealing_secs":60,"sealing_size":1,"optimizing_threads":2},"indexing":{"hnsw":{"m":16,"ef_construction":300,"quantization":{"trivial":{}}}}}

Also, is there a way to un-unassign all unassigned faces? I looked for a field in the "person" table, but could only find isHidden. And I couldn't find anything in the asset_faces fields either. Thanks.

mertalev commented 4 months ago

To clarify, it should still apply the new setting on unassigned faces when you run a Missing job and for new assets (and of course if you re-ran Facial Recognition on all assets).

My hypothesis is that the vector index made a boo-boo and didn't return any results for a search (i.e. it didn't even match the face against itself, which should never happen unless the index misses it). The code doesn't handle this case, so it checks whether 0 matches is greater than or equal to 1 (the minimum faces setting), finds that it isn't, and marks the face as unassigned instead of creating a new person.

Try running REINDEX INDEX face_index;, wait for it to finish (check in the pg_vector_index_stat query that idx_indexing is f, or just watch CPU usage) and run a Missing job. If I'm right, this will probably change the result. If not, dropping the index would certainly confirm whether it's an index issue, but the facial recognition would take longer.
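In case it helps, a narrower version of the stat query above makes it easy to poll for that (a sketch reusing the same pg_vector_index_stat view, just filtered to the face index):

-- Re-run until idx_indexing shows 'f' for face_index
SELECT indexname, idx_status, idx_indexing, idx_tuples
FROM pg_vector_index_stat
WHERE indexname = 'face_index';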

diyoyo commented 4 months ago

Ok, I will try in a moment. There are two points I'd like to raise first:

  1. As we speak, I realize I must confess that I did some pretty crazy migration last month:
    • I wanted to move a few of my external library files to a different folder.
    • To be on the safe side, I duplicated them in the new location.
    • On Immich, I created a new external library with the new path.
    • Then, with the help of ChatGPT, I ran a ("badass", imho) query to automatically assign faces and albums to the new assets. Then I removed the old external library.
    • Well, actually, it didn't go as smoothly as described, because for some of them I didn't have the luxury of copying the files due to space limits. And for others, I didn't do the steps in the right order, I guess.

Do you think it could have affected this issue? All I know so far is that it messed up the metadata of some files: the bounding box of some faces is correct in the web UI, but the thumbnail is taken from the reversed aspect ratio, so you don't see the real face for unnamed people, and the small pic preview for that person also has a reversed aspect ratio. Yet the max-size pic preview has a correct aspect ratio. Anyway, that's for another GitHub issue.

  2. Before I run the REINDEX: what would be your definition of "this will probably change the result"? As we speak, the results change every time I run "Missing", with new people always getting discovered. Should we agree on some metrics before I run the test?

So far, my favorite tracker is the following:

\out 240613_beforeScreening.txt
SELECT p."name", COUNT(1) as c 
FROM person p
INNER JOIN asset_faces af
        ON af."personId"=p."id"
WHERE p."name"<>''
GROUP BY p."name"
ORDER BY c DESC, p."name"; 
\out

and then

colordiff -u 240613_beforeScreening.txt 240613_afterScreening.txt

And if I want to get a number that closely matches the "Total number of people" of the webUI, I run:

SELECT COUNT(1) FROM 
(
    SELECT p."id", COUNT(1) as c 
    FROM person p
        INNER JOIN asset_faces af
               ON af."personId"=p."id" 
     WHERE p."isHidden"='f' 
              AND p."thumbnailPath"<>'' 
    GROUP BY p."id"
    ORDER BY c DESC
) as q
WHERE q."c" > 0;

But surprisingly, I was not able to match exactly the number that is displayed in the UI (and I have been too lazy to read the source code so far).

mertalev commented 4 months ago

Do you think it could have affected this issue?

It depends on the query, but it certainly could.

Should we agree on some metrics before I run the test?

The ones you shared are fine. Another one that would be nice to look at is the number of faces grouped by person. Something like

SELECT af."personId", COUNT(*) AS face_count
FROM asset_faces af
GROUP BY af."personId"
ORDER BY face_count DESC;

That should include unassigned faces too (in the form of null).

diyoyo commented 4 months ago

Ok, then I will measure this as well before and after.

So, just to be clear and maximise the value of the experiment, I will:

Correct?

mertalev commented 4 months ago

Yup, but no need to run it for face detection.

diyoyo commented 4 months ago

Alright. Well, there have been a lot of changes. Basically the range is way bigger than in the past.

Some more context: two weeks ago, I did a big tagging/hiding session and went from 10k people to less than 6k. After a new facial recognition run, it went back up to 20k. That was the biggest jump. But as I kept reducing the available pool of people, of course the jumps got smaller and smaller. This morning, I reached the 4000 faces milestone, and it went up to 4295 before we had this conversation. So, small jumps, and very few edits to the number of pics per tagged person.

Reindexing Results

  1. Query 1

cat 240613_beforeScreening_2218.txt | wc -l
1378 (yes, I know, that's a lot, but I like to tag famous people that are detected on the newspapers or the tv screens 🤣🤣)

diff 240613_beforeScreening_2218.txt 240613_afterScreening_2317.txt | grep "@@" | wc -l
44 (with pretty big change blocks at the top of the pyramid)

  2. Query 2

diff 240613_beforeScreening_count_2219.txt 240613_afterScreening_count_2318.txt
    count 
    -------
    -  4295
    +  8673
    (1 row)

And from the web UI: from 4265 to 8643 (I don't know why there is this 30-people diff)

  3. Query 3

cat 240613_beforeScreening_withUnassigned_2220.txt | wc -l
22133
cat 240613_afterScreening_withUnassigned_2318.txt | wc -l
26517
diff 240613_beforeScreening_withUnassigned_2220.txt 240613_afterScreening_withUnassigned_2318.txt | grep "@@" | wc -l
216

  4. Conclusions

I'll let you draw the conclusions. tbh, I'm not sure whether the impact comes from the REINDEXING, or whether some magic happens when I reach a low enough amount of unnamed people (meaning: maybe at this point, my device resources are sufficient for the ML to discover more blurry people, hence the leap), but this is pure speculation.

My main concerns are:

EDIT: diff is an alias for colordiff -u

diyoyo commented 4 months ago

Another query, which I use to focus first on the pics that need the most hiding, is the following:

SELECT a."originalPath", COUNT(1) as c FROM assets a
INNER JOIN asset_faces af
ON af."assetId"=a."id"
INNER JOIN person p
ON p."id"=af."personId"
WHERE p."isHidden"='f' AND p."name"='' GROUP BY a."originalPath"
ORDER BY c DESC
LIMIT 50; 

Before the reindexing, the pic with the biggest amount of unnamed people had about 24 and the count fell down to 14 afterward, and quickly to 4. Now, the max number is 31 and I'm still at 9 after reaching the LIMIT 50.

mertalev commented 4 months ago

So the number of recognized people doubled? Wow, that's actually pretty remarkable. Did that photo of your mom get recognized now? Also, can you re-run the pg_vector_index_stat query?

diyoyo commented 4 months ago

I had fixed that picture of my mom already this morning, after discovering that the unassigned icon was actually hiding pretty interesting faces.

After the reindexing ended (I ran it directly in a bash session, so it was a blocking action), it still took a little bit of time before the f value appeared for face_index

Now, since my previous message, I already started hiding again the pics with most unnamed people, and I'm happy because it really seems like everyone is recognized on the picture. So maybe you solved it by making me REINDEX. Maybe my tagging sessions were forcing the reindexing of only a selected set of pics, which would explain why it would find more people afterwards on these pics, but not all pics?

The result of the pg_vector_index_stat query is:

 tablerelid | indexrelid |  tablename   | indexname  | idx_status | idx_indexing | idx_tuples | idx_sealed | idx_growing | idx_write | idx_size  |                                                                                                                                             idx_options                                                                                                                                              
------------+------------+--------------+------------+------------+--------------+------------+------------+-------------+-----------+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     143885 |    1133821 | smart_search | clip_index | NORMAL     | f            |     200557 | {200557}   | {}          |         0 | 439029072 | {"vector":{"dimensions":512,"distance":"Cos","kind":"F32"},"segment":{"max_growing_segment_size":20000,"max_sealed_segment_size":1000000},"optimizing":{"sealing_secs":60,"sealing_size":1,"optimizing_threads":2},"indexing":{"hnsw":{"m":16,"ef_construction":300,"quantization":{"trivial":{}}}}}
      16913 |    1420991 | asset_faces  | face_index | NORMAL     | f            |     151468 | {151468}   | {}          |         0 | 339634152 | {"vector":{"dimensions":512,"distance":"Cos","kind":"F32"},"segment":{"max_growing_segment_size":20000,"max_sealed_segment_size":1000000},"optimizing":{"sealing_secs":60,"sealing_size":1,"optimizing_threads":2},"indexing":{"hnsw":{"m":16,"ef_construction":300,"quantization":{"trivial":{}}}}}
(2 rows)

Could you please explain what value you're looking at in this `pg_vector_index_stat` query? I'd like to understand more.

diyoyo commented 4 months ago

Good news:

Bad news:

mertalev commented 4 months ago

Could you please explain what value you're looking at in this pg_vector_index_stat query? I'd like to understand more.

The idx_tuples field is the most interesting. Because of how Postgres handles concurrency, changing a row actually creates another copy of that row and inserts it again into each index.

Before you reindexed, you had over twice as many rows in the vector index as you do now, likely because of assigning and reassigning people. Since more than half of the whole index was duplicates, this damaged the structure of the graph and contributed to lower recall.
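To put rough numbers on that bloat, one option (a sketch built only on the table and stat view already shown in this thread) is to compare the live row count with what the index tracks:

-- A large gap between these two numbers suggests the index is carrying
-- dead/duplicate tuples left behind by row updates
SELECT
  (SELECT COUNT(*) FROM asset_faces) AS live_faces,
  (SELECT idx_tuples FROM pg_vector_index_stat
    WHERE indexname = 'face_index') AS indexed_tuples;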

just found out that there are still unassigned faces on some pics, which I still don't understand knowing that min_cluster_size=1.

For more science, you can try dropping the index with DROP INDEX face_index and re-running. This will give you perfect, exact recall.

diyoyo commented 4 months ago

Thanks for everything. I don't know whether, at the end of the day, this qualifies as a bug or not, but in the meantime, I'll reindex now and then, and hopefully I won't go crazy after each big jump in the number of persons left to process.

mertalev commented 4 months ago

No problem! If you do try to run it without an index, do let me know the results. The difference will show the quality of the index in general. If there’s a big difference, it might be worth tuning the index settings for higher quality.

diyoyo commented 4 months ago

I won't try it soon. This really was exhausting, I need a rest. But I'll let you know if I do.

Just so you know, the cron job started and, surprisingly (well, you might say it's the normal behavior), while the Library Tasks became green, counting down to 0, the "Face Detection" and "Facial Recognition" jobs never started (or finished so quickly that I did not notice). In the past two weeks, I would systematically see them at work at cron time.

So, in addition to having my closest people back on track in the top list of people to tag (yep, another positive side effect of the reindexing), you might also have prevented my low-cost SSD from dying too soon 👍

diyoyo commented 4 months ago

@mertalev It's a bit out-of-topic but still related, and I'm not sure it deserves a new issue, so I'll post it here. Now that the number of faces does not increase all the time, I'm still surprised that no clustering is actually happening anymore. My goal was to display all the faces so I could balance the sizes of the clusters as much as possible, so it would weigh more during the clustering and finally attract some people that are less represented. (I'm not sure this is 'clear', so I'll rephrase: I wanted to manually pick pictures of less-represented people to grow clusters and give the ML algo more data for the smaller classes). But it feels like I'll have to cherry pick all these rare pics, as I don't see any recognition really happening.

I'm not a conspiracy theorist, but I easily make hypotheses 🤣. Please debunk the following (and if possible, point me to the piece of code that would help me understand):

These questions may sound weird, but I'm really confused here. Sorry.

diyoyo commented 4 months ago

I think a cool feature would be an "Are these people the same?" prompt. Or a manualTag field in asset_faces that would preserve the cluster when performing a full re-clustering. Or an option to fully re-cluster, except for named persons.

mertalev commented 4 months ago

The Recognition job does not consider all possible "named" classes when going through all the unnamed faces. There is a threshold. True or False?

For each unassigned face, it searches for similar faces within the distance threshold. Among those matches, it will try to find a face with an associated person. The person of the most similar face with a person will be chosen when there are multiple matches.

or... the recognition job does not go through unnamed people, it goes through unclustered faces. Since my min=1, it means that if a face bounding box has not been attributed to an existing+named cluster in the first place, it will create a new+unnamed cluster (i.e. person), but will never try to re-cluster that face ever again, since it already sits in a cluster of size 1. True or False?

The "missing" job will go through faces without an assigned person. If a face has a person assigned, then it won't be queued anymore. There's no point in queueing it since it already has a match. Other faces that haven't been recognized are still queued and can match against the person of that face if they're similar enough.

diyoyo commented 4 months ago

Ok, I'll interpret this as "False" and "True", then. True, meaning that when min-cluster-size=1, there should be no unassigned face after the first round. So a face in a cluster of 1 has no chance of being reclustered anymore.

Then, I'll start a new "Feature Request" after making my own tests:

diyoyo commented 4 months ago

Would the following be enough to trigger a re-classification (after hitting "Recognition: Missing") of the currently unnamed people with cluster size=1? Should I edit asset_job_status too?

-- Start a transaction
BEGIN;

-- Collect the ids of the unnamed single-face clusters in a temp table so that
-- both statements below can reference them
CREATE TEMP TABLE persons_to_delete ON COMMIT DROP AS
    SELECT p."id"
    FROM person p
    JOIN asset_faces af ON p."id" = af."personId"
    WHERE p."name" = ''
    GROUP BY p."id"
    HAVING COUNT(af."id") = 1;

-- Detach the faces ("personId" is a uuid column, so it is cleared with NULL
-- rather than an empty string)
UPDATE asset_faces
SET "personId" = NULL
WHERE "personId" IN (SELECT "id" FROM persons_to_delete);

-- Delete the now-empty records in person
DELETE FROM person
WHERE "id" IN (SELECT "id" FROM persons_to_delete);

-- Commit the transaction
COMMIT;

mertalev commented 4 months ago

Can I ask what you're trying to achieve with this first? Are you looking to re-run it with different settings or something?

diyoyo commented 4 months ago

Well, I used min-cluster-size=1 to be able to go pick rare faces of persons when they were younger, etc. Now that I've done this, I'm pretty sure I missed some pics of the same person at the same age, given the number of pics that I have. Our conversation highlighted the fact that even if I manually grow the cluster of one given person, no re-clustering of these 1-face clusters will happen, because the job is considered done by definition. So now that I've done a lot of haystack-needle picking already, I just want to remove all those 1-face clusters (more than 3500) that have no name, re-run the recognition and hope that some of them will fall within the distance threshold of the labelled clusters.

To answer your question, I'm not changing the settings; I've already changed one parameter: the starting point, i.e. the content of the labelled clusters, which will affect the distance calculation. I just need artificial help to finish the job and find the leftovers.

mertalev commented 4 months ago

The queueing happens for faces, not people. Any faces that haven't been recognized will continue to be queued and can be matched with those 1-face people. There's nothing to gain from removing those people and re-running with the same settings. They wouldn't be in their own cluster if they could match those other clusters to begin with.

The only gain I suppose would be that the index should be cleaner now than before, so perhaps some of those faces could match existing people.

diyoyo commented 4 months ago

The queueing happens for faces, not people. Any faces that haven't been recognized will continue to be queued and can be matched with those 1-face people. There's nothing to gain from removing those people and re-running with the same settings.

I'm sorry, I'm very confused. I understand one thing and its opposite when reading your answer. Please forgive my English, I guess it adds some confusion too.

Let's try again: You're saying "Any faces that haven't been recognized" ... well, if a singleton cluster exists, it means that the face associated with it has been recognized, and hence, labelled or not, it won't be queued anymore. So keeping the association between the face and the cluster, to my understanding, is what qualifies the face as "in the queue" or "not in the queue", right? Therefore the above SQL query, where I delete the link between a face and a singleton cluster, and, just to clean things up, delete the cluster itself.

Where have I lost myself?

diyoyo commented 4 months ago

They wouldn't be in their own cluster if they could match those other clusters to begin with.

Ok, I guess this is where my reasoning falls apart. I was assuming that the added diversity in the clusters I've manually grown (by picking up singletons and merging them) would change the distance between the leftover singletons and those manually-grown clusters.

But without knowing how the distance is calculated, and how the metric is derived from the components of the clusters, I guess I was just dreaming out loud :)

mertalev commented 4 months ago

The distance stays the same, so if they didn't match something then it wouldn't "normally" change anything to re-run it. But the indexing aspect does make it likely that at least some of those single face clusters would end up matching something after all.

To answer your question, the queries look fine to me. The UPDATE asset_faces is probably unnecessary since deleting the person will cascade that change to the asset_faces table as well. I'd also make a backup first to be safe.
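If you'd rather verify the cascade than trust it, the constraint definitions can be inspected directly (a standard Postgres catalog query, nothing Immich-specific):

-- Show the foreign keys on asset_faces and their ON DELETE behavior
SELECT conname, pg_get_constraintdef(oid) AS definition
FROM pg_constraint
WHERE conrelid = 'asset_faces'::regclass
  AND contype = 'f';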

diyoyo commented 4 months ago

The UPDATE asset_faces is probably unnecessary since deleting the person will cascade that change to the asset_faces table as well. I'd also make a backup first to be safe.

Yes, the UPDATE was there out of laziness (simpler than double-checking that all the cascading is properly implemented 🤣, sorry 🙈).

The distance stays the same

This is where you need to educate me (maybe you did already and I missed the point): how come? Why doesn't adding elements to a cluster affect the distance between the cluster and the outside faces?

mertalev commented 4 months ago

The distance is always between individual faces. Clustering can affect which person a face gets assigned, not whether it will be assigned a person.

diyoyo commented 4 months ago

Ok, thanks for your patience. I understand now, and I feel pretty stupid about it since I've built my own models in the past, just not on pictures with a personal aspect. It's easier to abstract those concepts when you don't have faces you easily recognize in front of you. I guess I proved "I'm not a bot".

I think it was mentioned in a different thread that the dream would be to use the birth date and the pic timestamp to extrapolate the face (and generate extra embeddings?) at different ages. To add more complexity, one could guess the quality of the picture given the timestamp or some manual metadata (for scanned pics) and apply dynamic thresholds. Or a camera model classifier... Ok, too much already.

mertalev commented 4 months ago

I think we would need a more robust testing environment for facial recognition before making those kinds of inferences. Making facial recognition better for one library is one thing, but it can backfire for other libraries if you aren't careful.

A small change that was mentioned that I agree with is to order images by date in descending order when queueing facial recognition. The idea is to guide it through the transition from childhood to adulthood. By ordering it, it can gradually expand to include faces of a different age instead of failing to recognize a face and creating a new person. Queueing in descending order (newest first) should be best since adult faces are more distinguishable.
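As a rough illustration of that ordering (a sketch only: the real queueing happens in the server code rather than in SQL, and the "fileCreatedAt" column name on assets is an assumption):

-- Queue unrecognized faces newest-first, so adult faces anchor the clusters early
SELECT af."id"
FROM asset_faces af
INNER JOIN assets a ON a."id" = af."assetId"
WHERE af."personId" IS NULL
ORDER BY a."fileCreatedAt" DESC;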

diyoyo commented 4 months ago

So, at the end of the day, I ran the following query. The "isArchived" condition may have been the reason why I had this difference of 30 people compared to the number at the top of the People tab. (I still have an offset of 3, though.)

DELETE FROM person
 WHERE "id" IN (
    SELECT p."id"
    FROM person p
        INNER JOIN asset_faces af
               ON af."personId"=p."id" 
        INNER JOIN assets ass
               ON ass."id"=af."assetId"
     WHERE p."isHidden"='f' 
              AND p."thumbnailPath"<>'' 
              AND ass."isArchived"='f'
              AND p."name"=''
    GROUP BY p."id"
    HAVING COUNT(af."id")=1
    );

Basically, in the previous query, I had included the hidden people too, which would have been stupid, since I manually hid those people. This query keeps all my work and resets the rest (I guess the Count=1 is not really necessary, but I wanted to have a real look at count>1).

Anyways, this made me go from 4887 persons to 1599, with approximately 1350 named people.

I ran all the jobs in "Missing" mode (and btw, there may be the same indexing problem for transcoding video as the one you fixed for faces, since the queue is always about the same size as the amount of videos)

And at the end of the jobs, I reached 4239 People. So this manoeuvre removed 600 faces from my workload, in theory, if the clustering was good. After all the discussion, I believe that the explanation is the one you provided already:

The only gain I suppose would be that the index should be cleaner now than before, so perhaps some of those faces could match existing people.

mertalev commented 4 months ago

Thanks for sharing your results! 600 of those people being recognized now is pretty interesting. I made a PR that fixes this indexing issue so it doesn't get any duplicate embeddings. I knew that had an effect on recall, but I never realized it was this dramatic.

diyoyo commented 4 months ago

Wait for it, I just realized that there was a hidden cluster that was getting all the "attention": it has about 500 faces and they clearly are from plenty of different persons. Since I excluded hidden clusters and clusters with more than 1 face, this was present already before the "little" experiment of tonight. I'm going to unhide the top 3 hidden clusters and remove "HAVING COUNT" from my query. Let's see...

diyoyo commented 4 months ago

So, when looking at the top hidden clusters: only one of them was really a mess, with plenty of different people. The other clusters were real clusters of people I just do not care about.

Before DELETING: people count = 4245
Result of the query: DELETE 2795
Post DELETING: people count = 1450

Post "Recognition: Missing": people count = 2278

Looking at the top unnamed clusters (hidden or not): the previous cluster of 500 no longer exists, but a new one with 105 photos seems to be its successor, as it is made of plenty of different persons.

So this time, 2000 faces have been either placed in a cluster (or unassigned?). Looking at the labelled clusters, there have been some changes, but not to that extent. I doubt that the explanation is as easy as: these faces have been scattered into very small clusters (2-3 people)...

to be continued...

diyoyo commented 4 months ago

I think every time I perform a query that writes to the DB, I should do a REINDEX afterward, because this thing with the drop from 4245 to 2278 was clearly an artifact: many people had been unassigned. After reindexing + recog:Missing, we're back to 4377.

mertalev commented 4 months ago

Interesting! TBH, dropping the index would arguably make more sense in your case, at least compared to constant reindexing.

mxm199 commented 4 months ago

Good day to everyone. I came across your curious topic, and an interesting question arose: is it possible to export a list of photos from the database that have an unassigned face?

mertalev commented 4 months ago

Yes, querying this will give you a list of asset ids and their paths (internal to the container): SELECT DISTINCT ON (a.id) a.id, a."originalPath" FROM assets a INNER JOIN asset_faces af ON a.id = af."assetId" WHERE af."personId" IS NULL;.

mxm199 commented 4 months ago

Thanks! I forgot that I need to log in to the database first :) Maybe someone else will need it, so here is how to write the list to a file, which can then be processed manually (unless of course the number of faces is small):

\COPY (SELECT DISTINCT ON (a.id) a.id, a."originalPath" FROM assets a INNER JOIN asset_faces af ON a.id = af."assetId" WHERE af."personId" IS NULL) TO '/tmp/list.csv' WITH CSV HEADER;