How clustering is done

SirVizzy commented 1 month ago

I'm wondering how the clustering algorithm works in the provided demo. I cannot see it as the code has been obfuscated. I'd like to create the clusters in the exact same way.

vitali-fedulov commented 1 month ago

When image A is similar to image B, I check whether there is a cluster X to which either A or B was previously added. If so, I add whichever of the two is not yet in that cluster.

If there is no cluster containing either A or B, I create a new cluster Y and add both A and B to it.

Important: I add A or B to only one cluster. Hypothetically, though, an image may be similar to images in several clusters, and those additional clusters are left unchecked. So the method assumes there are no "bridges" between clusters in N-dimensional space, meaning the clusters are well separated from each other. With this assumption an image can belong to only one cluster. This introduces a kind of competition between clusters: the first one checked during iteration "swallows" the image, while potential others "stay hungry". I have sometimes observed this "unfulfilled hunger". But allowing an image into multiple clusters makes later visual analysis impractical, and if there are many similar images, it can overload computer memory.
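
The steps described above could be sketched roughly like this in JavaScript. This is not the demo's actual code. The function name `clusterPairs` and the input format (a list of `[idA, idB]` pairs already judged similar) are assumptions for illustration. Leaving both images where they are when each already belongs to a cluster is one reading of the "only one cluster" rule, not something confirmed in this thread.

```javascript
// Greedy clustering where every image ends up in at most one cluster.
// `pairs` is a hypothetical input: an array of [idA, idB] pairs that some
// similarity check has already marked as similar.
function clusterPairs(pairs) {
  const clusters = [];          // each cluster is an array of image IDs
  const clusterOf = new Map();  // image ID -> index of its cluster

  for (const [a, b] of pairs) {
    if (a === b) continue;      // ignore an image paired with itself

    const hasA = clusterOf.has(a);
    const hasB = clusterOf.has(b);

    if (!hasA && !hasB) {
      // Neither image is clustered yet: start a new cluster with both.
      clusterOf.set(a, clusters.length);
      clusterOf.set(b, clusters.length);
      clusters.push([a, b]);
    } else if (hasA && !hasB) {
      // A is already clustered: B joins A's cluster.
      const i = clusterOf.get(a);
      clusters[i].push(b);
      clusterOf.set(b, i);
    } else if (!hasA && hasB) {
      // B is already clustered: A joins B's cluster.
      const i = clusterOf.get(b);
      clusters[i].push(a);
      clusterOf.set(a, i);
    }
    // Both already clustered (assumed behavior): nothing changes and the
    // clusters are not merged, so a "bridge" between them is ignored.
  }
  return clusters;
}

// Usage: the pair ["img2", "img3"] bridges two clusters and is skipped.
console.log(clusterPairs([
  ["img1", "img2"],
  ["img3", "img4"],
  ["img2", "img3"],
  ["img5", "img2"],
]));
// → [ [ 'img1', 'img2', 'img5' ], [ 'img3', 'img4' ] ]
```

Because the first cluster to claim an image keeps it, the result depends on the order in which pairs are processed.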

SirVizzy commented 1 month ago

Thank you, I was more or less able to figure out how it works by de-obfuscating the provided JavaScript. Thanks for the clarification. I'm sorting the images by creation date from the EXIF metadata, so that images taken close together are more likely to be matched as similar (which should work well). Will close this now.
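
For anyone reading later, sorting by EXIF creation date could look roughly like the sketch below. It assumes each image record already carries the EXIF date as a string in the standard `YYYY:MM:DD HH:MM:SS` form under a hypothetical `dateTimeOriginal` field; reading the tag from the file needs an EXIF parser and is not shown. The names `parseExifDate` and `sortByCreationDate` are made up for illustration.

```javascript
// Parse an EXIF date string ("YYYY:MM:DD HH:MM:SS") into a millisecond
// timestamp. Returns NaN when the string is missing or malformed.
// The value is treated as UTC, which is fine when it is only used for ordering.
function parseExifDate(s) {
  const m = /^(\d{4}):(\d{2}):(\d{2}) (\d{2}):(\d{2}):(\d{2})$/.exec(s || "");
  if (!m) return NaN;
  const [, y, mo, d, h, mi, sec] = m.map(Number);
  return Date.UTC(y, mo - 1, d, h, mi, sec);
}

// Return a new array sorted oldest first. Images without a usable date
// go to the end. `dateTimeOriginal` is a hypothetical field name.
function sortByCreationDate(images) {
  return [...images].sort((x, y) => {
    const tx = parseExifDate(x.dateTimeOriginal);
    const ty = parseExifDate(y.dateTimeOriginal);
    if (Number.isNaN(tx) && Number.isNaN(ty)) return 0;
    if (Number.isNaN(tx)) return 1;
    if (Number.isNaN(ty)) return -1;
    return tx - ty;
  });
}
```

Since the clustering is order-dependent, feeding it images in this order means shots taken close together are compared early and tend to claim each other first.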

vitali-fedulov / similar.pictures

How clustering is done #4