rhsimplex / image-match

🎇 Quickly search over billions of images

Documentation of way to get complete distance matrix #66

Closed advance512 closed 7 years ago

advance512 commented 7 years ago

Hi there,

I have a set of 4000 images that I want to group into clusters. The images come from various fixed cameras (each may shift very slightly due to wind), some taken during the day and some at night, and they may contain people, dogs, cats, etc. I am trying to cluster them by camera (i.e. each cluster should contain the images taken by one camera).

I'm planning on using HDBSCAN for this: http://hdbscan.readthedocs.io/en/latest/basic_hdbscan.html

I've got image-match running and have made the following modifications to the library to try to get a complete distance matrix:

I have tried setting distance_cutoff of SignatureDatabaseBase() to 1.0, and size of SignatureES() to 4000, but I still seem to get a sparse 4000x4000 matrix.

Is there any easy way to get the full distance matrix?


Also, any hints on when increasing k, N and n_grid gives more precise results?

I also noticed some images contain specific textual labels embedded in the image in the same places (like date/time and camera name). Since these labels aren't big, I'm pretty sure they're mostly ignored here - am I right?

rhsimplex commented 7 years ago

For 4000 images, I would not use the database part of the package. Just use the generate_signature method from the ImageSignature class in image_match/goldberg.py on your images, and then use the normalized_distance over all pairs of signatures to generate your distance matrix.
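A minimal sketch of that approach, not tested against your data. The helper names (`load_signatures`, `l2_normalized_distance`, `distance_matrix`) are made up for illustration. The pure-Python distance is only a stand-in that assumes `normalized_distance` computes ||a - b|| / (||a|| + ||b||), so with real signatures pass the library's own `normalized_distance` instead.

```python
import math
from itertools import combinations


def load_signatures(paths):
    """Compute one signature per image path with image-match."""
    # Imported lazily so the matrix helper below also runs without image-match.
    from image_match.goldberg import ImageSignature

    gis = ImageSignature()
    return [gis.generate_signature(p) for p in paths]


def l2_normalized_distance(a, b):
    """Stand-in distance: ||a - b|| / (||a|| + ||b||).

    Hypothetical helper for plain Python sequences. With real signatures,
    pass ImageSignature.normalized_distance to distance_matrix instead.
    """
    diff = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    total = math.sqrt(sum(x * x for x in a)) + math.sqrt(sum(y * y for y in b))
    return diff / total if total else 0.0


def distance_matrix(signatures, distance=l2_normalized_distance):
    """Return a dense, symmetric n x n matrix of pairwise distances."""
    n = len(signatures)
    matrix = [[0.0] * n for _ in range(n)]
    # Each unordered pair is computed once and mirrored across the diagonal.
    for i, j in combinations(range(n), 2):
        d = float(distance(signatures[i], signatures[j]))
        matrix[i][j] = matrix[j][i] = d
    return matrix
```

For 4000 images this is roughly 8 million pairs, so expect it to take a while in plain Python. The resulting matrix can be converted to a NumPy array and handed to a clusterer that accepts precomputed distances, such as HDBSCAN with metric='precomputed'.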

Roughly speaking, decreasing k and increasing N should give you better results at the expense of lookup speed. Similarly, increasing n_grid should give you longer, and therefore more discerning, signatures. I haven't tested anything but the defaults with any rigor though.
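To give a feel for how n_grid affects signature length, here is a rough back-of-the-envelope sketch. It assumes each point of the n_grid x n_grid grid contributes one value per neighbor comparison, with 8 neighbors by default. `signature_length` is a hypothetical helper, not part of the library.

```python
def signature_length(n_grid=9, neighbors=8):
    """Estimated signature length: one value per grid point per neighbor.

    Assumption for illustration: an n_grid x n_grid grid with `neighbors`
    comparisons at each grid point.
    """
    return n_grid * n_grid * neighbors


for n in (9, 12, 16):
    print(n, signature_length(n))
```

Under that assumption the length grows quadratically with n_grid, so signature generation and each pairwise distance get proportionally more expensive.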

You are correct in that the labels shouldn't make much of a difference. If you have a couple examples of images you expect to cluster, could you post them here so I could advise further?

Closing the issue, feel free to reopen.