[Bug]: fastdup some clusters contain low similarity images

visual-layer / fastdup

fastdup is a powerful free tool designed to rapidly extract valuable insights from your image & video datasets. Assisting you to increase your dataset images & labels quality and reduce your data operations costs at an unparalleled scale.

[Bug]: fastdup some clusters contain low similarity images #302

Closed wangxiong172086864 closed 6 months ago

wangxiong172086864 commented 7 months ago

What happened?

Running fastdup with ccthreshold = 0.97, I found that component 29 contains many low-similarity images. I then computed the cosine similarity within this component using PyTorch, and some image pairs score below 0.97, for example 0.90 and 0.82. So how does ccthreshold work?
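For anyone who wants to repeat this kind of check, here is a minimal sketch of the pairwise comparison described above. It uses plain Python rather than PyTorch so that it stays self-contained. The function names and the toy 2-D vectors are made up for illustration; in practice the vectors would be the feature vectors of the images in one component.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairs_below_threshold(embeddings, threshold):
    """Return (i, j, similarity) for every pair scoring below the threshold."""
    low = []
    for i, j in combinations(range(len(embeddings)), 2):
        sim = cosine_similarity(embeddings[i], embeddings[j])
        if sim < threshold:
            low.append((i, j, sim))
    return low


# Toy 2-D vectors standing in for one component's embeddings (made-up data).
component_embeddings = [
    [1.0, 0.0],
    [0.98, 0.17],
    [0.94, 0.34],
]

low_pairs = pairs_below_threshold(component_embeddings, threshold=0.97)
for i, j, sim in low_pairs:
    print(f"images {i} and {j}: cosine similarity {sim:.3f}")
```

In this toy data, neighbouring vectors are above 0.97 with each other, but the first and last are not, which is the same pattern reported for component 29.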

What did you expect to see?

No response

What version of fastdup were you running on?

fastdup 1.63

What version of Python were you running on?

Python 3.8

Operating System

Ubuntu 20.04 LTS

Reproduction steps

No response

Relevant log output

No response

dbickson commented 7 months ago

Hi @wangxiong172086864, your images are unusual and I am not sure our default embedding is the right fit. Can you try running with model_path='dinov2s' and let us know if the similarity is better?

wangxiong172086864 commented 7 months ago

I tried dinov2s, and the result is still not good. Maybe the embedding is not the cause: I pulled the images from this component out and ran fastdup again on just those images, and got a much better result.

dbickson commented 7 months ago

Hi @wangxiong172086864, fastdup runs an approximation by default, so if you want an exact computation, run with nnf_mode='Flat'. It is not the default because it runs slower and requires more RAM. Assuming your dataset is below a million images, it should run fine.

wangxiong172086864 commented 7 months ago

Danny, thanks. Using nnf_mode='Flat' helps: the similarity of the images within a component is better. Raising the ccthreshold value also gives better results.
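One possible reason a higher threshold helps, beyond the approximation itself, is chaining. The sketch below assumes that components are the connected components of a graph whose edges are image pairs with similarity at or above ccthreshold; the thread does not confirm fastdup's internals, so treat this as an illustration only. The `connected_components` helper and the toy vectors are hypothetical. If A is close to B and B is close to C, all three land in one component even when A and C are below the threshold.

```python
import math


def connected_components(similarity, threshold):
    """Group indices linked by a chain of pairs with similarity >= threshold."""
    n = len(similarity)
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i][j] >= threshold:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy unit vectors at 0, 10 and 20 degrees (made-up data, not fastdup output).
angles = [0, 10, 20]
vectors = [(math.cos(math.radians(a)), math.sin(math.radians(a))) for a in angles]
similarity = [[u[0] * v[0] + u[1] * v[1] for v in vectors] for u in vectors]

loose = connected_components(similarity, threshold=0.97)
strict = connected_components(similarity, threshold=0.99)

print("threshold 0.97:", loose)
print("threshold 0.99:", strict)
print(f"similarity between items 0 and 2: {similarity[0][2]:.3f}")
```

At 0.97 the three items form a single component through the middle item, although the two outer items are below 0.97 with each other. At 0.99 no pair qualifies and each item stays on its own.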

dnth commented 7 months ago

@wangxiong172086864 if you have your own embeddings, extracted using models trained on your dataset, they might work better than the generic models used in fastdup.

Here's an example of how to run fastdup on your own embeddings: https://visual-layer.readme.io/docs/run-on-precomputed-feature-vectors

dbickson commented 7 months ago

Another option is to increase the nearest_neighbor_k parameter, for example to 50. The default is 3.