Empty similarity.csv and outliers.csv in some cases

kaane8520 commented 1 year ago

I have two datasets of images, A and B, and I would like to compare them:

1. get the most similar images from set A to images in set B, and
2. get the most similar images from set B to images in set A.

When I call the fastdup.run() method with the run_mode argument set to 3:

- with input_dir set to the A dataset location and test_dir set to the B dataset location, everything is OK, but
- with input_dir set to the B dataset location and test_dir set to the A dataset location, I get an empty similarity.csv and outliers.csv.

I have tried different values for nearest_neighbors_k and threshold but it didn't help.

Could you help me find the problem?

dbickson commented 1 year ago

Hi @kaane8520, thanks for reaching out! Your ask is very clear. In our terminology, you would like to compare the train and test dirs such that relations are computed only between train and test, not within each set.

When you run fastdup on the train dataset only, it builds a nearest neighbor model and stores it in the work_dir. Next, run again with the same input_dir and work_dir, set run_mode=3, and point test_dir to a new set of images; the new images are then compared against the train set. A detailed explanation is found here: https://github.com/visual-layer/fastdup/blob/main/RUN.md#resume

If you want to run the opposite direction, you need to run fastdup again with the default run_mode, with input_dir pointing to the former test_dir and a clean work_dir, and then run with run_mode=3 where test_dir points to the training data.
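To make the order of calls concrete, here is a minimal sketch of the two-stage flow in both directions. It only uses the fastdup.run() arguments mentioned in this thread (input_dir, work_dir, test_dir, run_mode); the paths and helper names are hypothetical, and the actual call is wrapped in a function so nothing runs until you invoke it.

```python
def two_stage_calls(index_dir, query_dir, work_dir):
    """Keyword arguments for the two fastdup.run() calls: build the model, then compare."""
    # Stage 1: default run_mode, builds the nearest neighbor model in work_dir.
    build = {"input_dir": index_dir, "work_dir": work_dir}
    # Stage 2: same input_dir and work_dir, run_mode=3, test_dir points to the other set.
    compare = {
        "input_dir": index_dir,
        "work_dir": work_dir,
        "test_dir": query_dir,
        "run_mode": 3,
    }
    return [build, compare]


def run_calls(calls):
    """Execute the prepared calls in order (requires fastdup to be installed)."""
    import fastdup  # imported lazily so the sketch can be read without fastdup

    for kwargs in calls:
        fastdup.run(**kwargs)


# Hypothetical paths: A as train and B as test, then the opposite direction
# with its own clean work_dir.
a_vs_b = two_stage_calls("/data/A", "/data/B", "/work/a_vs_b")
b_vs_a = two_stage_calls("/data/B", "/data/A", "/work/b_vs_a")
```

Calling run_calls(a_vs_b) and then run_calls(b_vs_a) would perform the four runs; the separate work_dir per direction reflects the "clean work_dir" advice above.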

However, since the metric is symmetric, the most similar images from set A to B are the same from B to A, so you can use the same output similarity.csv file you got.
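If you go the reuse route, swapping the two filename columns is enough to read the same pairs from the other side. The sketch below assumes similarity.csv has the columns from, to and distance (please check the header of your own file), and the sample rows are made up for illustration.

```python
import csv
import io


def reverse_pairs(rows):
    """Swap 'from' and 'to' in each row; the distance stays the same for a symmetric metric."""
    return [
        {"from": row["to"], "to": row["from"], "distance": row["distance"]}
        for row in rows
    ]


# Made-up sample standing in for a real similarity.csv (column names are an assumption).
sample = "from,to,distance\nA/1.jpg,B/7.jpg,0.97\nA/2.jpg,B/3.jpg,0.91\n"
a_to_b = list(csv.DictReader(io.StringIO(sample)))
b_to_a = reverse_pairs(a_to_b)
```

For a real file, replace io.StringIO(sample) with an open file handle on similarity.csv in your work_dir.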

A parameter to play with is threshold. If the two sets are very different, you may get no output. You can try running with threshold=0 so that no similarities are removed from the output; you will then also get low similarities.
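A quick way to act on this is to check whether the output file has any data rows and, if not, prepare a rerun with threshold=0. This is only a sketch: count_rows and with_zero_threshold are hypothetical helpers, and the demo file is a header-only stand-in for an empty similarity.csv.

```python
import csv
import os
import tempfile


def count_rows(csv_path):
    """Number of data rows (header excluded) in a CSV file; 0 if the file is missing."""
    if not os.path.exists(csv_path):
        return 0
    with open(csv_path, newline="") as f:
        return sum(1 for _ in csv.DictReader(f))


def with_zero_threshold(run_kwargs):
    """Copy of the fastdup.run() keyword arguments with threshold set to 0."""
    return {**run_kwargs, "threshold": 0}


# Demo on a header-only file, which stands in for an empty similarity.csv.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "similarity.csv")
    with open(path, "w", newline="") as f:
        f.write("from,to,distance\n")
    empty_rows = count_rows(path)

retry = with_zero_threshold({"input_dir": "/data/B", "test_dir": "/data/A", "threshold": 0.9})
```

If count_rows reports 0 for your real similarity.csv, passing the retry arguments to fastdup.run() should show whether the threshold was filtering everything out.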

dbickson commented 1 year ago

BTW, I forgot to mention that it is possible to simplify this into a single run if you have both the train and test sets available. In this case, just run with input_dir pointing to the train set and test_dir pointing to the test set, and the relationships are computed only between train and test.
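As a rough sketch of the single-run variant: both directories are passed in one call and run_mode is left at its default. The paths and the helper name are hypothetical; only input_dir, test_dir and work_dir from this thread are used.

```python
def single_run_kwargs(train_dir, test_dir, work_dir):
    """Keyword arguments for one fastdup.run() call covering both sets."""
    # No run_mode here: the default is used, as described above.
    return {"input_dir": train_dir, "test_dir": test_dir, "work_dir": work_dir}


def run_single(kwargs):
    """Execute the single run (requires fastdup to be installed)."""
    import fastdup  # imported lazily so the sketch can be read without fastdup

    fastdup.run(**kwargs)


# Hypothetical paths for datasets A (train) and B (test).
single = single_run_kwargs("/data/A", "/data/B", "/work/single")
```

Calling run_single(single) would then write the train-vs-test results into the given work_dir.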

The reason you may want to defer is that sometimes test data comes later and you want to use a precomputed trained model.

kaane8520 commented 1 year ago

Thank you for your help

visual-layer / fastdup

Empty similarity.csv and outliers.csv in some cases #67