visual-layer / fastdup

fastdup is a powerful free tool designed to rapidly extract valuable insights from your image & video datasets, helping you increase the quality of your dataset images & labels and reduce your data operations costs at an unparalleled scale.

Is it possible to look for nearest neighbors for a new image? #156

Closed woctezuma closed 1 year ago

woctezuma commented 1 year ago

Hello,

Thank you for the nice tool! I knew about idealo/imagededup for de-duplication. Your tool seems interesting with more features.

Is it possible to look for nearest neighbors for a new image once the embeddings have been computed for the database of images? I have not found the option in the docs of fastdup.

Here are a few tools which use embeddings (less interesting than DINOv2) for the purpose of image retrieval:

dbickson commented 1 year ago

Hi @woctezuma, thanks for reaching out. Yes, you can find similar images using init_search() and search(), as described in the documentation here: https://visual-layer.readme.io/docs/v02xx-api#search Please try it out and let us know if this works for you.
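
For readers unfamiliar with the idea behind this kind of query, here is a minimal, library-independent sketch of nearest-neighbor lookup over precomputed embeddings. It is not fastdup's implementation: the file names, the three-dimensional vectors, and the helper functions `cosine_similarity` and `nearest_neighbors` are all hypothetical, and cosine similarity is just one common choice of metric.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, database, k):
    """Return the k (similarity, name) pairs most similar to the query."""
    scored = ((cosine_similarity(query, vec), name) for name, vec in database.items())
    return heapq.nlargest(k, scored)


# Hypothetical embeddings, standing in for vectors computed once per database image.
database = {
    "cat_1.jpg": [0.9, 0.1, 0.0],
    "cat_2.jpg": [0.8, 0.2, 0.1],
    "dog_1.jpg": [0.1, 0.9, 0.2],
}

# Hypothetical embedding of the new image being searched for.
query = [1.0, 0.0, 0.0]

neighbors = nearest_neighbors(query, database, k=2)
for score, name in neighbors:
    print(f"{name}: {score:.3f}")
```

A brute-force scan like this is only practical for small collections; it is shown here purely to illustrate what "searching for a new image" means once the embeddings exist.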

dbickson commented 1 year ago

Closing issue, please reopen if this does not work.

woctezuma commented 1 year ago

Not sure how to make the call to search() work.

%pip install fastdup mediapy
%pip install --ignore-installed Pillow==9.0.0
%cd /content
!wget https://thor.robots.ox.ac.uk/~vgg/data/pets/images.tar.gz
!tar xf images.tar.gz
import fastdup

fd = fastdup.create(input_dir="images/", work_dir="fastdup_work_dir/")
fd.run() # 2 minutes
fastdup.init_search(k=5, work_dir="fastdup_work_dir/")
import mediapy as media

img = media.read_image('/content/images/Abyssinian_1.jpg')
print(img.shape)

length = min(img.shape[:2])
img = img[:length, :length, :]

fastdup.search(img, img.shape[0])

returns:

(400, 600, 3)
Failed to search for image
1

woctezuma commented 1 year ago

I see there is run_mode which may be useful for this task.

dbickson commented 1 year ago

Hi @woctezuma. Here is a code snippet for performing search:

import cv2
import fastdup
import numpy as np  # needed for np.array() below
from PIL import Image

# point to the work_dir of a previous fastdup run; the first argument is k
fastdup.init_search(10, '/Users/dannybickson/visual_database/cxx/unittests/search_out')

img = Image.open('/Users/dannybickson/visual_database/cxx/unittests/one_image/test_1234.jpg')
img = img.resize((224, 224), Image.Resampling.NEAREST)
img_size = 224 * 224
img = cv2.cvtColor(np.array(img), cv2.COLOR_BGR2RGB)
ret = fastdup.search(Image.fromarray(img), img_size, verbose=1)

Please let us know if it works for you!

woctezuma commented 1 year ago

Not sure what I am doing wrong. I have run fastdup once (~ 8 minutes, visualization is fine). Then:

import fastdup
import mediapy as media

from PIL import Image

img_dir = '/content/data'
work_dir = '/content/output'

fd = fastdup.create(work_dir, img_dir)
fd.run(nearest_neighbors_k=1, cc_threshold=0.96)

fastdup.init_search(k=1, work_dir=work_dir)

fname = f'{img_dir}/1056970.jpg'
img = media.read_image(fname)
print(img.shape)

fastdup.search(Image.fromarray(img), img.shape[0]*img.shape[1], verbose=1)

Output:

/usr/local/lib/python3.9/dist-packages/fastdup/fastdup_controller.py:335: UserWarning: Fastdup was already applied, use overwrite=True to re-run
  warnings.warn('Fastdup was already applied, use overwrite=True to re-run')

(224, 224, 3)

Failed to search for image
1

dbickson commented 1 year ago

Hi @woctezuma, your API call looks correct. If you would like to connect for a short Zoom screen-share session, I will be happy to take a look and see why it is failing. Which operating system are you using? Please try version 0.922.

woctezuma commented 1 year ago

Hello, I am running the code on Google Colab. I will try with a smaller dataset.

I will see if I can make a minimal example with the small dataset.

woctezuma commented 1 year ago

If you want to have a look, you can run the following code on Google Colaboratory:

%cd /content

!wget https://github.com/matterport/Mask_RCNN/releases/download/v2.1/balloon_dataset.zip
!unzip -q balloon_dataset.zip

%pip install -q fastdup
import cv2
import fastdup
import numpy as np

from PIL import Image

img_dir = '/content/balloon/train'
work_dir = '/content/output'

fd = fastdup.create(work_dir, img_dir)
fd.run(nearest_neighbors_k=1, cc_threshold=0.96)

fastdup.init_search(k=1, work_dir=work_dir)

fname = f'{img_dir}/10464445726_6f1e3bbe6a_k.jpg'

img = Image.open(fname)
img = img.resize((224, 224))
print(img.size)
img = cv2.cvtColor(np.array(img), cv2.COLOR_BGR2RGB)
print(img.shape)

fastdup.search(Image.fromarray(img), img.shape[0]*img.shape[1], verbose=1)

Output:

FastDup Software, (C) copyright 2022 Dr. Amir Alush and Dr. Danny Bickson.
Warning: Missing file /content/output/component_info.csv

 ########################################################################################

Dataset Analysis Summary: 

    Dataset contains 61 images
    Valid images are 100.00% (61) of the data, invalid are 0.00% (0) of the data
    Similarity:  0.00% (0) belong to 0 similarity clusters (components).
    100.00% (61) images do not belong to any similarity cluster.
    Largest cluster has 0 (0.00%) images.
    For a detailed analysis, use `.connected_components()`
(similarity threshold used is 0.9, connected component threshold used is 0.96).

    Outliers: 4.92% (3) of images are possible outliers, and fall in the bottom 5.00% of similarity values.
    For a detailed list of outliers, use `.outliers()`.

(224, 224)

(224, 224, 3)

Failed to search for image
1

dbickson commented 1 year ago

Hi @woctezuma, thanks for your great feedback; you are helping us make fastdup better! I was able to fix all the issues, as explained below.

1) Our search() capability requires a free sign-up for our fastdup beta version. Please email us at info@visual-layer.com with your name, email, and company/academic institution/non-profit, and we will issue a free license key for you immediately.

2) For some strange reason, you have to enter k>=2 in init_search(). We will fix this requirement in the next version, 0.923; for now, once you get the free license key, please use k>=2.

3) Due to Jupyter issues and the fact that our engine works in C++, I had to "cheat" to get the output, as follows:

!python -c "import fastdup; import cv2; import numpy as np; from PIL import Image; work_dir = '/content/output'; img_dir = '/content/balloon/train'; fastdup.init_search(k=2, work_dir=work_dir, license='<put your license key here>'); \
fname = f'{img_dir}/10464445726_6f1e3bbe6a_k.jpg'; \
img = Image.open(fname); \
img = img.resize((224, 224)); \
img_size = 224 * 224; \
img = cv2.cvtColor(np.array(img), cv2.COLOR_BGR2RGB); \
fastdup.search(Image.fromarray(img), img_size, verbose=1); \
"

We will try to fix the output collection on our side to prevent error messages and traces from getting lost in Jupyter (they are printed to the terminal which opened Jupyter, but of course you cannot see them there). Namely, instead of running in a Jupyter cell, I opened a separate Python process and ran the Python script inside it. This brings all the output back to the cell.
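
The same separate-process idea can also be expressed from Python with the standard library, which avoids the long one-line `!python -c` string. The sketch below is only an illustration: the helper name `run_in_subprocess` is hypothetical, and the placeholder script merely prints to stdout and stderr where the real fastdup.init_search() and fastdup.search() calls would go. It assumes that output written by the child process, including output from native code, reaches the pipes that `subprocess.run` sets up.

```python
import subprocess
import sys
import textwrap


def run_in_subprocess(source: str) -> subprocess.CompletedProcess:
    """Run Python source in a fresh interpreter and capture its stdout and stderr."""
    return subprocess.run(
        [sys.executable, "-c", textwrap.dedent(source)],
        capture_output=True,  # collect both streams instead of losing them
        text=True,            # decode bytes to str
        check=False,          # inspect returncode ourselves rather than raising
    )


# Placeholder script: the fastdup search calls from the command above would go here.
script = """
    import sys
    print("search finished")
    print("diagnostic message", file=sys.stderr)
"""

result = run_in_subprocess(script)
print(result.stdout, end="")
print(result.stderr, end="")
```

Checking `result.returncode` afterwards tells you whether the child process exited cleanly.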

Here is the full notebook for your reference: https://colab.research.google.com/drive/14o8bkWQeiH7dTLDYdc68r0V--bEtKkgE?usp=sharing

4) I found out that we do not support changing the ONNX model on the fly during search; we will fix this in version 0.923. For now, please use the default fastdup model.

Please feel free to share any additional feedback!

woctezuma commented 1 year ago

Thanks! I will try the !python -c "" trick.

Getting the output directly in the notebook would be nice, also during fd.run().

dbickson commented 1 year ago

Hi @woctezuma, based on your useful feedback, we have significantly simplified the search in version 0.923, as follows:

  Example:
        >>> import fastdup
        >>> input_dir = "/my/input/dir"
        >>> work_dir = "/my/work/dir"
        >>> fastdup.run(input_dir, work_dir)
        # point to the work_dir where fastdup was run
        >>> fastdup.init_search(10, work_dir, verbose=True, license=<my license>)
        # The below code can be executed multiple times, each time with a new searched image
        >>> df = fastdup.search("myimage.jpg", None, verbose=True)
        # optional: display search output
        >>> fastdup.create_duplicates_gallery(df, ".", input_dir=input_dir)

Please try it out and let us know if this is useful. We would love to connect for a short Zoom call to learn about your use case and see if we can help further with anything.

woctezuma commented 1 year ago

Thank you. It is fine as it is:

Out of curiosity, I will try version 0.923 on Google Colab once the wheels are available for Linux. 👍

dbickson commented 1 year ago

Hi @woctezuma, now released for Ubuntu 20 and working on 18.