Searching for duplicates in large dataset

visual-layer / fastdup

fastdup is a powerful free tool designed to rapidly extract valuable insights from your image and video datasets. It helps you improve the quality of your dataset's images and labels and reduce your data operations costs at an unparalleled scale.

Searching for duplicates in large dataset #155

Closed. gilamsalem closed this issue 1 year ago

gilamsalem commented 1 year ago

Hi, assume I have a large dataset of ~200M images, and I would like to find duplicates every time a new image arrives in my system. Is it possible to use fastdup for that? What would be the recommended setup? Thanks!

dbickson commented 1 year ago

Hi @gilamsalem, take a look at the fastdup.init_search() and fastdup.search() methods: https://visual-layer.readme.io/docs/v02xx-api#search. The current free version is limited to 1M image models. Internally we are working on an enterprise version which supports search over up to a billion images. We are looking for companies that would like to work with us as design partners, for whom we could enable more advanced functionality. We would love to connect, explore a collaboration, and give a demo of the capabilities and performance.

dbickson commented 1 year ago

Example code is here:

    import fastdup
    from PIL import Image

    # Point to the work_dir where fastdup was run
    fastdup.init_search(10, 'search_out')

    # Load the query image and resize it to 224x224
    img = Image.open('test_1234.jpg')
    img = img.resize((224, 224), Image.Resampling.NEAREST)
    img_size = 224 * 224  # width * height of the resized image

    ret = fastdup.search(img, img_size, verbose=1)
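
For readers who want to see the general pattern behind the question (build an index once, then query it each time a new image arrives), here is a toy sketch in plain Python. It is not fastdup's implementation and does not use its API: the class `ToyDuplicateIndex`, its methods, and the `threshold` parameter are all hypothetical names chosen for illustration, and the feature vectors are assumed to come from some embedding model that is not shown. The scan is brute force, so it only demonstrates the idea and would not scale to hundreds of millions of images.

```python
import heapq
import math


class ToyDuplicateIndex:
    """Brute-force cosine-similarity index over feature vectors (illustration only)."""

    def __init__(self, k=10, threshold=0.95):
        self.k = k                  # maximum number of neighbors to return
        self.threshold = threshold  # minimum cosine similarity to count as a duplicate
        self._ids = []
        self._vecs = []

    @staticmethod
    def _normalize(vec):
        # Scale to unit length so a dot product equals cosine similarity.
        norm = math.sqrt(sum(x * x for x in vec))
        if norm == 0:
            raise ValueError("feature vector must be non-zero")
        return [x / norm for x in vec]

    def add(self, image_id, vec):
        """Store one image's feature vector in the index."""
        self._ids.append(image_id)
        self._vecs.append(self._normalize(vec))

    def search(self, vec):
        """Return up to k (image_id, similarity) pairs at or above threshold, best first."""
        query = self._normalize(vec)
        scored = (
            (sum(a * b for a, b in zip(query, stored)), image_id)
            for image_id, stored in zip(self._ids, self._vecs)
        )
        top = heapq.nlargest(self.k, scored, key=lambda pair: pair[0])
        return [(image_id, sim) for sim, image_id in top if sim >= self.threshold]

    def check_and_add(self, image_id, vec):
        """Look for duplicates of a newly arrived image, then add it to the index."""
        matches = self.search(vec)
        self.add(image_id, vec)
        return matches


# Hypothetical usage with made-up 3-dimensional feature vectors.
index = ToyDuplicateIndex(k=2, threshold=0.99)
index.add("a.jpg", [1.0, 0.0, 0.0])
index.add("b.jpg", [0.0, 1.0, 0.0])
matches = index.check_and_add("c.jpg", [2.0, 0.0, 0.0])
print(matches)
```

The `check_and_add` step mirrors the use case in the question: each arriving image is compared against everything indexed so far and then becomes searchable itself.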