[Bug]: Run is crashing when specifying embeddings

Yann-CV commented 1 month ago

What happened?

Running Fastdup.run is crashing when providing already computed embeddings

What did you expect to see?

no failure

What version of fastdup were you running on?

1.124

What version of Python were you running on?

Python 3.10

Operating System

Ubuntu 20.04

Reproduction steps

import fastdup
import torch

fd = fastdup.create()
fd.run(embeddings=torch.randn((100, 384)).numpy(),)

Relevant log output

Traceback (most recent call last):
  File "/home/yann-cv/sensei/spark/.venv/lib/python3.10/site-packages/fastdup/sentry.py", line 135, in inner_function
    ret = func(*args, **kwargs)
  File "/home/yann-cv/sensei/spark/.venv/lib/python3.10/site-packages/fastdup/fastdup_controller.py", line 600, in run
    self._create_img_mapping()
  File "/home/yann-cv/sensei/spark/.venv/lib/python3.10/site-packages/fastdup/fastdup_controller.py", line 1055, in _create_img_mapping
    df_annot = pd.merge(self._df_annot, total, on=FD.ANNOT_FILENAME, how='left')
  File "/home/yann-cv/sensei/spark/.venv/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 170, in merge
    op = _MergeOperation(
  File "/home/yann-cv/sensei/spark/.venv/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 807, in __init__
    self._maybe_coerce_merge_keys()
  File "/home/yann-cv/sensei/spark/.venv/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 1508, in _maybe_coerce_merge_keys
    raise ValueError(msg)
ValueError: You are trying to merge on object and int64 columns for key 'filename'. If you wish to proceed you should use pd.concat

pandas==2.2.2




dbickson commented 1 month ago

Hi @Yann-CV, apologies for the unclear error message. You did not specify input_dir, which should be a list of image locations matching the numpy embeddings array in length. In addition, work_dir should point to a temporary workspace for storing intermediate files. We will make the error message clearer.

This is the correct format, as described in our tutorial: Run On Pre-computed Feature Vectors

If you have pre-computed feature vectors using fastdup or any other methods, you can input the features directly into fastdup to analyze for issues. Running fastdup on feature vectors instead of raw images decreases run time significantly.

The following code snippet shows how to run with your own features stored in a numpy matrix, along with a list of the matching filenames.

import numpy as np
import fastdup

# Replace the below code with computation of your own features
matrix = np.random.rand(2, 576).astype('float32')
flist = ["/data/myimage1.jpg", "/data/myimage2.jpg"]

# File paths must be absolute, not relative
fd = fastdup.create(input_dir='/data/', work_dir='output')
fd.run(annotations=flist, embeddings=matrix)
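
Until the library reports this more clearly, a small pre-flight check on the caller's side can catch the mismatch before `fd.run` is called. This is a minimal sketch: `check_embedding_inputs` is a hypothetical helper, not part of the fastdup API, and it only encodes the two constraints stated above (one filename per embedding row, absolute paths). The 2-D shape check is an extra assumption.

```python
import os

import numpy as np


def check_embedding_inputs(filenames, embeddings):
    """Hypothetical pre-flight check; not part of the fastdup API."""
    if filenames is None:
        raise ValueError("embeddings were given without a matching list of filenames")

    embeddings = np.asarray(embeddings)
    # Assumption: embeddings form a 2-D matrix with one row per image.
    if embeddings.ndim != 2:
        raise ValueError(f"expected a 2-D embeddings matrix, got shape {embeddings.shape}")

    if len(filenames) != embeddings.shape[0]:
        raise ValueError(
            f"{len(filenames)} filenames but {embeddings.shape[0]} embedding rows"
        )

    # The tutorial comment asks for absolute paths rather than relative ones.
    relative = [name for name in filenames if not os.path.isabs(name)]
    if relative:
        raise ValueError(f"relative paths are not accepted: {relative}")

    return embeddings


matrix = np.random.rand(2, 576).astype("float32")
flist = ["/data/myimage1.jpg", "/data/myimage2.jpg"]
checked = check_embedding_inputs(flist, matrix)
```

Passing `None` for the file list, or a list whose length differs from the number of rows, raises a ValueError with a readable message instead of the pandas merge error shown in the traceback.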

Yann-CV commented 1 month ago

@dbickson thanks for the feedback. Indeed it works with the list of filenames.

In my opinion, if this image list is required at some point within the run, it should not be annotated as Optional in the code.

dbickson commented 1 month ago

Hi @Yann-CV, version 1.125 guards better against embeddings passed without the file list. Let us know if you observe any other issues.

visual-layer / fastdup