visual-layer / fastdup

fastdup is a powerful free tool designed to rapidly extract valuable insights from your image & video datasets. It helps you improve the quality of your dataset's images and labels and reduce your data operations costs at an unparalleled scale.

Embeddings param uses hard coded 576 embedding dim and not dim passed #147

Closed lanalex closed 1 year ago

lanalex commented 1 year ago

When passing embeddings=embeddings_matrix (an np array of shape (N, embedding_dim) and type float32) to fastdup.run, the code throws an error saying that the embedding dim is different from 576. The cause appears to be the hard-coded default in the constructor below. The expected behaviour is that the embedding dim is deduced automatically from the matrix that was passed. (Also, a simple doc example for this use case would be much appreciated :) )


class FastdupController:
    @v1_sentry_handler
    def __init__(self, work_dir: Union[str, Path], input_dir: Union[str, Path] = None):
        """
        This class serves as a proxy for fastdup basic usage,
        the class wraps fastdup-run call provides quick access to
        fastdup files such as: similarity,  csv outlier csv, etc...

        Moreover, the class provides several extra features:
            - Ability to run connected component analysis on splits without calling fastdup run again
            - Ability to add annotation file and quickly merge it to any of fastdup inputs
        Currently the class support running fastdup on images and object
        :param work_dir: target output dir or existing output dir
        :param input_dir: (Optional) path to data dir
        """
        # check if fastdup was already applied
        self._fastdup_applied = is_fastdup_dir(work_dir)
        self._work_dir = Path(work_dir)
        self._input_dir = input_dir if input_dir is None else get_input_dir(input_dir)

        # set default arguments
        self._df_annot = None
        self._run_mode = FD.MODE_DEFAULT
        self._embeddings_dim = 576
        self._max_fd_id = 0
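
The expected behaviour described in this issue (deducing the dim from the matrix instead of relying on the hard-coded 576) can be sketched as follows. This is only an illustration under stated assumptions: `resolve_embedding_dim` and `DEFAULT_EMBEDDING_DIM` are hypothetical names that do not exist in fastdup, and the actual fix in the library may look different.

```python
import numpy as np

# Mirrors the hard-coded default seen in the constructor quoted above.
DEFAULT_EMBEDDING_DIM = 576


def resolve_embedding_dim(embeddings=None, default_dim=DEFAULT_EMBEDDING_DIM):
    """Hypothetical helper (not part of fastdup).

    Returns the embedding dim taken from a user-supplied (N, embedding_dim)
    matrix, falling back to the default only when no matrix is given.
    """
    if embeddings is None:
        return default_dim
    matrix = np.asarray(embeddings)
    if matrix.ndim != 2:
        raise ValueError(
            f"expected a 2-D (N, embedding_dim) array, got shape {matrix.shape}"
        )
    if matrix.dtype != np.float32:
        raise TypeError(f"expected float32 embeddings, got {matrix.dtype}")
    # The second axis is the embedding dimension.
    return int(matrix.shape[1])


# Example: 10 vectors with 128 dimensions each.
embeddings_matrix = np.random.rand(10, 128).astype(np.float32)
dim = resolve_embedding_dim(embeddings_matrix)
```

With a check like this, the 576 default would only apply when the caller passes no embeddings at all.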

dbickson commented 1 year ago

Thanks @lanalex for pointing out this bug and unclear documentation. It will be fixed in 0.917 to be released later today.

lanalex commented 1 year ago

@dbickson Just to clarify, the error is not only the dimension. There are also some issues with the dataframe input structure. Could you also test it by passing a YOLO-compatible dataframe (a dataframe with img_filename, bbox_h, bbox_w, etc.)? I had to add an index column (explicitly) and modify the code a bit to get it to work. Otherwise it fails on this assert: `assert 'index' in df_mapping.columns and 'filename' in df_mapping.columns`
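
For readers hitting the same assert, here is a rough sketch of the kind of dataframe preparation that would satisfy it. It is an assumption-laden illustration, not the reporter's actual workaround: `prepare_mapping` is a hypothetical helper, and copying `img_filename` into a `filename` column is a guess at what the check expects. Only the `index` and `filename` column names come from the assert itself.

```python
import pandas as pd


def prepare_mapping(df, filename_col="img_filename"):
    """Hypothetical helper (not part of fastdup).

    Returns a copy of df that has the explicit 'index' and 'filename'
    columns required by the assert quoted above.
    """
    out = df.copy()
    if "filename" not in out.columns:
        # Assumption: the image path column can serve as 'filename'.
        out["filename"] = out[filename_col]
    if "index" not in out.columns:
        # Add 'index' as a real column, not just the dataframe's index.
        out = out.reset_index(drop=True)
        out["index"] = range(len(out))
    return out


# A small YOLO-style dataframe using the column names mentioned in the thread.
df = pd.DataFrame(
    {
        "img_filename": ["a.jpg", "b.jpg"],
        "bbox_h": [10, 20],
        "bbox_w": [5, 8],
    }
)
df_mapping = prepare_mapping(df)

# The same check as the assert from the thread.
assert "index" in df_mapping.columns and "filename" in df_mapping.columns
```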