visual-layer / fastdup

fastdup is a powerful free tool designed to rapidly extract valuable insights from your image and video datasets. It helps you improve the quality of your dataset images and labels and reduce your data operations costs at an unparalleled scale.

[Bug]: AssertionError when using a list of local files as input_dir #187

Closed markus-stoll closed 1 year ago

markus-stoll commented 1 year ago

What happened?

Using a list of local files in fastdup.create(input_dir=input_files, ...) causes an AssertionError: Failed to extract any files from list

I will provide a fix in a fork and open a pull request.

What did you expect to see?

########################################################################################
Dataset Analysis Summary:

Dataset contains 100 images

.....

What version of fastdup were you running on?

0.927

What version of Python were you running on?

Python 3.9

Operating System

Ubuntu

Reproduction steps

Install the Hugging Face datasets library (pip install datasets), then run:

import datasets
import fastdup
from pathlib import Path

# load food101 dataset with images as local files paths
dataset = datasets.load_dataset("renumics/food101-enriched", split="all")
df = dataset.to_pandas()

# get the first 100 image local paths
input_files = [str(x) for x in df['image'].iloc[:100]]
print(input_files[:5])
assert all([Path(x).exists() for x in input_files])

# create a fastdup with the input files and run it
fd = fastdup.create(work_dir="/tmp/fastdub_workdir", input_dir=input_files)
fd.run(ccthreshold=0.9) 

Relevant log output

AssertionError                            Traceback (most recent call last)
Cell In[8], line 16
     14 # create a fastdup with the input files and run it
     15 fd = fastdup.create(work_dir="/tmp/fastdub_workdir", input_dir=input_files)
---> 16 fd.run(ccthreshold=0.9) 

File ~/playbook/.venv/lib/python3.9/site-packages/fastdup/engine.py:157, in Fastdup.run(self, input_dir, annotations, embeddings, subset, data_type, overwrite, model_path, distance, nearest_neighbors_k, threshold, outlier_percentile, num_threads, num_images, verbose, license, high_accuracy, cc_threshold, **kwargs)
    154     fastdup_func_params['model_path'] = model_path
    155 fastdup_func_params.update(kwargs)
--> 157 super().run(annotations=annotations, input_dir=input_dir, subset=subset, data_type=data_type,
    158             overwrite=overwrite, embeddings=embeddings, **fastdup_func_params)

File ~/playbook/.venv/lib/python3.9/site-packages/fastdup/sentry.py:134, in v1_sentry_handler.<locals>.inner_function(*args, **kwargs)
    132 except Exception as ex:
    133     fastdup_capture_exception(f"V1:{func.__name__}", ex)
--> 134     raise ex

File ~/playbook/.venv/lib/python3.9/site-packages/fastdup/sentry.py:128, in v1_sentry_handler.<locals>.inner_function(*args, **kwargs)
    126 try:
    127     start_time = time.time()
--> 128     ret = func(*args, **kwargs)
    129     fastdup_performance_capture(f"V1:{func.__name__}", start_time)
    130     return ret
...
    211             assert False, f"Unknown file type encountered in list: {f}"
--> 213 assert len(files), "Failed to extract any files from list"
    214 return files

AssertionError: Failed to extract any files from list
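
To narrow down why a list is rejected, a small pre-flight check that does not depend on fastdup can help. The sketch below only classifies the list entries. The function name summarize_paths and the set of image suffixes are illustrative assumptions, not fastdup's actual logic.

```python
from collections import Counter
from pathlib import Path

# Assumption: an illustrative set of image suffixes, not fastdup's own list.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp", ".gif", ".tif", ".tiff", ".webp"}


def summarize_paths(paths):
    """Count list entries that are missing, directories, images, or other files."""
    summary = Counter()
    for p in map(Path, paths):
        if not p.exists():
            summary["missing"] += 1
        elif p.is_dir():
            summary["directory"] += 1
        elif p.suffix.lower() in IMAGE_SUFFIXES:
            summary["image"] += 1
        else:
            # Group unexpected files by suffix so odd entries stand out.
            summary["other:" + (p.suffix.lower() or "<no suffix>")] += 1
    return dict(summary)
```

Calling summarize_paths(input_files) before fd.run() shows at a glance whether the list holds what you expect. If every entry is counted as an image and the assertion still fires, the problem is more likely in how the list is parsed than in the files themselves.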


dbickson commented 1 year ago

Hi @markus-stoll, can you try following the instructions at https://visual-layer.readme.io/docs/v1-api#fastdup.engine.Fastdup.run? Namely, run with input_dir pointing to a folder and pass the subset argument as a list of the files from that folder that you want to work on. Let us know if this works.

If that does not work, you can try the v0 API (https://visual-layer.readme.io/docs/v02xx-api), namely run with:
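
A minimal sketch of that suggestion, assuming the files share a common parent folder. The helper names split_dir_and_subset and run_on_subset are hypothetical. Whether subset expects full paths or file names relative to input_dir is also an assumption here, so check the linked documentation.

```python
import os


def split_dir_and_subset(paths):
    """Return (common_parent_dir, absolute_paths) for a non-empty list of files."""
    if not paths:
        raise ValueError("paths must not be empty")
    abs_paths = [os.path.abspath(p) for p in paths]
    # Compare parent directories so a single-file list still yields a folder.
    parent = os.path.commonpath([os.path.dirname(p) for p in abs_paths])
    return parent, abs_paths


def run_on_subset(paths, work_dir):
    # Imported lazily so the helper above works without fastdup installed.
    import fastdup

    input_dir, subset = split_dir_and_subset(paths)
    fd = fastdup.create(work_dir=work_dir, input_dir=input_dir)
    # Assumption: `subset` accepts full paths; relative names may be required.
    fd.run(subset=subset)
    return fd
```

With the list from the reproduction steps, this would be called as run_on_subset(input_files, "/tmp/fastdub_workdir"). If the files are spread across unrelated folders, the common parent may be very broad, so it is worth printing it before running.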

import fastdup
fastdup.run(input_dir=[put your list here], work_dir='/path/to/work/dir')

Let us know if this works. Once we get your response, we will try to clarify the documentation to make fastdup simpler to use with Hugging Face datasets.

dbickson commented 1 year ago

Hi @markus-stoll, we found the source of the issue and will release a fix in version 0.928 by tomorrow.

markus-stoll commented 1 year ago

Sounds good. You could also check out my pull request #188. It fixed the problem for me. Maybe it helps.

dbickson commented 1 year ago

@markus-stoll you are a mind reader! I did almost 100% the same fix:

[Screenshot: Screen Shot 2023-05-10 at 20 48 16]