visual-layer / fastdup

fastdup is a powerful, free tool designed to rapidly generate valuable insights from image and video datasets. It helps enhance the quality of both images and labels, while significantly reducing data operation costs, all with unmatched scalability.

[Bug]: "AssertionError: Failed to find input dir please check your input." for zipped image input #206

Closed zilunzhang closed 1 year ago

zilunzhang commented 1 year ago

What happened?

I tried to use local zipped images for data cleaning, and I encountered a problem...

Following this link and this link, I ran fastdup (v1.3) with the following code:

  1. work_dir = "/media/zilun/wd-16/RS5M_T_dataset/tmp-clean-outlier-zip-official/"
  2. images_dir = "/media/zilun/wd-16/RS5M_T_dataset/zip/tmp/"
  3. fastdup.run(input_dir=images_dir, run_mode=1, work_dir=work_dir, nearest_neighbors_k=5, threshold=0.9, high_accuracy=True)
  4. fastdup.run(input_dir='', run_mode=2, work_dir=work_dir, nearest_neighbors_k=5, threshold=0.9, high_accuracy=True)

An error occurs when line 4 runs:

The structure of the work directory is:

Then I tried to deal with zipped images in this way:

  1. work_dir = "/media/zilun/wd-16/RS5M_T_dataset/tmp-clean-outlier-zip-create/"
  2. images_dir = "/media/zilun/wd-16/RS5M_T_dataset/zip/tmp/"
  3. fd = fastdup.create(work_dir, images_dir)
  4. fd.run(nearest_neighbors_k=5, threshold=0.9, cc_threshold=0.9, high_accuracy=True, outlier_percentile=0.01, run_mode=1)

An error pops up again in line 3:

The structure of the work directory is:

There isn't any file named "atrain_features.dat.csv", but "atrain_mediazilunwd-16RS5M_T_datasetziptmpvlmf-0.95_cf-0.95.zipfeatures.dat.csv" exists. Maybe the filename is incorrect?
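
To check the filename hypothesis, it can help to list the feature CSVs that actually exist in the work directory. The following is a minimal sketch using only the standard library. The helper name `find_feature_csvs` is hypothetical, and the expected filename is taken from the assertion message in the log.

```python
from pathlib import Path


def find_feature_csvs(work_dir, expected="atrain_features.dat.csv"):
    """Return (expected_exists, other_candidates) for a fastdup work_dir.

    Hypothetical helper, not part of fastdup. It only inspects file names.
    """
    root = Path(work_dir)
    # Top-level files whose name ends with "features.dat.csv".
    matches = sorted(p.name for p in root.glob("*features.dat.csv") if p.is_file())
    others = [name for name in matches if name != expected]
    return expected in matches, others
```

If the first value is False and the second list is non-empty, the features file was written under a different name than the one the v1 controller looks for, which matches what is described above.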

What did you expect to see?

Line 4 executes successfully without any error, and all outliers can be listed.

What version of fastdup were you running on?

1.3

What version of Python were you running on?

Python 3.8

Operating System

Ubuntu 22.04.1 LTS

Reproduction steps

No response

Relevant log output

FastDup Software, (C) copyright 2022 Dr. Amir Alush and Dr. Danny Bickson.
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[12], line 1
----> 1 fastdup.run(input_dir='', run_mode=2, work_dir=work_dir, nearest_neighbors_k=5, threshold=0.9, high_accuracy=True)

File ~/anaconda3/envs/cuda11/lib/python3.8/site-packages/fastdup/__init__.py:698, in run(input_dir, work_dir, test_dir, compute, verbose, num_threads, num_images, turi_param, distance, threshold, lower_threshold, model_path, license, version, nearest_neighbors_k, d, run_mode, nn_provider, min_offset, max_offset, nnf_mode, nnf_param, bounding_box, batch_size, resume, high_accuracy)
    694         out_df.to_csv(input_dir, index=False)
    696     turi_param = turi_param.replace(',save_crops=1', '')
--> 698 ret = do_run(input_dir=input_dir,
    699          work_dir=work_dir,
    700          test_dir=test_dir,
    701          compute=compute,
    702          verbose=verbose,
    703          num_threads=num_threads,
    704          num_images=num_images,
    705          turi_param=turi_param if not fd_model else turi_param.replace(',save_crops=1','').replace('save_crops=1',''),
    706          distance=distance,
    707          threshold=threshold,
    708          lower_threshold=lower_threshold,
    709          model_path=model_path,
    710          license=license,
    711          version=version,
    712          nearest_neighbors_k=nearest_neighbors_k,
    713          d=d,
    714          run_mode=run_mode,
    715          nn_provider=nn_provider,
    716          min_offset=min_offset,
    717          max_offset=max_offset,
    718          nnf_mode=nnf_mode,
    719          nnf_param=nnf_param,
    720          bounding_box='',
    721          batch_size = batch_size,
    722          resume = resume,
    723          high_accuracy=high_accuracy)
    724 return ret

File ~/anaconda3/envs/cuda11/lib/python3.8/site-packages/fastdup/__init__.py:219, in do_run(input_dir, work_dir, test_dir, compute, verbose, num_threads, num_images, turi_param, distance, threshold, lower_threshold, model_path, license, version, nearest_neighbors_k, d, run_mode, nn_provider, min_offset, max_offset, nnf_mode, nnf_param, bounding_box, batch_size, resume, high_accuracy)
    217             print("Warning: Reading images directly with s3 may result in slow execution. If you have enough disk space it is recommened to run with sync_s3_to_local=True. This will download the s3 content first to the local drive and then run fastdup.")
    218     else:
--> 219         assert False, f"Failed to find input dir {input_dir} please check your input."
    220 else:
    221     if os.path.isfile(input_dir):

AssertionError: Failed to find input dir  please check your input.

Traceback (most recent call last):
  File "/home/zilun/anaconda3/envs/cuda11/lib/python3.8/site-packages/fastdup/sentry.py", line 130, in inner_function
    ret = func(*args, **kwargs)
  File "/home/zilun/anaconda3/envs/cuda11/lib/python3.8/site-packages/fastdup/fastdup_controller.py", line 373, in run
    self._create_img_mapping()
  File "/home/zilun/anaconda3/envs/cuda11/lib/python3.8/site-packages/fastdup/fastdup_controller.py", line 758, in _create_img_mapping
    assert df_mapping is not None and not df_mapping.empty, f"Failed to find {FD.MAPPING_CSV} in work_dir"
AssertionError: Failed to find atrain_features.dat.csv in work_dir
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[8], line 1
----> 1 fd.run(nearest_neighbors_k=5, threshold=0.9, cc_threshold=0.9, high_accuracy=True, outlier_percentile=0.01, run_mode=1)

File ~/anaconda3/envs/cuda11/lib/python3.8/site-packages/fastdup/engine.py:157, in Fastdup.run(self, input_dir, annotations, embeddings, subset, data_type, overwrite, model_path, distance, nearest_neighbors_k, threshold, outlier_percentile, num_threads, num_images, verbose, license, high_accuracy, cc_threshold, **kwargs)
    154     fastdup_func_params['model_path'] = model_path
    155 fastdup_func_params.update(kwargs)
--> 157 super().run(annotations=annotations, input_dir=input_dir, subset=subset, data_type=data_type,
    158             overwrite=overwrite, embeddings=embeddings, **fastdup_func_params)

File ~/anaconda3/envs/cuda11/lib/python3.8/site-packages/fastdup/sentry.py:136, in v1_sentry_handler.<locals>.inner_function(*args, **kwargs)
    134 except Exception as ex:
    135     fastdup_capture_exception(f"V1:{func.__name__}", ex)
--> 136     raise ex

File ~/anaconda3/envs/cuda11/lib/python3.8/site-packages/fastdup/sentry.py:130, in v1_sentry_handler.<locals>.inner_function(*args, **kwargs)
    128 try:
    129     start_time = time.time()
--> 130     ret = func(*args, **kwargs)
    131     fastdup_performance_capture(f"V1:{func.__name__}", start_time)
    132     return ret

File ~/anaconda3/envs/cuda11/lib/python3.8/site-packages/fastdup/fastdup_controller.py:373, in FastdupController.run(self, input_dir, annotations, subset, embeddings, data_type, overwrite, print_summary, **fastdup_kwargs)
    369 #fastdup_convert_to_relpath(self._work_dir, self._filename_prefix)
    370 
    371 # post process - map fastdup-id to image (for bbox this is done in self._set_fastdup_input)
    372 if self._dtype == FD.IMG or self._run_mode == FD.MODE_CROP:
--> 373     self._create_img_mapping()
    375 # expand annotation csv to include files that are not in annotation but is in subset
    376 self._expand_annot_df()

File ~/anaconda3/envs/cuda11/lib/python3.8/site-packages/fastdup/fastdup_controller.py:758, in FastdupController._create_img_mapping(self)
    756 # get mapping df from fastdup
    757 df_mapping = self._fetch_df(FD.MAPPING_CSV)
--> 758 assert df_mapping is not None and not df_mapping.empty, f"Failed to find {FD.MAPPING_CSV} in work_dir"
    759 df_mapping = df_mapping.reset_index()
    760 if FD.MAP_INST_ID not in df_mapping.columns:

AssertionError: Failed to find atrain_features.dat.csv in work_dir

Attach a screenshot [Optional]

Screenshot from 2023-05-20 18-09-50 Screenshot from 2023-05-20 18-07-16

Contact Details [Optional]

zilun@cs.toronto.edu

dbickson commented 1 year ago

Hi @zilunzhang, for step 4, can you please try running with input_dir=images_dir? Tar/zip files are only supported with the v0.2 API and not with the v1 API, so running fd = fastdup.create() does not work yet on compressed files.
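
A minimal sketch of the suggested change, assuming the same v0.2-style fastdup.run call and parameter values as in the report. The helper names `two_stage_kwargs` and `run_two_stage` are hypothetical. The only difference from the original code is that the second call passes input_dir=images_dir instead of an empty string.

```python
def two_stage_kwargs(images_dir, work_dir):
    """Build keyword arguments for the two fastdup.run calls from the report."""
    common = dict(
        work_dir=work_dir,
        nearest_neighbors_k=5,
        threshold=0.9,
        high_accuracy=True,
    )
    # Both stages point at the directory of zipped images. The original
    # second call used input_dir='' and hit the "Failed to find input dir"
    # assertion.
    return [
        dict(input_dir=images_dir, run_mode=1, **common),
        dict(input_dir=images_dir, run_mode=2, **common),
    ]


def run_two_stage(images_dir, work_dir):
    """Run both stages; requires fastdup to be installed."""
    import fastdup  # imported lazily so two_stage_kwargs works without it

    return [fastdup.run(**kw) for kw in two_stage_kwargs(images_dir, work_dir)]
```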

zilunzhang commented 1 year ago

Thanks Danny, that works.

Maybe you would like to revise the documentation here sometime...

One more thing: I found that the tmp folder that stores the unzipped images (/media/zilun/wd-16/RS5M_T_dataset/tmp-clean-outlier-zip-official/tmp/mediazilunwd-16RS5M_T_datasetziptmpvlmf-0.95_cf-0.95.zip) became empty after running step 4, so the outlier report cannot find the images at the paths it references. For example: failed to read image from img_path /media/zilun/wd-16/RS5M_T_dataset/tmp-clean-outlier-zip-official/tmp/mediazilunwd-16RS5M_T_datasetziptmpvlmf-0.95_cf-0.95.zip/laion2b_128_107424.jpg. Any suggestions on that?

dbickson commented 1 year ago

Hi @zilunzhang, we already fixed the documentation! We recommend running with turi_param='delete_tar=0,delete_img=0' if you want to keep the tar downloaded from s3 and keep the extracted images instead of deleting them. We will fix the doc as well.
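
As a sketch of how that could be passed: the traceback above shows turi_param handled as a comma-separated string of key=value flags, so the two flags can be appended to whatever is already set. `add_turi_flags` is a hypothetical helper, not part of fastdup.

```python
def add_turi_flags(turi_param="", **flags):
    """Append key=value flags to a comma-separated turi_param string."""
    parts = [p for p in turi_param.split(",") if p]
    # Drop any earlier setting of a flag we are about to set.
    parts = [p for p in parts if p.split("=", 1)[0] not in flags]
    parts += [f"{key}={value}" for key, value in flags.items()]
    return ",".join(parts)


# Keep the downloaded tar and the extracted images after the run.
turi_param = add_turi_flags(delete_tar=0, delete_img=0)

# Then, for example (requires fastdup and the paths from the report):
# fastdup.run(input_dir=images_dir, run_mode=2, work_dir=work_dir,
#             nearest_neighbors_k=5, threshold=0.9, high_accuracy=True,
#             turi_param=turi_param)
```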

zilunzhang commented 1 year ago

Thank you, I will try it now!