visual-layer / fastdup

fastdup is a powerful free tool designed to rapidly extract valuable insights from your image and video datasets. It helps you improve the quality of your dataset images and labels and reduce your data operations costs at scale.

[Bug]: analyzing-object-detection-dataset.ipynb error #204

Closed Lifeguard-alex closed 1 year ago

Lifeguard-alex commented 1 year ago

What happened?

analyzing-object-detection-dataset.ipynb

Error raised by fd.run (called right after fastdup.create):

AssertionError: Got wrong annotation parameter, should be pd.DataFrame with the mandatory columns: filename img_filename bbox_x bbox_y bbox_w bbox_h label ext split

Version: latest (1.2)

What did you expect to see?

No response

What version of fastdup were you running on?

2.1

What version of Python were you running on?

Other

Operating System

colab

Reproduction steps

1

Relevant log output

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/fastdup/sentry.py", line 130, in inner_function
    ret = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/fastdup/fastdup_controller.py", line 347, in run
    assert isinstance(annotations, pd.DataFrame) and not annotations.empty and "filename" in annotations.columns, f"Got wrong annotation parameter, should be pd.DataFrame with the mandatory columns: filename {annotations}"
AssertionError: Got wrong annotation parameter, should be pd.DataFrame with the mandatory columns: filename             img_filename  bbox_x  bbox_y  bbox_w  bbox_h         label  ext  split
0       000000131075.jpg   20.23   55.98  313.49  326.50            tv    0  train
1       000000131075.jpg  176.90  381.12  286.20  136.63        laptop    0  train
2       000000131075.jpg  369.96  361.35   72.76   73.91        laptop    0  train
3       000000131075.jpg  411.68  417.87   66.32  129.44         chair    0  train
4       000000131075.jpg  367.31  363.25   72.27   67.01            tv    0  train
...                  ...     ...     ...     ...     ...           ...  ...    ...
183541  000000262103.jpg    2.45    0.91   94.03  181.51           car    0  train
183542  000000393195.jpg    6.10  214.53  331.31  262.83          boat    0  train
183543  000000393195.jpg   46.37    3.34  593.63  478.66        person    0  train
183544  000000393195.jpg  419.40    0.88  217.84  309.23        person    0  train
183545  000000131067.jpg    4.21    1.17  628.93  421.75  fire hydrant    0  train

[183544 rows x 8 columns]
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-7-63943d33e800> in <cell line: 6>()
      4 
      5 fd = fastdup.create(work_dir=work_dir, input_dir=image_dir)
----> 6 fd.run(annotations=coco_annotations)

3 frames
/usr/local/lib/python3.10/dist-packages/fastdup/fastdup_controller.py in run(self, input_dir, annotations, subset, embeddings, data_type, overwrite, print_summary, **fastdup_kwargs)
    345                 annotations = pd.DataFrame({'filename':annotations})
    346 
--> 347             assert isinstance(annotations, pd.DataFrame) and not annotations.empty and "filename" in annotations.columns, f"Got wrong annotation parameter, should be pd.DataFrame with the mandatory columns: filename {annotations}"
    348             first_filename = annotations['filename'].values[0]
    349             if (str(input_dir)) != ".":

AssertionError: Got wrong annotation parameter, should be pd.DataFrame with the mandatory columns: filename             img_filename  bbox_x  bbox_y  bbox_w  bbox_h         label  ext  split
0       000000131075.jpg   20.23   55.98  313.49  326.50            tv    0  train
1       000000131075.jpg  176.90  381.12  286.20  136.63        laptop    0  train
2       000000131075.jpg  369.96  361.35   72.76   73.91        laptop    0  train
3       000000131075.jpg  411.68  417.87   66.32  129.44         chair    0  train
4       000000131075.jpg  367.31  363.25   72.27   67.01            tv    0  train
...                  ...     ...     ...     ...     ...           ...  ...    ...
183541  000000262103.jpg    2.45    0.91   94.03  181.51           car    0  train
183542  000000393195.jpg    6.10  214.53  331.31  262.83          boat    0  train
183543  000000393195.jpg   46.37    3.34  593.63  478.66        person    0  train

Contact Details [Optional]

alex@lifeguard-ai.com

dbickson commented 1 year ago

Hi @Lifeguard-alex, apologies for the error. It is a documentation mistake. Please change the column name img_filename to filename in your annotation DataFrame and let us know if this works.
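
A minimal sketch of that rename, using a two-row stand-in for the notebook's coco_annotations DataFrame (the values are copied from the log output in this issue; the rest of the notebook is not reproduced here):

```python
import pandas as pd

# Toy stand-in for the notebook's coco_annotations DataFrame.
coco_annotations = pd.DataFrame({
    "img_filename": ["000000131075.jpg", "000000131075.jpg"],
    "bbox_x": [20.23, 176.90],
    "bbox_y": [55.98, 381.12],
    "bbox_w": [313.49, 286.20],
    "bbox_h": [326.50, 136.63],
    "label": ["tv", "laptop"],
    "ext": [0, 0],
    "split": ["train", "train"],
})

# Rename only the column named in the assertion message; the others stay as they are.
coco_annotations = coco_annotations.rename(columns={"img_filename": "filename"})
print(list(coco_annotations.columns))
```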

Lifeguard-alex commented 1 year ago

After renaming img_filename to filename, I get a new error: AssertionError: annotation dataframe should contain full path filenames, starting with coco_minitrain_25k/images/train2017

Remember, I am running your Colab project.

dbickson commented 1 year ago

Hi @Lifeguard-alex, yes, you need to use full file names when creating the annotations DataFrame. That is, include the folder coco_minitrain_25k/images/train2017 as well, not just the image name. For example: coco_minitrain_25k/images/train2017/image1.jpg

Lifeguard-alex commented 1 year ago

Sorry, I don't get it. This is from your example: https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/analyzing-object-detection-dataset.ipynb

This example has bugs and does not work. How can I fix the path, given that the notebook downloads the COCO dataset from the net? This is not my code.

dbickson commented 1 year ago

Hi @Lifeguard-alex, sorry about that. Let me fix the example and get back to you shortly.

Lifeguard-alex commented 1 year ago

I fixed the filename error and the full image path in the CSV, and now I have a new error: AssertionError: df_annot must contain unique filenames, found repeating filenames

Guys, I am going to give up :) The examples just do not work and are full of bugs, both in the ipynb and in the code example.

Please make your code work with COCO JSON, with multiple filenames and multiple image directories.

Do you have any working example for COCO JSON that I can test?

dbickson commented 1 year ago

Hi @Lifeguard-alex, I have shared a fixed example. The main fix was:

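import pandas as pd

# The CSV has no header row, so the column names are supplied explicitly.
coco_csv = 'coco_minitrain_25k/annotations/coco_minitrain2017.csv'
coco_annotations = pd.read_csv(coco_csv, header=None, names=['filename', 'col_x', 'row_y',
                                                             'width', 'height', 'label', 'ext'])

coco_annotations['split'] = 'train'  # Only train files were loaded
# Prepend the image folder so that every filename is a full path.
coco_annotations['filename'] = coco_annotations['filename'].apply(lambda x: 'coco_minitrain_25k/images/train2017/'+x)
# Drop rows that are exact duplicates of an earlier row.
coco_annotations = coco_annotations.drop_duplicates()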

What you see is the result of an API change. We will fix the notebook end to end by tomorrow and share it.
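
The three assertion messages quoted in this thread suggest a quick check that can be run on the annotation DataFrame before calling fd.run. The helper below is a hypothetical sketch and not part of fastdup: the name preflight_annotations and its messages are made up for illustration, and it only looks for fully duplicated rows (the case that drop_duplicates removes), since the thread does not say exactly what fastdup's uniqueness check compares.

```python
import pandas as pd

def preflight_annotations(df: pd.DataFrame, input_dir: str) -> list:
    """Return a list of problems found; an empty list means none were detected.

    Hypothetical helper, not a fastdup API.
    """
    problems = []
    if "filename" not in df.columns:
        problems.append("missing 'filename' column")
        return problems  # the remaining checks need this column

    # Filenames are expected to be full paths that start with the image folder.
    prefix = input_dir.rstrip("/") + "/"
    not_full = ~df["filename"].astype(str).str.startswith(prefix)
    if not_full.any():
        problems.append(f"{int(not_full.sum())} filenames do not start with {prefix}")

    # Rows that repeat an earlier row exactly (same file, same box, same label).
    n_dup = int(df.duplicated().sum())
    if n_dup:
        problems.append(f"{n_dup} fully duplicated rows")
    return problems

image_dir = "coco_minitrain_25k/images/train2017"
raw = pd.DataFrame({"filename": ["000000131075.jpg", "000000131075.jpg"],
                    "label": ["tv", "tv"]})
print(preflight_annotations(raw, image_dir))

# Apply the same two fixes as in the snippet above, then check again.
fixed = raw.copy()
fixed["filename"] = image_dir + "/" + fixed["filename"]
fixed = fixed.drop_duplicates()
print(preflight_annotations(fixed, image_dir))
```

The first call reports two problems (bare filenames and one duplicated row); after the fixes the list is empty.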

dbickson commented 1 year ago

Hi @Lifeguard-alex, a new fix has been released in version 1.3. The fixed notebook is here: https://github.com/visual-layer/fastdup/blob/main/examples/analyzing-object-detection-dataset.ipynb Please try it out and let us know if you have any issues.

Ramayancv commented 6 months ago

@dbickson What is the proposed solution for AssertionError: df_annot must contain unique filenames, found repeating filenames? I see that even in the COCO example in your notebook, two rows have the same filename.

0 images/train2017/000000131075.jpg 20.23 55.98 313.49 326.50 tv 0 train
1 images/train2017/000000131075.jpg 176.90 381.12 286.20 136.63 laptop 0 train

dnth commented 6 months ago

@Ramayancv I ran the analyzing-object-detection-dataset.ipynb notebook (on Colab) and could not reproduce the error you get. Which version of fastdup are you running? And which Python and OS?

Ramayancv commented 5 months ago

@dnth I am using fastdup version 1.65, Python 3.7.6, and Linux x86_64. I noticed that the notebook runs smoothly on that dataset, but if you run it on any other dataset, you will see the error.

dnth commented 5 months ago

@Ramayancv can you point me to a dataset so I can reproduce the error?

Ramayancv commented 5 months ago

@dnth Can you use these three images and this annotation file?

Annotations: annots.csv

Images: cp.zip

dnth commented 5 months ago

@Ramayancv the issue is in a column name of your annots.csv:

[image]

If you rename the column to row_y it should work.

[image]

But I found something else that is amiss: the width and height of the bounding boxes look suspiciously small. Are those values correct? Or are they normalized?
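
One way to answer the "normalized or not" question is a simple range check on the box columns. The sketch below is an illustration only: looks_normalized and to_pixels are hypothetical helpers, the column names follow the fixed example earlier in this thread, and the single shared image size (640 x 480) is an assumption. The thread does not state which units fastdup expects beyond the pixel-valued COCO example.

```python
import pandas as pd

BOX_COLS = ["col_x", "row_y", "width", "height"]

def looks_normalized(df: pd.DataFrame) -> bool:
    """Heuristic: True when every box value lies in [0, 1].

    Boxes that really are at most one pixel in size would also pass,
    so treat the result as a hint rather than proof.
    """
    values = df[BOX_COLS]
    return bool(((values >= 0) & (values <= 1)).all().all())

def to_pixels(df: pd.DataFrame, img_w: int, img_h: int) -> pd.DataFrame:
    """Scale normalized boxes to pixels, assuming all images share one size."""
    out = df.copy()
    out[["col_x", "width"]] = out[["col_x", "width"]] * img_w
    out[["row_y", "height"]] = out[["row_y", "height"]] * img_h
    return out

# Made-up box for illustration; not taken from annots.csv.
boxes = pd.DataFrame({"col_x": [0.25], "row_y": [0.5],
                      "width": [0.5], "height": [0.25]})
if looks_normalized(boxes):
    boxes = to_pixels(boxes, img_w=640, img_h=480)
print(boxes)
```

If the images differ in size, the scaling would need each image's own width and height instead of one shared pair.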

Ramayancv commented 5 months ago

@dnth Thank you very much. It worked.

Another question, about the part of your notebook that identifies possible mismatches:

1) Where can I find the DataFrame that is shown at the end of the notebook, the one that shows mismatched values using the similarity gallery?

fd.vis.similarity_gallery(slice='diff')

2) Your notebook contains the following text:

"The fastdup similarity search and similarity gallery are strong tools for finding objects that are possibly mislabeled. By finding each object's nearest neighbors and their classes, we can find objects with classes contradicting their neighbors' (a strong sign of mislabels).

Running the similarity gallery shows if an image has high similarity with two of its closest neighbors, yet has different labels. This helps surface potential mislabeling in the dataset."

Does it create image embeddings for all the areas inside the bounding boxes and then compare those embeddings?

dnth commented 4 months ago

The DataFrame is returned together with the gallery for now. So you can do

df = fd.vis.similarity_gallery(slice='diff')

to get the DataFrame.