rom1504 / img2dataset

Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
MIT License
3.69k stars 338 forks

Duplicate images in ms coco #244

Open tungdop2 opened 1 year ago

tungdop2 commented 1 year ago

In MSCOCO or Visual Genome, an image has more than one caption, so its URL appears in several rows and img2dataset downloads it 3 or 4 times. How can I avoid this?

img2dataset --url_list source/mscoco.parquet --input_format "parquet" \
    --url_col "URL" --caption_col "TEXT" --output_format files \
    --output_folder source/mscoco --processes_count 8 --thread_count 16 --image_size 224 \
    --enable_wandb False

rom1504 commented 1 year ago

Hi, you could join the captions for each image into a single column, then use the save_additional_columns option to store them in the JSON file saved next to each image.
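
A minimal sketch of that preprocessing step, assuming pandas is available and the parquet file uses the "URL" and "TEXT" columns from the command above. The helper name merge_captions, the extra column name "ALL_TEXTS", and the " || " separator are hypothetical choices for illustration, not part of img2dataset:

```python
import pandas as pd


def merge_captions(
    df: pd.DataFrame,
    url_col: str = "URL",
    caption_col: str = "TEXT",
    all_col: str = "ALL_TEXTS",  # hypothetical name for the joined-captions column
    sep: str = " || ",           # hypothetical separator; pick one absent from the captions
) -> pd.DataFrame:
    """Collapse rows that share a URL into one row per URL.

    The first caption seen for a URL stays in `caption_col`; every caption
    for that URL is joined into `all_col`.
    """
    # sort=False keeps URLs in order of first appearance.
    grouped = (
        df.groupby(url_col, sort=False)[caption_col]
        .agg(list)
        .reset_index()
    )
    # Join all captions into one string per URL.
    grouped[all_col] = grouped[caption_col].map(
        lambda caps: sep.join(str(c) for c in caps)
    )
    # Keep only the first caption as the primary one.
    grouped[caption_col] = grouped[caption_col].map(lambda caps: caps[0])
    return grouped


# Small in-memory demonstration: two rows share the same URL.
demo = pd.DataFrame(
    {
        "URL": ["http://example.com/a.jpg", "http://example.com/a.jpg", "http://example.com/b.jpg"],
        "TEXT": ["a cat", "a sleeping cat", "a dog"],
    }
)
merged_demo = merge_captions(demo)
print(merged_demo)
```

To apply it to the real file, one could load it with pd.read_parquet, call merge_captions, write the result with to_parquet, and point --url_list at the new file while naming the joined column in save_additional_columns. Each image would then be downloaded once, with all of its captions kept in the metadata.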

tungdop2 commented 1 year ago

@rom1504 thanks for your reply. So downloading each image multiple times is the default behavior for MSCOCO?

rom1504 commented 1 year ago

@tungdop2 it seems the problem is in this metadata parquet file itself. Do you want to fix it and upload a better version to Hugging Face?