rom1504 / img2dataset

Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
MIT License

Very low speed on sbucaptions and mscoco datasets #252

Open KohakuBlueleaf opened 1 year ago

KohakuBlueleaf commented 1 year ago

I want to download the mscoco and sbucaptions datasets. I just followed the instructions and used the metadata mentioned in examples/, but I get very poor performance:

Network bandwidth: 1Gbps, consumed: 20~50Mbps
CPU utilization: below 30%
SSD utilization: below 10%
Fail rate: over 0.5 (this fail rate doesn't make sense even though sbucaptions has 10~20% of files returning 404/410)

I use this command to download sbucaptions:

img2dataset --url_list ./sbu-captions-all.json --input_format "json" --url_col "image_urls"\
 --caption_col "captions" --output_format webdataset --encode_format webp\
 --output_folder sbucaptions --processes_count 16 --thread_count 512 --image_size 256 --enable_wandb True

I have tried this on Windows, WSL2 (with knot-resolver), and Ubuntu (with knot-resolver); all three give the same result (the Ubuntu run has a lower fail rate but also a lower speed).

I also tried writing my own download script for sbucaptions and it works well: it can download 860k files and resize them (border mode) within 1 hour (under Windows). So the hardware and environment should be fine (or at least workable).

rom1504 commented 1 year ago

where is ./sbu-captions-all.json downloaded from ?

thread_count 512 is way too high, try decreasing it to something like 64

rom1504 commented 1 year ago

https://www.cs.rice.edu/~vo9/sbucaptions/SBUCaptionedPhotoDataset.tar.gz ?

rom1504 commented 1 year ago

or https://www.cs.rice.edu/~vo9/sbucaptions/sbu-captions-all.tar.gz maybe ?

rom1504 commented 1 year ago

ok so one clear problem with sbu captions is that all the images are from flickr. That's an issue, as downloading from a single domain will get you rate limited
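
A quick way to check this is to count how many URLs in the metadata point at each domain. Here is a minimal sketch (not from the thread; it assumes sbu-captions-all.json is a JSON object with a top-level "image_urls" list, as the --url_col option above implies):

import json
from collections import Counter
from urllib.parse import urlparse

# Count how many of the SBU image URLs point at each domain; if one host
# (e.g. flickr) dominates, per-domain rate limiting will cap throughput.
with open("sbu-captions-all.json") as f:
    data = json.load(f)

domains = Counter(urlparse(url).netloc for url in data["image_urls"])
for domain, count in domains.most_common(5):
    print(domain, count)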

rom1504 commented 1 year ago

nevertheless trying to run this:

img2dataset --url_list /media/hd2/subcaptions/sbu-captions-all.json --input_format "json" --url_col "image_urls"\
 --caption_col "captions" --output_format webdataset --encode_format webp\
 --output_folder /media/hd/testing/tmp_test --processes_count 16 --thread_count 64 --image_size 256 --enable_wandb True

rom1504 commented 1 year ago

here is my wandb run for it: https://wandb.ai/rom1504/img2dataset/runs/2nhepsmf

Doing 1000 images/s with 16 cores

success rate 84%

rom1504 commented 1 year ago

added my benchmark at https://github.com/rom1504/img2dataset/commit/f14a04059f112e66f28b32fb856f3b8b6e2838d7

rom1504 commented 1 year ago

@KohakuBlueleaf Could you please check that knot-resolver is truly installed and active (make sure that systemd-resolved is not running and that knot's CPU usage is > 0)?

KohakuBlueleaf commented 1 year ago

@rom1504 I tried your command and checked htop. There is no systemd-resolved running and knot-resolver's CPU usage is 0.6~1.2%. But the bandwidth is only 40~50Mbps (to be accurate, I check the bandwidth consumption from the router). Total CPU usage is below 30% too. The downloader/resizer processes have the highest CPU usage.

no idea why QQ

KohakuBlueleaf commented 1 year ago

And with img2dataset running, other devices on the same network cannot use the network. But if I use my own script, everything works normally (at least the network is not blocked).

rom1504 commented 1 year ago

that's surprising, I'm not sure what the difference between my environment and yours could be

can you share your script ?

KohakuBlueleaf commented 1 year ago

@rom1504 here it is: https://gist.github.com/KohakuBlueleaf/420bb7febecd955aee07380024eef4c0

It's a very simple script

KohakuBlueleaf commented 1 year ago

It uses aiohttp to download all the images (so actually in 1 process and 1 thread). When I have over 512 images, I send them to a ProcessPoolExecutor to resize. After downloading 10000 urls, it saves all the images to a .tar (webdataset).
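
For reference, here is a minimal sketch of that approach (illustrative only, not the linked gist; file names, batch sizes, and the resize strategy are assumptions): aiohttp downloads everything in one event loop, batches of raw images go to a ProcessPoolExecutor for resizing, and the results are periodically written to a webdataset-style .tar.

import asyncio
import io
import tarfile
from concurrent.futures import ProcessPoolExecutor

import aiohttp
from PIL import Image


def resize_image(item):
    # Decode, shrink to fit within 256x256, and re-encode as webp (runs in a worker process).
    key, raw = item
    try:
        img = Image.open(io.BytesIO(raw)).convert("RGB")
        img.thumbnail((256, 256))
        buf = io.BytesIO()
        img.save(buf, format="WEBP")
        return key, buf.getvalue()
    except Exception:
        return key, None


async def fetch(session, key, url):
    # Download one image; return (key, bytes) or (key, None) on any failure.
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
            if resp.status == 200:
                return key, await resp.read()
    except Exception:
        pass
    return key, None


def write_shard(path, items):
    # Write (key, webp bytes) pairs into a webdataset-style tar shard.
    with tarfile.open(path, "w") as tar:
        for key, data in items:
            if data is None:
                continue
            info = tarfile.TarInfo(name=f"{key}.webp")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))


async def main(urls, shard_size=10000, batch_size=512):
    done, shard_id = [], 0
    with ProcessPoolExecutor() as pool:
        async with aiohttp.ClientSession() as session:
            for start in range(0, len(urls), batch_size):
                batch = urls[start:start + batch_size]
                raw = await asyncio.gather(
                    *(fetch(session, start + i, u) for i, u in enumerate(batch))
                )
                # Resize the successfully downloaded batch in worker processes.
                done.extend(pool.map(resize_image, [r for r in raw if r[1] is not None]))
                if len(done) >= shard_size:
                    write_shard(f"{shard_id:05d}.tar", done)
                    done, shard_id = [], shard_id + 1
    if done:
        write_shard(f"{shard_id:05d}.tar", done)


if __name__ == "__main__":
    # Hypothetical input file: one image url per line.
    with open("urls.txt") as f:
        asyncio.run(main([line.strip() for line in f if line.strip()]))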