rom1504 / img2dataset

Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
MIT License
3.71k stars, 338 forks

Failed to download all of CC12M #242

Closed by AlaaKhaddaj 1 year ago

AlaaKhaddaj commented 1 year ago

I have been trying to download CC12M using the instructions you provided, but the download does not complete.

I am getting the following error for a significant number of iterations:

total - success: 0.792 - failed to download: 0.195 - failed to resize: 0.013 - images per sec: 485 - count: 12423374

After the code is done, I end up with 1243 tar files. How can I solve this to get the full CC12M dataset?

rom1504 commented 1 year ago

Hey, did you set up Knot Resolver for DNS resolution? This is really important: without it you can overload your DNS, which leads to a low success rate.


Reveyer commented 1 year ago

@rom1504 Hi, is there any way to re-download only the failed data?

learnerynwei commented 1 year ago

+1, is there a method for this?

rom1504 commented 1 year ago

You can read the output parquet files, select only the samples with a failed status, and write them out as a new parquet file (pandas or Spark both work). Then rerun img2dataset on that file.
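A minimal pandas sketch of that filtering step, in case it helps. It assumes the per-shard metadata parquet files sit directly in the output folder and contain `url`, `caption`, and `status` columns, with `status` equal to `"success"` for downloaded samples. The function names and paths are illustrative and not part of img2dataset.

```python
from pathlib import Path

import pandas as pd


def select_failed(metadata: pd.DataFrame, status_col: str = "status") -> pd.DataFrame:
    """Keep only rows whose status is anything other than "success".

    Assumes the column names described above; only the input columns
    needed for a rerun (url, caption) are kept when they are present.
    """
    failed = metadata[metadata[status_col] != "success"]
    keep = [col for col in ("url", "caption") if col in failed.columns]
    return failed[keep].reset_index(drop=True)


def collect_failed(output_dir: str, retry_path: str) -> int:
    """Gather failed samples from every shard's parquet file into one file.

    Returns the number of failed samples written to retry_path.
    """
    shards = sorted(Path(output_dir).glob("*.parquet"))
    if not shards:
        raise FileNotFoundError(f"no parquet files found in {output_dir}")
    # Read every shard's metadata and stack them into a single table.
    metadata = pd.concat((pd.read_parquet(p) for p in shards), ignore_index=True)
    failed = select_failed(metadata)
    failed.to_parquet(retry_path, index=False)
    return len(failed)
```

Calling `collect_failed("cc12m", "cc12m_failed.parquet")` (hypothetical paths) would produce a file that can be passed back to img2dataset with `--url_list cc12m_failed.parquet --input_format parquet --url_col url --caption_col caption`. Writing the rerun to a separate output folder is probably the safer choice, so the new shards cannot collide with the ones already downloaded. URLs that failed because the image no longer exists will fail again, so a rerun is unlikely to reach 100%.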
