rom1504 / img2dataset

Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
MIT License
3.71k stars, 338 forks

many write error while using oss (s3-like) remote bucket storage #327

Open ldfandian opened 1 year ago

ldfandian commented 1 year ago

What could be the cause here, and how should I deal with it?

(BTW, it's downloading 15M+ images.)

img2dataset --url_list=oss://aigc-models-training/data/raw/crawler/ --output_folder=simplecache::oss://aigc-models-training/data/raw/img2dataset/ --processes_count=4 --thread_count=4 --resize_mode=no --output_format=webdataset --input_format=csv --url_col=url --caption_col=title --save_additional_columns="[detail]" --retries=5 --enable_wandb=True --disallowed_header_directives="[]"

...

0it [00:00, ?it/s]

error code 
retrying to write to file due to error: [Errno 5] {'status': -2, 'x-oss-request-id': '', 'details': "RequestError: ('Connection aborted.', timeout('timed out'))"}
error code 
retrying to write to file due to error: [Errno 5] {'status': -2, 'x-oss-request-id': '', 'details': "RequestError: ('Connection aborted.', timeout('timed out'))"}
error code 
retrying to write to file due to error: [Errno 5] {'status': -2, 'x-oss-request-id': '', 'details': "RequestError: ('Connection aborted.', timeout('timed out'))"}
ldfandian commented 1 year ago

Also, the perf looks extremely slow (considerably slower than local disk)... Is this expected?

rom1504 commented 1 year ago

This is not expected. What are the characteristics of your storage in terms of latency and bandwidth? Can you test whether fsspec works well with it?
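One rough way to run that check is to time a single write and read-back through fsspec, outside img2dataset. The snippet below is only a sketch: `measure_write_read` and its `opener` parameter are made-up names for illustration, and it assumes the `oss://` protocol is available to fsspec in your environment (presumably through the `ossfs` package, with credentials already configured).

```python
import os
import time


def measure_write_read(url, size_bytes=1_000_000, opener=None):
    """Write random bytes to `url`, read them back, and report timings.

    `opener` is a hypothetical hook: any callable with the signature
    opener(url, mode) that returns a context manager yielding a file object.
    If it is omitted, fsspec.open is used (third-party dependency).
    """
    if opener is None:
        import fsspec  # imported lazily so the helper can be tried without it
        opener = fsspec.open

    payload = os.urandom(size_bytes)

    # Time the write; leaving the `with` block closes the file, which is
    # when buffered or cached writers actually upload the data.
    start = time.perf_counter()
    with opener(url, "wb") as f:
        f.write(payload)
    write_s = time.perf_counter() - start

    # Time the read-back of the same object.
    start = time.perf_counter()
    with opener(url, "rb") as f:
        data = f.read()
    read_s = time.perf_counter() - start

    if data != payload:
        raise IOError("read-back data does not match what was written")

    mb = size_bytes / 1e6
    return {
        "bytes": size_bytes,
        "write_s": write_s,
        "read_s": read_s,
        "write_mb_s": mb / max(write_s, 1e-9),
        "read_mb_s": mb / max(read_s, 1e-9),
    }
```

Calling it with a throwaway object path in your bucket (for example a placeholder like `oss://your-bucket/tmp/fsspec_probe.bin`, not a real path from this thread) and comparing the result with the same call on a local path should show whether the storage layer itself is slow or timing out.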

On Sun, Jul 2, 2023, 08:18 Dian FAN wrote:

Also, the perf looks extremely slow (considerably slower than local disk)... Is this expected?

ldfandian commented 1 year ago

Thanks for the quick response. I guess it is my bad... I used a small EC2 box (4 vCPUs, 8 GB memory), and I guess the network bandwidth or CPU was exhausted by the requests.

Never mind. What's the recommended machine configuration for downloading a large dataset like laion2b-int or the wukong dataset? What's the best practice for sizing CPU, memory, and network bandwidth relative to the value of processes_count * thread_count?
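For a back-of-envelope feel for how processes_count * thread_count relates to total download time, here is a small sketch based on Little's law (throughput is roughly the number of in-flight requests divided by the average time per request). The function names and the 0.5 s average time per image are illustrative assumptions, not measurements or img2dataset internals.

```python
def estimate_download_hours(num_urls, processes_count, thread_count, avg_seconds_per_image):
    """Rough total time, assuming every thread always has a request in flight."""
    in_flight = processes_count * thread_count
    images_per_second = in_flight / avg_seconds_per_image
    return num_urls / images_per_second / 3600


def urls_per_second(num_urls, hours):
    """Average rate implied by downloading `num_urls` in `hours`."""
    return num_urls / (hours * 3600)


# Settings from the command above (4 processes x 4 threads), 15M URLs,
# with an ASSUMED average of 0.5 s per image: about 130 hours.
hours_small_box = estimate_download_hours(15_000_000, 4, 4, 0.5)

# Rate implied by the README headline (100M URLs in 20 h): about 1389 URLs/s.
readme_rate = urls_per_second(100_000_000, 20)
```

Under these assumptions, 16 concurrent requests fall far short of the rate implied by the README headline, so concurrency (and the CPU and bandwidth needed to sustain it) would have to grow by roughly two orders of magnitude to approach it.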