rom1504 / img2dataset

Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
MIT License
3.62k stars 336 forks

The download process goes on forever #343

Open novruzgurbanov opened 1 year ago

novruzgurbanov commented 1 year ago

Hi! After downloading the files from laion2b-en with these parameters:

from img2dataset import download

download(
    processes_count=32,
    url_list=parquet_file,
    resize_mode='no',
    output_folder=output_dir,
    output_format='webdataset',  # write the output as webdataset tar shards
    input_format='parquet',
    url_col="URL",
    caption_col="TEXT",
    number_sample_per_shard=50000,
    distributor='multiprocessing',
)

All files seem to be downloaded (I think), but then the last iteration goes on forever and I have to stop it manually. Could you look at this please?

P.S. I tried this function a month ago, and it worked seamlessly. But now, no matter what I do and no matter how simple the parameters are, it gets stuck.

rom1504 commented 1 year ago

There is another issue about this that has been open for a year, which I can't reproduce.

If you can figure out in which environment it happens, it would help.

(In reply to the message from Gurbanov Novruz of Fri, Sep 1, 2023, 09:22.)

novruzgurbanov commented 1 year ago

@rom1504 I am running the download inside a Docker container. A month ago, in the same container, it worked seamlessly. But now, I don't know why, it never stops. I am not an expert on Docker images, but if it is possible, maybe I can send you the image so you can run a container and try to download some files? (img2dataset is already installed.)

rom1504 commented 1 year ago

I think it would be useful if you could try to figure out which specific Docker configs work and which ones don't.

novruzgurbanov commented 1 year ago

@rom1504 Sorry, I didn't quite get what you mean. If the container is the same and the image is the same, what other configs should I check? If you have a suggestion for what to check, I would appreciate it!

rom1504 commented 1 year ago

You can check any other environment that works and then try to compare.

Maybe you changed the host if not the container?
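
One way to make such a comparison concrete is to dump the same facts from the environment that works and the one that hangs, then diff the two outputs. A minimal sketch, assuming Python 3.8+; the helper name `collect_env_info` and the default package list are illustrative choices, not part of img2dataset:

```python
import json
import os
import platform
import sys
from importlib import metadata


def collect_env_info(packages=("img2dataset", "pyarrow", "webdataset", "fsspec")):
    """Gather interpreter, OS and package versions for side-by-side comparison."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "cpu_count": os.cpu_count(),
        "packages": versions,
    }


if __name__ == "__main__":
    # Run once per environment (host and container), then diff the outputs.
    print(json.dumps(collect_env_info(), indent=2, sort_keys=True))
```

Running it both on the host and inside the container should show whether a package version, the kernel, or the visible CPU count changed between the two runs.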

novruzgurbanov commented 1 year ago

@rom1504 Interesting... I downloaded the files with number_sample_per_shard set to 10K, and the download and the process finished on time. I guess the function, or something else, cannot handle more samples per shard.
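
For reference, the run that finished differs from the original call only in the shard size. A sketch of that change, reusing the parameters from the first post; `build_download_kwargs` is a hypothetical helper written for this example, and `parquet_file` / `output_dir` stand in for the real paths. It only records the workaround and does not explain why the larger shards hang.

```python
def build_download_kwargs(parquet_file, output_dir, samples_per_shard=10_000):
    """Return the arguments from the original post with a smaller shard size."""
    return dict(
        processes_count=32,
        url_list=parquet_file,
        resize_mode="no",
        output_folder=output_dir,
        output_format="webdataset",
        input_format="parquet",
        url_col="URL",
        caption_col="TEXT",
        number_sample_per_shard=samples_per_shard,  # was 50000 in the hanging run
        distributor="multiprocessing",
    )


# Usage (requires img2dataset to be installed):
#   from img2dataset import download
#   download(**build_download_kwargs(parquet_file, output_dir))
```

Passing `samples_per_shard=50000` restores the original configuration, which makes it easy to switch between the two settings when trying to reproduce the hang.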