rom1504 / img2dataset

Easily turn large sets of image urls into an image dataset. Can download, resize and package 100M urls in 20h on one machine.
MIT License
3.76k stars 341 forks

Download stall at the end #164

Closed xiankgx closed 2 years ago

xiankgx commented 2 years ago

I'm trying to download the CC3M dataset on an AWS SageMaker Notebook instance. I first did `pip install img2dataset`, then fired up a terminal and ran:

img2dataset --url_list cc3m.tsv --input_format "tsv" \
    --url_col "url" --caption_col "caption" --output_format webdataset \
    --output_folder cc3m --processes_count 16 --thread_count 64 --resize_mode no \
    --enable_wandb False

The code runs and downloads, but stalls towards the end. I first tried terminating it by restarting the instance; as a result, some .tar files raise an "unexpected end of file" read error when used for training. On a second run I terminated it with Ctrl-C instead, which resulted in the same read error when using the tar files for training. The difference between the two termination methods is that the latter seemed to do some cleanup, which removed the "_tmp" folder inside the download folder.
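For what it's worth, truncated shards can be found before training with a small script. This is just a sketch using the standard library, assuming the shards sit in a flat folder such as `cc3m` from the command above:

```python
# Sketch: find .tar shards that cannot be read to the end (truncated output).
# Assumes the webdataset shards live in a flat folder such as "cc3m".
import tarfile
from pathlib import Path

def broken_shards(folder):
    """Return paths of .tar files in `folder` that raise on a full read."""
    bad = []
    for tar_path in sorted(Path(folder).glob("*.tar")):
        try:
            with tarfile.open(tar_path) as tf:
                for member in tf:
                    f = tf.extractfile(member)
                    if f is not None:
                        f.read()  # forces a full read; truncation raises ReadError
        except tarfile.TarError:
            bad.append(tar_path)
    return bad

if __name__ == "__main__":
    for path in broken_shards("cc3m"):
        print(f"truncated shard: {path}")
```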

xiankgx commented 2 years ago

I remember having one successful termination with Ctrl-C, but with the CLI argument --image_size 512 instead of --resize_mode no; I can't be sure, though. The default command suggested at https://github.com/rom1504/img2dataset/blob/main/dataset_examples/cc3m.md creates padding around some images, which makes it not ideal for image generation.

rom1504 commented 2 years ago

How's your resource usage (CPU, network, disk) when it's stuck? Did you set up a knot resolver? https://github.com/rom1504/img2dataset#setting-up-a-knot-resolver

You may choose the keep_ratio mode if you want to resize without borders.

rom1504 commented 2 years ago

Also, you want to choose the process count to match your number of cores.
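One hedged way to do that (assuming a Linux box with GNU coreutils; the img2dataset invocation is the one from above, shown commented out):

```shell
# Pick processes_count from the actual core count instead of hard-coding 16.
# (nproc is from GNU coreutils; on macOS use `sysctl -n hw.ncpu` instead.)
CORES="$(nproc)"
echo "detected ${CORES} cores"
# img2dataset --url_list cc3m.tsv --input_format "tsv" \
#     --url_col "url" --caption_col "caption" --output_format webdataset \
#     --output_folder cc3m --processes_count "$CORES" --thread_count 64 \
#     --resize_mode no --enable_wandb False
```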

xiankgx commented 2 years ago

I did not set up a knot resolver. Regarding matching the process count to the number of cores, I think the instance should have matching specs. Also, the weird thing is that every run seems to stall only after it has downloaded everything it can. For both tries, the number of images downloaded was the same: both cc3m runs stopped at around 2.7 million plus. I also had one run for cc12m, which stalled after 11.5 million.

xiankgx commented 2 years ago

Also, by "stalled" I mean that progress seems to have stopped and stdout keeps showing the same number of images done. I believe the program is still running, as I am able to terminate it with Ctrl-C.

xiankgx commented 2 years ago

When this happens, the timestamps of the tar files in the download folder can be well in the past, e.g. the most recently modified file being a few hours old.
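A quick way to check that from a Python prompt (a standard-library sketch; `cc3m` is an assumed folder name):

```python
# Sketch: report the newest .tar shard in the output folder and its age,
# to confirm that no shard has been written for a long time.
import time
from pathlib import Path

def newest_shard_age(folder):
    """Return (path, age_in_seconds) of the most recently modified .tar, or None."""
    tars = list(Path(folder).glob("*.tar"))
    if not tars:
        return None
    newest = max(tars, key=lambda p: p.stat().st_mtime)
    return newest, time.time() - newest.stat().st_mtime

if __name__ == "__main__":
    result = newest_shard_age("cc3m")
    if result:
        path, age = result
        print(f"{path} last modified {age / 3600:.1f} hours ago")
```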

rom1504 commented 2 years ago

I see. This seems to be the same thing as https://github.com/rom1504/img2dataset/issues/74, which I wasn't able to reproduce in my environment. So it sounds like everything worked successfully and the process just won't stop.

I'd be interested in being able to reproduce why it's getting stuck.

However, in practice the output should be OK if you Ctrl-C.

Do you see anything wrong with the output? Which tar files are broken? (How do you observe it / what error?)

Are you using webdataset for loading the output?

xiankgx commented 2 years ago

I'm using webdataset to load the files for training DALL-E 2 models. Training stopped with a dataloader process complaining about an abrupt end of file in a tar file.

rom1504 commented 2 years ago

Can you share the errors ?

And can you use a loader with error handling like this https://github.com/rom1504/laion-prepro/blob/main/laion5B/usage_guide/dataloader_pytorch.py ?
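The core idea of that loader, skipping unreadable samples instead of crashing, can be sketched with the standard library alone (this is an illustration, not the linked implementation; webdataset's own `handler=` callbacks do the equivalent at each pipeline stage):

```python
# Illustration of error-tolerant shard reading: yield (name, bytes) samples
# and skip the remainder of any shard that raises, instead of crashing.
import tarfile

def iter_samples(shard_paths):
    for path in shard_paths:
        try:
            with tarfile.open(path) as tf:
                for member in tf:
                    f = tf.extractfile(member)
                    if f is None:  # directories, links, etc.
                        continue
                    yield member.name, f.read()
        except tarfile.TarError as exn:
            print(f"skipping rest of {path}: {exn}")
```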

xiankgx commented 2 years ago

> Can you share the errors ?
>
> And can you use a loader with error handling like this https://github.com/rom1504/laion-prepro/blob/main/laion5B/usage_guide/dataloader_pytorch.py ?

Epoch 0: : 70it [01:10,  1.01s/it, loss=0.28, v_num=0, image_embed_mse=0.505, text_pred_image_cos_sim=0.00777, text_pred_image_acc=0.0156]
Traceback (most recent call last):
  File "train_prior.py", line 126, in <module>
    main(args)
  File "train_prior.py", line 85, in main
    trainer.fit(model, datamodule=dm)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 741, in fit
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
    self._dispatch()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
    self.training_type_plugin.start_training(self)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
    return self._run_train()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1319, in _run_train
    self.fit_loop.run()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 234, in advance
    self.epoch_loop.run(data_fetcher)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 156, in advance
    batch_idx, (batch, self.batch_progress.is_last_batch) = next(self._dataloader_iter)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/fetching.py", line 203, in __next__
    return self.fetching_function()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/fetching.py", line 270, in fetching_function
    self._fetch_next_batch()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/fetching.py", line 300, in _fetch_next_batch
    batch = next(self.dataloader_iter)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/supporters.py", line 550, in __next__
    return self.request_next_batch(self.loader_iters)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/supporters.py", line 562, in request_next_batch
    return apply_to_collection(loader_iters, Iterator, next)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/apply_func.py", line 96, in apply_to_collection
    return function(data, *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1183, in _next_data
    return self._process_data(data)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/opt/conda/lib/python3.7/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
tarfile.ReadError: Caught ReadError in DataLoader worker process 35.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 39, in fetch
    data = next(self.dataset_iter)
  File "/opt/conda/lib/python3.7/site-packages/webdataset/pipeline.py", line 68, in iterator
    for sample in self.iterator1():
  File "/opt/conda/lib/python3.7/site-packages/webdataset/filters.py", line 478, in _batched
    for sample in data:
  File "/opt/conda/lib/python3.7/site-packages/webdataset/filters.py", line 411, in _map_tuple
    for sample in data:
  File "/opt/conda/lib/python3.7/site-packages/webdataset/filters.py", line 388, in _to_tuple
    for sample in data:
  File "/opt/conda/lib/python3.7/site-packages/webdataset/filters.py", line 293, in _map
    for sample in data:
  File "/opt/conda/lib/python3.7/site-packages/webdataset/filters.py", line 204, in _shuffle
    for sample in data:
  File "/opt/conda/lib/python3.7/site-packages/webdataset/tariterators.py", line 152, in group_by_keys
    for filesample in data:
  File "/opt/conda/lib/python3.7/site-packages/webdataset/tariterators.py", line 139, in tar_file_expander
    if handler(exn):
  File "/opt/conda/lib/python3.7/site-packages/webdataset/filters.py", line 76, in reraise_exception
    raise exn
  File "/opt/conda/lib/python3.7/site-packages/webdataset/tariterators.py", line 131, in tar_file_expander
    for sample in tar_file_iterator(source["stream"]):
  File "/opt/conda/lib/python3.7/site-packages/webdataset/tariterators.py", line 114, in tar_file_iterator
    if handler(exn):
  File "/opt/conda/lib/python3.7/site-packages/webdataset/handlers.py", line 23, in reraise_exception
    raise exn
  File "/opt/conda/lib/python3.7/site-packages/webdataset/tariterators.py", line 107, in tar_file_iterator
    data = stream.extractfile(tarinfo).read()
  File "/opt/conda/lib/python3.7/tarfile.py", line 697, in read
    raise ReadError("unexpected end of data")
tarfile.ReadError: ("unexpected end of data @ <_io.BufferedReader name='../datasets/cc3m/00035.tar'>", <_io.BufferedReader name='../datasets/cc3m/00035.tar'>, '../datasets/cc3m/00035.tar')

Exception ignored in: <function tqdm.__del__ at 0x7fb50d16ee60>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/tqdm/std.py", line 1152, in __del__
  File "/opt/conda/lib/python3.7/site-packages/tqdm/std.py", line 1306, in close
  File "/opt/conda/lib/python3.7/site-packages/tqdm/std.py", line 1499, in display
  File "/opt/conda/lib/python3.7/site-packages/tqdm/std.py", line 1155, in __str__
  File "/opt/conda/lib/python3.7/site-packages/tqdm/std.py", line 1457, in format_dict
TypeError: cannot unpack non-iterable NoneType object
xiankgx commented 2 years ago

Just an update, I was able to complete one round of cc3m download with the following parameters:

img2dataset --url_list cc3m.tsv --input_format "tsv" \
    --url_col "url" --caption_col "caption" --output_format webdataset \
    --output_folder cc3m_1024 --processes_count 16 --thread_count 64 --image_size 1024 --resize_mode keep_ratio \
    --enable_wandb False
xiankgx commented 2 years ago

I'm now rerunning the above with --resize_mode no to see if it is the culprit.

rom1504 commented 2 years ago

This should now be solved thanks to the retrying feature; please update and try again.