Closed — xiankgx closed this issue 2 years ago
I remember having one successful termination with Ctrl-C, but that run used the CLI argument --image_size 512 instead of --resize_mode no, so I can't be sure. The default command suggested at https://github.com/rom1504/img2dataset/blob/main/dataset_examples/cc3m.md adds padding around some images, which makes them less than ideal for image generation.
How's your resource usage when it's stuck (CPU, network, disk)? Did you set up a knot resolver (https://github.com/rom1504/img2dataset#setting-up-a-knot-resolver)?
You can choose the keep_ratio resize mode if you want to resize without borders.
Also, you should set the process count to roughly the number of cores on your machine.
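As a small sketch of the advice above (assuming one download process per core is the intended default; `--processes_count` is the img2dataset flag used later in this thread):

```python
import os

# Query the machine's core count to pass as img2dataset's --processes_count.
# os.cpu_count() can return None in unusual environments, so fall back to 1.
processes_count = os.cpu_count() or 1
print(processes_count)
```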
I did not set up a knot resolver. Regarding setting the process count to the number of cores, I think the instance has matching specs. The weird thing is that every run seems to stall after it has downloaded everything it can. For both tries, the number of images downloaded was the same: both cc3m runs stopped at around 2.7 million. I also have one run for cc12m, which stalled after 11.5 million.
Also, by stalled I mean that progress seems to stop and stdout keeps showing the same number of images done. I believe the program is still running, since I am able to terminate it with Ctrl-C.
When this happens, the timestamps of the tar files in the download folder can be long in the past, e.g. the most recently modified file being a few hours old.
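That timestamp check can be automated with a short stdlib sketch (the folder path is whatever you passed as --output_folder; this is illustrative, not part of img2dataset):

```python
import glob
import os
import time

def seconds_since_last_shard(folder):
    """Return how long ago (in seconds) the newest .tar shard in `folder`
    was modified, or None if no shards exist yet. A quick way to spot a
    stalled download: a large value means nothing has been written recently."""
    tars = glob.glob(os.path.join(folder, "*.tar"))
    if not tars:
        return None
    newest = max(os.path.getmtime(p) for p in tars)
    return time.time() - newest
```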
I see. This seems to be the same thing as https://github.com/rom1504/img2dataset/issues/74, which I wasn't able to reproduce in my environment. So it sounds like everything worked successfully and the process just won't stop.
I'm interested to be able to reproduce why it's getting stuck.
However, in practice the output should be OK if you Ctrl-C.
Do you see anything wrong with the output? Which tar files are broken, and how do you observe it (what error)?
Are you using webdataset for loading the output ?
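One way to answer "which tar files are broken" is to scan the shards with the standard library and see which ones cannot be read to the end (a hedged sketch; the paths are illustrative and this is not part of img2dataset or webdataset):

```python
import tarfile

def find_broken_tars(paths):
    """Return the shard paths that raise tarfile.ReadError when fully read
    (e.g. truncated archives from a killed download). These are the shards
    that would crash a strict webdataset loader."""
    broken = []
    for path in paths:
        try:
            with tarfile.open(path) as tf:
                for member in tf:
                    if member.isfile():
                        f = tf.extractfile(member)
                        if f is not None:
                            f.read()  # forces "unexpected end of data" on truncation
        except tarfile.ReadError:
            broken.append(path)
    return broken
```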
I'm using webdataset to load the files for training DALL-E 2 models. Training stopped with a dataloader process complaining about an abrupt end of file in a tar file.
Can you share the errors ?
And can you use a loader with error handling like this https://github.com/rom1504/laion-prepro/blob/main/laion5B/usage_guide/dataloader_pytorch.py ?
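The skip-on-error idea in that dataloader can be sketched with just the standard library (a hedged illustration of the technique, not the actual laion-prepro code): iterate a shard and swallow the truncation error instead of letting it kill the DataLoader worker.

```python
import tarfile

def iter_tar_samples(path, handler=print):
    """Yield (name, bytes) pairs from one webdataset-style tar shard.
    If the archive ends abruptly (tarfile.ReadError), report it via
    `handler` and stop gracefully instead of crashing."""
    try:
        with tarfile.open(path) as tf:
            for member in tf:
                if not member.isfile():
                    continue
                f = tf.extractfile(member)
                if f is not None:
                    yield member.name, f.read()
    except tarfile.ReadError as exn:
        handler(f"skipping rest of {path}: {exn}")
```

webdataset itself supports the same pattern through its `handler=` arguments (e.g. a warn-and-continue handler), which is what the linked loader relies on.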
Epoch 0: : 70it [01:10, 1.01s/it, loss=0.28, v_num=0, image_embed_mse=0.505, text_pred_image_cos_sim=0.00777, text_pred_image_acc=0.0156]
Traceback (most recent call last):
File "train_prior.py", line 126, in <module>
main(args)
File "train_prior.py", line 85, in main
trainer.fit(model, datamodule=dm)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 741, in fit
self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
self._dispatch()
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
self.training_type_plugin.start_training(self)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
self._results = trainer.run_stage()
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
return self._run_train()
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1319, in _run_train
self.fit_loop.run()
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 145, in run
self.advance(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 234, in advance
self.epoch_loop.run(data_fetcher)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 145, in run
self.advance(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 156, in advance
batch_idx, (batch, self.batch_progress.is_last_batch) = next(self._dataloader_iter)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/fetching.py", line 203, in __next__
return self.fetching_function()
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/fetching.py", line 270, in fetching_function
self._fetch_next_batch()
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/fetching.py", line 300, in _fetch_next_batch
batch = next(self.dataloader_iter)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/supporters.py", line 550, in __next__
return self.request_next_batch(self.loader_iters)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/supporters.py", line 562, in request_next_batch
return apply_to_collection(loader_iters, Iterator, next)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/apply_func.py", line 96, in apply_to_collection
return function(data, *args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
data = self._next_data()
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1183, in _next_data
return self._process_data(data)
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
data.reraise()
File "/opt/conda/lib/python3.7/site-packages/torch/_utils.py", line 434, in reraise
raise exception
tarfile.ReadError: Caught ReadError in DataLoader worker process 35.
Original Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
data = fetcher.fetch(index)
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 39, in fetch
data = next(self.dataset_iter)
File "/opt/conda/lib/python3.7/site-packages/webdataset/pipeline.py", line 68, in iterator
for sample in self.iterator1():
File "/opt/conda/lib/python3.7/site-packages/webdataset/filters.py", line 478, in _batched
for sample in data:
File "/opt/conda/lib/python3.7/site-packages/webdataset/filters.py", line 411, in _map_tuple
for sample in data:
File "/opt/conda/lib/python3.7/site-packages/webdataset/filters.py", line 388, in _to_tuple
for sample in data:
File "/opt/conda/lib/python3.7/site-packages/webdataset/filters.py", line 293, in _map
for sample in data:
File "/opt/conda/lib/python3.7/site-packages/webdataset/filters.py", line 204, in _shuffle
for sample in data:
File "/opt/conda/lib/python3.7/site-packages/webdataset/tariterators.py", line 152, in group_by_keys
for filesample in data:
File "/opt/conda/lib/python3.7/site-packages/webdataset/tariterators.py", line 139, in tar_file_expander
if handler(exn):
File "/opt/conda/lib/python3.7/site-packages/webdataset/filters.py", line 76, in reraise_exception
raise exn
File "/opt/conda/lib/python3.7/site-packages/webdataset/tariterators.py", line 131, in tar_file_expander
for sample in tar_file_iterator(source["stream"]):
File "/opt/conda/lib/python3.7/site-packages/webdataset/tariterators.py", line 114, in tar_file_iterator
if handler(exn):
File "/opt/conda/lib/python3.7/site-packages/webdataset/handlers.py", line 23, in reraise_exception
raise exn
File "/opt/conda/lib/python3.7/site-packages/webdataset/tariterators.py", line 107, in tar_file_iterator
data = stream.extractfile(tarinfo).read()
File "/opt/conda/lib/python3.7/tarfile.py", line 697, in read
raise ReadError("unexpected end of data")
tarfile.ReadError: ("unexpected end of data @ <_io.BufferedReader name='../datasets/cc3m/00035.tar'>", <_io.BufferedReader name='../datasets/cc3m/00035.tar'>, '../datasets/cc3m/00035.tar')
Exception ignored in: <function tqdm.__del__ at 0x7fb50d16ee60>
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/tqdm/std.py", line 1152, in __del__
File "/opt/conda/lib/python3.7/site-packages/tqdm/std.py", line 1306, in close
File "/opt/conda/lib/python3.7/site-packages/tqdm/std.py", line 1499, in display
File "/opt/conda/lib/python3.7/site-packages/tqdm/std.py", line 1155, in __str__
File "/opt/conda/lib/python3.7/site-packages/tqdm/std.py", line 1457, in format_dict
TypeError: cannot unpack non-iterable NoneType object
Just an update: I was able to complete one round of the cc3m download with the following parameters:
img2dataset --url_list cc3m.tsv --input_format "tsv"\
--url_col "url" --caption_col "caption" --output_format webdataset\
--output_folder cc3m_1024 --processes_count 16 --thread_count 64 --image_size 1024 --resize_mode keep_ratio\
--enable_wandb False
I'm now rerunning the above with --resize_mode no to see if it is the culprit.
This should now be solved thanks to the retrying feature; please update and try again.
I'm trying to download the CC3M dataset on an AWS SageMaker Notebook instance. I first do pip install img2dataset, then fire up a terminal and run the download command.
The code runs and downloads but stalls towards the end. I first tried terminating by restarting the instance; as a result, some .tar files give the read error "unexpected end of file" when used for training. On a second run I terminated with Ctrl-C, which resulted in the same read error when using the tar files for training. The difference between the two termination methods is that the latter seemed to do some cleanup, removing the "_tmp" folder inside the download folder.