AntreasAntoniou opened this issue 1 year ago
Hi @AntreasAntoniou, sorry to hear you are facing this issue. To help debug it, could you tell me:
I'm cc-ing @lhoestq, who might have some insights from a datasets perspective.
One trick that can also help is to check the traceback when you kill your python process: it will show where in the code it was hanging
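One way to do that without killing the process (just a sketch using the Python standard library, not something from this thread) is to register faulthandler so you can dump all thread stacks on demand:
# Sketch: dump every thread's stack without killing the process.
# Add this near the top of the upload script, then run `kill -USR1 <pid>`
# from another shell whenever the push looks stuck.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)

# Or print the stacks automatically every 10 minutes while it runs:
# faulthandler.dump_traceback_later(600, repeat=True)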
Right. So I did the trick @lhoestq suggested. Here is where things seem to hang
Error while uploading 'data/train-00120-of-00195-466c2dbab2eb9989.parquet' to the Hub.
Pushing split train to the Hub.
Creating parquet from Arrow format: 100%|██████████| 3/3 [00:03<00:00, 1.15s/ba]
Upload 1 LFS files: 100%|██████████| 1/1 [00:52<00:00, 52.12s/it]
Creating parquet from Arrow format: 100%|██████████| 3/3 [00:03<00:00, 1.08s/ba]
Upload 1 LFS files: 100%|██████████| 1/1 [00:45<00:00, 45.54s/it]
Creating parquet from Arrow format: 100%|██████████| 3/3 [00:03<00:00, 1.08s/ba]
Creating parquet from Arrow format: 100%|██████████| 3/3 [00:03<00:00, 1.03s/ba]
Upload 1 LFS files:   0%|          | 0/1 [21:27:35<?, ?it/s]
Pushing dataset shards to the dataset hub:  63%|██████    | 122/195 [23:37:11<14:07:59, 696.98s/it]
^CError in sys.excepthook:
Traceback (most recent call last):
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1699, in print
extend(render(renderable, render_options))
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1335, in render
yield from self.render(render_output, _options)
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1331, in render
for render_output in iter_render:
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/constrain.py", line 29, in __rich_console__
yield from console.render(self.renderable, child_options)
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1331, in render
for render_output in iter_render:
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/panel.py", line 220, in __rich_console__
lines = console.render_lines(renderable, child_options, style=style)
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1371, in render_lines
lines = list(
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/segment.py", line 292, in split_and_crop_lines
for segment in segments:
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1331, in render
for render_output in iter_render:
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/padding.py", line 97, in __rich_console__
lines = console.render_lines(
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1371, in render_lines
lines = list(
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/segment.py", line 292, in split_and_crop_lines
for segment in segments:
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1335, in render
yield from self.render(render_output, _options)
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1331, in render
for render_output in iter_render:
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/syntax.py", line 611, in __rich_console__
segments = Segments(self._get_syntax(console, options))
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/segment.py", line 668, in __init__
self.segments = list(segments)
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/syntax.py", line 674, in _get_syntax
lines: Union[List[Text], Lines] = text.split("\n", allow_blank=ends_on_nl)
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/text.py", line 1042, in split
lines = Lines(
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/containers.py", line 70, in __init__
self._lines: List["Text"] = list(lines)
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/text.py", line 1043, in <genexpr>
line for line in self.divide(flatten_spans()) if line.plain != separator
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/text.py", line 385, in plain
if len(self._text) != 1:
KeyboardInterrupt
Original exception was:
Traceback (most recent call last):
File "/opt/conda/envs/main/lib/python3.10/site-packages/tqdm/contrib/concurrent.py", line 51, in _executor_map
return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
File "/opt/conda/envs/main/lib/python3.10/site-packages/tqdm/std.py", line 1178, in __iter__
for obj in iterable:
File "/opt/conda/envs/main/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
yield _result_or_cancel(fs.pop())
File "/opt/conda/envs/main/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
return fut.result(timeout)
File "/opt/conda/envs/main/lib/python3.10/concurrent/futures/_base.py", line 453, in result
self._condition.wait(timeout)
File "/opt/conda/envs/main/lib/python3.10/threading.py", line 320, in wait
waiter.acquire()
KeyboardInterrupt
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/TALI/tali/scripts/validate_dataset.py", line 127, in <module>
train_dataset.push_to_hub(repo_id="Antreas/TALI-base", max_shard_size="5GB")
File "/opt/conda/envs/main/lib/python3.10/site-packages/datasets/dataset_dict.py", line 1583, in push_to_hub
repo_id, split, uploaded_size, dataset_nbytes, _, _ = self[split]._push_parquet_shards_to_hub(
File "/opt/conda/envs/main/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 5275, in _push_parquet_shards_to_hub
_retry(
File "/opt/conda/envs/main/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 282, in _retry
return func(*func_args, **func_kwargs)
File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 826, in _inner
return fn(self, *args, **kwargs)
File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 3205, in upload_file
commit_info = self.create_commit(
File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 826, in _inner
return fn(self, *args, **kwargs)
File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 2680, in create_commit
upload_lfs_files(
File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/_commit_api.py", line 353, in upload_lfs_files
thread_map(
File "/opt/conda/envs/main/lib/python3.10/site-packages/tqdm/contrib/concurrent.py", line 69, in thread_map
return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
File "/opt/conda/envs/main/lib/python3.10/site-packages/tqdm/contrib/concurrent.py", line 49, in _executor_map
with PoolExecutor(max_workers=max_workers, initializer=tqdm_class.set_lock,
File "/opt/conda/envs/main/lib/python3.10/concurrent/futures/_base.py", line 649, in __exit__
self.shutdown(wait=True)
File "/opt/conda/envs/main/lib/python3.10/concurrent/futures/thread.py", line 235, in shutdown
t.join()
File "/opt/conda/envs/main/lib/python3.10/threading.py", line 1096, in join
self._wait_for_tstate_lock()
File "/opt/conda/envs/main/lib/python3.10/threading.py", line 1116, in _wait_for_tstate_lock
if lock.acquire(block, timeout):
KeyboardInterrupt
@Wauplin
What is the total dataset size?
There are three variants, and the random hanging happens on all three. The sizes are 2TB, 1TB, and 200GB.
Is it always failing on the same shard or is the hanging problem happening randomly?
It seems to be very much random, as restarting can help move past the previous hang, only to find a new one, or not.
Were you able to save the dataset as parquet locally? This would help us determine if the problem comes from the upload or the file generation.
Yes. The dataset seems to be locally stored as parquet.
Hmm it looks like an issue with TQDM lock. Maybe you can try updating TQDM ?
I am using the latest version of tqdm
โฌข [Docker] โฏ pip install tqdm --upgrade
Requirement already satisfied: tqdm in /opt/conda/envs/main/lib/python3.10/site-packages (4.65.0)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
I tried to catch the hanging issue in action again:
Pushing dataset shards to the dataset hub:  65%|██████    | 127/195 [2:28:02<1:19:15, 69.94s/it]
Error while uploading 'data/train-00127-of-00195-3f8d036ade107c27.parquet' to the Hub.
Pushing split train to the Hub.
Pushing dataset shards to the dataset hub:  64%|██████    | 124/195 [2:06:10<1:12:14, 61.05s/it]^C^C^C
โญโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ Traceback (most recent call last) โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ /TALI/tali/scripts/validate_dataset.py:127 in <module> โ
โ โ
โ 124 โ โ
โ 125 โ while not succesful_competion: โ
โ 126 โ โ try: โ
โ โฑ 127 โ โ โ train_dataset.push_to_hub(repo_id="Antreas/TALI-base", max_shard_size="5GB") โ
โ 128 โ โ โ succesful_competion = True โ
โ 129 โ โ except Exception as e: โ
โ 130 โ โ โ print(e) โ
โ โ
โ /opt/conda/envs/main/lib/python3.10/site-packages/datasets/dataset_dict.py:1583 in push_to_hub โ
โ โ
โ 1580 โ โ for split in self.keys(): โ
โ 1581 โ โ โ logger.warning(f"Pushing split {split} to the Hub.") โ
โ 1582 โ โ โ # The split=key needs to be removed before merging โ
โ โฑ 1583 โ โ โ repo_id, split, uploaded_size, dataset_nbytes, _, _ = self[split]._push_parq โ
โ 1584 โ โ โ โ repo_id, โ
โ 1585 โ โ โ โ split=split, โ
โ 1586 โ โ โ โ private=private, โ
โ โ
โ /opt/conda/envs/main/lib/python3.10/site-packages/datasets/arrow_dataset.py:5263 in โ
โ _push_parquet_shards_to_hub โ
โ โ
โ 5260 โ โ โ
โ 5261 โ โ uploaded_size = 0 โ
โ 5262 โ โ shards_path_in_repo = [] โ
โ โฑ 5263 โ โ for index, shard in logging.tqdm( โ
โ 5264 โ โ โ enumerate(itertools.chain([first_shard], shards_iter)), โ
โ 5265 โ โ โ desc="Pushing dataset shards to the dataset hub", โ
โ 5266 โ โ โ total=num_shards, โ
โ โ
โ /opt/conda/envs/main/lib/python3.10/site-packages/tqdm/std.py:1178 in __iter__ โ
โ โ
โ 1175 โ โ time = self._time โ
โ 1176 โ โ โ
โ 1177 โ โ try: โ
โ โฑ 1178 โ โ โ for obj in iterable: โ
โ 1179 โ โ โ โ yield obj โ
โ 1180 โ โ โ โ # Update and possibly print the progressbar. โ
โ 1181 โ โ โ โ # Note: does not call self.update(1) for speed optimisation. โ
โ โ
โ /opt/conda/envs/main/lib/python3.10/site-packages/datasets/arrow_dataset.py:5238 in โ
โ shards_with_embedded_external_files โ
โ โ
โ 5235 โ โ โ โ for shard in shards: โ
โ 5236 โ โ โ โ โ format = shard.format โ
โ 5237 โ โ โ โ โ shard = shard.with_format("arrow") โ
โ โฑ 5238 โ โ โ โ โ shard = shard.map( โ
โ 5239 โ โ โ โ โ โ embed_table_storage, โ
โ 5240 โ โ โ โ โ โ batched=True, โ
โ 5241 โ โ โ โ โ โ batch_size=1000, โ
โ โ
โ /opt/conda/envs/main/lib/python3.10/site-packages/datasets/arrow_dataset.py:578 in wrapper โ
โ โ
โ 575 โ โ else: โ
โ 576 โ โ โ self: "Dataset" = kwargs.pop("self") โ
โ 577 โ โ # apply actual function โ
โ โฑ 578 โ โ out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs) โ
โ 579 โ โ datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [ou โ
โ 580 โ โ for dataset in datasets: โ
โ 581 โ โ โ # Remove task templates if a column mapping of the template is no longer val โ
โ โ
โ /opt/conda/envs/main/lib/python3.10/site-packages/datasets/arrow_dataset.py:543 in wrapper โ
โ โ
โ 540 โ โ โ "output_all_columns": self._output_all_columns, โ
โ 541 โ โ } โ
โ 542 โ โ # apply actual function โ
โ โฑ 543 โ โ out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs) โ
โ 544 โ โ datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [ou โ
โ 545 โ โ # re-apply format to the output โ
โ 546 โ โ for dataset in datasets: โ
โ โ
โ /opt/conda/envs/main/lib/python3.10/site-packages/datasets/arrow_dataset.py:3073 in map โ
โ โ
โ 3070 โ โ โ โ โ leave=False, โ
โ 3071 โ โ โ โ โ desc=desc or "Map", โ
โ 3072 โ โ โ โ ) as pbar: โ
โ โฑ 3073 โ โ โ โ โ for rank, done, content in Dataset._map_single(**dataset_kwargs): โ
โ 3074 โ โ โ โ โ โ if done: โ
โ 3075 โ โ โ โ โ โ โ shards_done += 1 โ
โ 3076 โ โ โ โ โ โ โ logger.debug(f"Finished processing shard number {rank} of {n โ
โ โ
โ /opt/conda/envs/main/lib/python3.10/site-packages/datasets/arrow_dataset.py:3464 in _map_single โ
โ โ
โ 3461 โ โ โ โ โ โ โ โ buf_writer, writer, tmp_file = init_buffer_and_writer() โ
โ 3462 โ โ โ โ โ โ โ โ stack.enter_context(writer) โ
โ 3463 โ โ โ โ โ โ โ if isinstance(batch, pa.Table): โ
โ โฑ 3464 โ โ โ โ โ โ โ โ writer.write_table(batch) โ
โ 3465 โ โ โ โ โ โ โ else: โ
โ 3466 โ โ โ โ โ โ โ โ writer.write_batch(batch) โ
โ 3467 โ โ โ โ โ โ num_examples_progress_update += num_examples_in_batch โ
โ โ
โ /opt/conda/envs/main/lib/python3.10/site-packages/datasets/arrow_writer.py:567 in write_table โ
โ โ
โ 564 โ โ โ writer_batch_size = self.writer_batch_size โ
โ 565 โ โ if self.pa_writer is None: โ
โ 566 โ โ โ self._build_writer(inferred_schema=pa_table.schema) โ
โ โฑ 567 โ โ pa_table = pa_table.combine_chunks() โ
โ 568 โ โ pa_table = table_cast(pa_table, self._schema) โ
โ 569 โ โ if self.embed_local_files: โ
โ 570 โ โ โ pa_table = embed_table_storage(pa_table) โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ
KeyboardInterrupt
I'm on my phone so I can't help that much. What I'd advise is to save_to_disk if it's not already done and then upload the files/folder to the Hub separately. You can find what you need in the upload guide. It might not help find the exact issue for now, but at least it can unblock you.
In your last stacktrace it interrupted while embedding external content - in case your dataset is made of images or audio files that live on your disk. Is that the case?
Yeah, the dataset has images, audio, video and text.
It may be related to https://github.com/apache/arrow/issues/34455: are you using ArrayND features?
Also, what's your pyarrow version? Could you try updating to >= 12.0.1?
I was using pyarrow == 12.0.0
I am not explicitly using ArrayND features, unless the hub API automatically converts my files to such.
I have now updated to pyarrow == 12.0.1 and retrying
You can also try to reduce the max_shard_size - sometimes parquet has a hard time working with data bigger than 2GB.
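For example, the same call with a smaller shard size (the exact value is just an illustration):
# Sketch: retry the push with smaller shards so each parquet file (and each LFS upload) stays well under 2GB
train_dataset.push_to_hub(repo_id="Antreas/TALI-base", max_shard_size="500MB")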
So, updating pyarrow seems to help. It can still throw errors here and there, but I can retry when that happens. It's better than hanging.
However, I am a bit confused about something. I have uploaded my datasets, but while earlier I could see all three sets, now I can only see one. What's going on? https://huggingface.co/datasets/Antreas/TALI-base
I have seen this happen before as well, so I deleted and reuploaded, but this dataset is way too large for me to do this.
It's a bug on our side, I'll update the dataset viewer ;)
Thanks for reporting !
Apparently this happened because of bad modifications in the README.md split metadata.
I fixed them in this PR: https://huggingface.co/datasets/Antreas/TALI-base/discussions/1
@lhoestq It's a bit odd that when uploading a dataset one split at a time ("train", "val", "test"), the push_to_hub function overwrites the README and removes differently named splits from previous commits. I.e., you push "val" and all is well; then you push "test", and the "val" entry disappears from the README, while the data remains intact.
Also, I just found another related issue - one of the many that make things hang or fail when pushing to the Hub.
In the following code:
import multiprocessing as mp
import datasets

# data_generator and tali_dataset_dir are defined earlier in the script
train_generator = lambda: data_generator("train", percentage=1.0)
val_generator = lambda: data_generator("val")
test_generator = lambda: data_generator("test")
train_data = datasets.Dataset.from_generator(
train_generator,
num_proc=mp.cpu_count(),
writer_batch_size=5000,
cache_dir=tali_dataset_dir,
)
val_data = datasets.Dataset.from_generator(
val_generator,
writer_batch_size=5000,
num_proc=mp.cpu_count(),
cache_dir=tali_dataset_dir,
)
test_data = datasets.Dataset.from_generator(
test_generator,
writer_batch_size=5000,
num_proc=mp.cpu_count(),
cache_dir=tali_dataset_dir,
)
print(f"Pushing TALI-large to hub")
dataset = datasets.DatasetDict(
{"train": train_data, "val": val_data, "test": test_data}
)
succesful_competion = False
while not succesful_competion:
try:
dataset.push_to_hub(repo_id="Antreas/TALI-large", max_shard_size="2GB")
succesful_competion = True
except Exception as e:
print(e)
Things keep failing in the push_to_hub step, at random places, with the following error:
Pushing dataset shards to the dataset hub:   7%|█         | 67/950 [42:41<9:22:37, 38.23s/it]
Error while uploading 'data/train-00067-of-00950-a4d179ed5a593486.parquet' to the Hub.
Pushing split train to the Hub.
Creating parquet from Arrow format: 100%|██████████| 2/2 [00:01<00:00, 1.81ba/s]
Upload 1 LFS files: 100%|██████████| 1/1 [00:11<00:00, 11.20s/it]
Creating parquet from Arrow format: 100%|██████████| 2/2 [00:00<00:00, 2.48ba/s]
Upload 1 LFS files: 100%|██████████| 1/1 [00:15<00:00, 15.30s/it]
Creating parquet from Arrow format: 100%|██████████| 2/2 [00:00<00:00, 2.39ba/s]
Upload 1 LFS files: 100%|██████████| 1/1 [00:11<00:00, 11.52s/it]
Creating parquet from Arrow format: 100%|██████████| 2/2 [00:00<00:00, 2.47ba/s]
Upload 1 LFS files: 100%|██████████| 1/1 [00:10<00:00, 10.39s/it]
Creating parquet from Arrow format: 100%|██████████| 2/2 [00:00<00:00, 2.26ba/s]
Upload 1 LFS files:   0%|          | 0/1 [16:38<?, ?it/s]
Pushing dataset shards to the dataset hub:   7%|█         | 71/950 [44:37<9:12:28, 37.71s/it]
Error while uploading 'data/train-00071-of-00950-72bab6e5cb223aee.parquet' to the Hub.
Pushing split train to the Hub.
Creating parquet from Arrow format: 100%|██████████| 2/2 [00:00<00:00, 2.18ba/s]
Upload 1 LFS files: 100%|██████████| 1/1 [00:10<00:00, 10.94s/it]
Creating parquet from Arrow format: 100%|██████████| 2/2 [00:00<00:00, 2.36ba/s]
Upload 1 LFS files: 100%|██████████| 1/1 [00:10<00:00, 10.67s/it]
Creating parquet from Arrow format: 100%|██████████| 2/2 [00:00<00:00, 2.57ba/s]
Upload 1 LFS files: 100%|██████████| 1/1 [00:10<00:00, 10.16s/it]
Creating parquet from Arrow format: 100%|██████████| 2/2 [00:00<00:00, 2.68ba/s]
Upload 1 LFS files: 100%|██████████| 1/1 [00:09<00:00, 9.63s/it]
Creating parquet from Arrow format: 100%|██████████| 2/2 [00:00<00:00, 2.36ba/s]
Upload 1 LFS files: 100%|██████████| 1/1 [00:10<00:00, 10.67s/it]
Creating parquet from Arrow format: 100%|██████████| 2/2 [00:00<00:00, 2.37ba/s]
Upload 1 LFS files:   0%|          | 0/1 [16:39<?, ?it/s]
Pushing dataset shards to the dataset hub:   8%|█         | 76/950 [46:21<8:53:08, 36.60s/it]
Error while uploading 'data/train-00076-of-00950-b90e4e3b433db179.parquet' to the Hub.
Pushing split train to the Hub.
Creating parquet from Arrow format: 100%|██████████| 2/2 [00:00<00:00, 2.21ba/s]
Upload 1 LFS files: 100%|██████████| 1/1 [00:25<00:00, 25.40s/it]
Creating parquet from Arrow format: 100%|██████████| 2/2 [00:01<00:00, 1.56ba/s]
Upload 1 LFS files: 100%|██████████| 1/1 [00:10<00:00, 10.40s/it]
Creating parquet from Arrow format: 100%|██████████| 2/2 [00:00<00:00, 2.49ba/s]
Upload 1 LFS files: 100%|██████████| 1/1 [00:23<00:00, 23.53s/it]
Creating parquet from Arrow format: 100%|██████████| 2/2 [00:00<00:00, 2.27ba/s]
Upload 1 LFS files: 100%|██████████| 1/1 [00:10<00:00, 10.25s/it]
Creating parquet from Arrow format: 100%|██████████| 2/2 [00:00<00:00, 2.42ba/s]
Upload 1 LFS files: 100%|██████████| 1/1 [00:11<00:00, 11.03s/it]
Creating parquet from Arrow format: 100%|██████████| 2/2 [00:00<00:00, 2.39ba/s]
Upload 1 LFS files:   0%|          | 0/1 [16:39<?, ?it/s]
Pushing dataset shards to the dataset hub:   9%|█         | 81/950 [48:30<8:40:22, 35.93s/it]
Error while uploading 'data/train-00081-of-00950-84b0450a1df093a9.parquet' to the Hub.
Pushing split train to the Hub.
Creating parquet from Arrow format: 100%|██████████| 2/2 [00:00<00:00, 2.18ba/s]
Upload 1 LFS files: 100%|██████████| 1/1 [00:11<00:00, 11.65s/it]
Creating parquet from Arrow format: 100%|██████████| 2/2 [00:01<00:00, 1.92ba/s]
Upload 1 LFS files:   0%|          | 0/1 [16:38<?, ?it/s]
Pushing dataset shards to the dataset hub:   9%|█         | 82/950 [48:55<8:37:57, 35.80s/it]
Error while uploading 'data/train-00082-of-00950-0a1f52da35653e08.parquet' to the Hub.
Pushing split train to the Hub.
Creating parquet from Arrow format: 100%|██████████| 2/2 [00:00<00:00, 2.31ba/s]
Upload 1 LFS files: 100%|██████████| 1/1 [00:26<00:00, 26.29s/it]
Creating parquet from Arrow format: 100%|██████████| 2/2 [00:00<00:00, 2.42ba/s]
Upload 1 LFS files: 100%|██████████| 1/1 [00:10<00:00, 10.57s/it]
Creating parquet from Arrow format: 100%|██████████| 2/2 [00:00<00:00, 2.64ba/s]
Upload 1 LFS files: 100%|██████████| 1/1 [00:10<00:00, 10.35s/it]
Creating parquet from Arrow format: 100%|██████████| 2/2 [00:00<00:00, 2.64ba/s]
Upload 1 LFS files: 100%|██████████| 1/1 [00:11<00:00, 11.74s/it]
Creating parquet from Arrow format: 100%|██████████| 2/2 [00:00<00:00, 2.31ba/s]
Upload 1 LFS files:   0%|          | 0/1 [16:40<?, ?it/s]
Pushing dataset shards to the dataset hub:   9%|█         | 86/950 [50:48<8:30:25, 35.45s/it]
Error while uploading 'data/train-00086-of-00950-e1cc80dd17191b20.parquet' to the Hub.
I have a while loop that forces retries, but it seems that the progress itself is randomly getting lost as well. Any ideas on how to improve this? It has been blocking me for way too long.
Should I build the parquet manually and then push manually as well? If I do things manually, how can I ensure my dataset works properly with "stream=True"?
Thank you for your help and time.
@lhoestq It's a bit odd that when uploading a dataset one split at a time ("train", "val", "test"), the push_to_hub function overwrites the README and removes differently named splits from previous commits. I.e., you push "val" and all is well; then you push "test", and the "val" entry disappears from the README, while the data remains intact.
Hmm, this shouldn't happen. What code did you run exactly? And which version of datasets are you using?
I have a while loop that forces retries, but it seems that the progress itself is randomly getting lost as well. Any ideas on how to improve this? It has been blocking me for way too long.
Could you also print the cause of the error (e.__cause__)? Or show the full stack trace when the error happens?
This would give more details about why it failed and would help investigate.
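For example, the retry loop from the earlier snippet could log the full chain like this (a sketch, not code from the thread; variable names simplified):
import traceback

completed = False
while not completed:
    try:
        train_dataset.push_to_hub(repo_id="Antreas/TALI-base", max_shard_size="5GB")
        completed = True
    except Exception as e:
        # Print the whole exception chain, including the underlying HTTP/SSL error
        traceback.print_exception(type(e), e, e.__traceback__)
        print("cause:", repr(e.__cause__))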
Should I build the parquet manually and then push manually as well? If I do things manually, how can I ensure my dataset works properly with "stream=True"?
Parquet is supported out of the box ^^
If you want to make sure it works as expected you can try locally first:
ds = load_dataset("path/to/local", streaming=True)
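For instance, expanding that one-liner into a self-contained check (a sketch; the path and the "train" split name are placeholders):
from datasets import load_dataset

# Load the locally written parquet shards in streaming mode and decode one example,
# which checks that the dataset will also stream correctly once it is on the Hub
ds = load_dataset("path/to/local", streaming=True)
print(next(iter(ds["train"])))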
@lhoestq @AntreasAntoniou I transferred this issue to the datasets repository as the questions and answers are more related to this repo. Hope it can help other users find the bug and fixes more easily (like updating tqdm and pyarrow or setting a lower max_shard_size).
~For the initial "pushing large dataset consistently hangs"-issue, I still think it's best to try to save_to_disk first and then upload it manually/with a script (see upload_folder). It's not the most satisfying solution but at least it would confirm where the problem comes from.~
EDIT: removed suggestion about saving to disk first (see https://github.com/huggingface/datasets/issues/5990#issuecomment-1607186914).
@lhoestq @AntreasAntoniou I transferred this issue to the datasets repository as the questions and answers are more related to this repo. Hope it can help other users find the bug and fixes more easily (like updating https://github.com/huggingface/datasets/issues/5990#issuecomment-1607120204 and https://github.com/huggingface/datasets/issues/5990#issuecomment-1607120278 or https://github.com/huggingface/datasets/issues/5990#issuecomment-1607120328).
thanks :)
For the initial "pushing large dataset consistently hangs"-issue, I still think it's best to try to save_to_disk first and then upload it manually/with a script (see upload_folder). It's not the most satisfying solution but at least it would confirm where the problem comes from.
As I've already said in other discussions, I would not recommend pushing files saved with save_to_disk to the Hub but save to parquet shards and upload them instead. The Hub does not support datasets saved with save_to_disk, which is meant for disk only.
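A rough sketch of that route (paths, repo id, and shard count are placeholders, and ds stands for the already-built Dataset; see the huggingface_hub upload guide for details):
import os
from huggingface_hub import HfApi

num_shards = 50  # placeholder: pick so each shard stays well under a few GB
out_dir = "parquet_shards"
os.makedirs(out_dir, exist_ok=True)

# Write the dataset as parquet shards locally
for index in range(num_shards):
    shard = ds.shard(num_shards=num_shards, index=index, contiguous=True)
    shard.to_parquet(os.path.join(out_dir, f"train-{index:05d}-of-{num_shards:05d}.parquet"))

# Upload the whole folder of shards to the dataset repo
HfApi().upload_folder(
    folder_path=out_dir,
    path_in_repo="data",
    repo_id="user/dataset-name",  # placeholder
    repo_type="dataset",
)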
As I've already said in other discussions, I would not recommend pushing files saved with save_to_disk to the Hub but save to parquet shards and upload them instead. The Hub does not support datasets saved with save_to_disk, which is meant for disk only.
Well noted, thanks. That part was not clear to me :)
Sorry for not replying for a few days, I was on leave. :)
So, here is more information about the error that causes some of the delay:
Pushing Antreas/TALI-tiny to hub
Attempting to push to hub
Pushing split train to the Hub.
Creating parquet from Arrow format: 100%|██████████| 6/6 [00:24<00:00, 4.06s/ba]
Creating parquet from Arrow format: 100%|██████████| 6/6 [00:24<00:00, 4.15s/ba]
Creating parquet from Arrow format: 100%|██████████| 6/6 [00:26<00:00, 4.45s/ba]
/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/lfs.py:310: UserWarning: hf_transfer is enabled but does not support uploading from bytes or BinaryIO, falling back to regular upload
warnings.warn(
Creating parquet from Arrow format: 100%|██████████| 6/6 [00:25<00:00, 4.26s/ba]
Creating parquet from Arrow format: 100%|██████████| 6/6 [00:27<00:00, 4.58s/ba]
Creating parquet from Arrow format: 100%|██████████| 6/6 [00:24<00:00, 4.10s/ba]
Pushing dataset shards to the dataset hub:  22%|██        | 5/23 [52:23<3:08:37, 628.74s/it]
Exception: Error while uploading 'data/train-00005-of-00023-e224d901fd65e062.parquet' to the Hub., with stacktrace: <traceback object at 0x7f745458d0c0>, and type: <class 'RuntimeError'>, and
cause: HTTPSConnectionPool(host='s3.us-east-1.amazonaws.com', port=443): Max retries exceeded with url:
/lfs.huggingface.co/repos/7c/d3/7cd385d9324302dc13e3986331d72d9be6fa0174c63dcfe0e08cd474f7f1e8b7/3415166ae28c0beccbbc692f38742b8dea2c197f5c805321104e888d21d7eb90?X-Amz-Algorithm=AWS4-HMAC-SHA256
&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIA4N7VTDGO27GPWFUO%2F20230627%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230627T003349Z&X-Amz-Expires=86400&X-Amz-Signature=5a12ff96f2
91f644134170992a6628e5f3c4e7b2e7fc3e940b4378fe11ae5390&X-Amz-SignedHeaders=host&partNumber=1&uploadId=JSsK8r63XSF.VlKQx3Vf8OW4DEVp5YIIY7LPnuapNIegsxs5EHgM1p4u0.Nn6_wlPlQnvxm8HKMxZhczKE9KB74t0etB
oLcxqBIvsgey3uXBTZMAEGwU6y7CDUADiEIO&x-id=UploadPart (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:2426)')))
Push failed, retrying
Attempting to push to hub
Pushing split train to the Hub.
One issue is that the upload does not continue from the chunk where it failed. It often continues from a much older chunk; e.g., if it failed on chunk 192/250, it will continue from, say, 53/250, and this behaviour appears almost random.
Are you using a proxy of some sort ?
I am using a kubernetes cluster built into a university VPN.
So, other than the random connection drops here and there, any idea why the progress does not continue where it left off?
Pushing split train to the Hub.
Creating parquet from Arrow format: 100%|██████████| 28/28 [00:02<00:00, 10.79ba/s]
Creating parquet from Arrow format: 100%|██████████| 28/28 [00:02<00:00, 13.65ba/s]
Creating parquet from Arrow format: 100%|██████████| 28/28 [00:02<00:00, 13.39ba/s]
Creating parquet from Arrow format: 100%|██████████| 28/28 [00:02<00:00, 13.04ba/s]
Creating parquet from Arrow format: 100%|██████████| 28/28 [00:02<00:00, 13.52ba/s]
Creating parquet from Arrow format: 100%|██████████| 28/28 [00:02<00:00, 12.28ba/s]
Pushing dataset shards to the dataset hub:  20%|██        | 75/381 [1:34:39<6:26:11, 75.72s/it]
Exception: Error while uploading 'data/train-00075-of-00381-1614bc251b778766.parquet' to the Hub., with stacktrace: <traceback object at 0x7fab6d9a4980>, and type: <class 'RuntimeError'>, and
cause: HTTPSConnectionPool(host='s3.us-east-1.amazonaws.com', port=443): Max retries exceeded with url:
/lfs.huggingface.co/repos/3b/31/3b311464573d8d63b137fcd5b40af1e7a5b1306843c88e80372d0117157504e5/ed8dae933fb79ae1ef5fb1f698f5125d3e1c02977ac69438631f152bb3bfdd1e?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-
Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIA4N7VTDGO27GPWFUO%2F20230629%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230629T053004Z&X-Amz-Expires=86400&X-Amz-Signature=da2b26270edfd6d0
d069c015a5a432031107a8664c3f0917717e5e40c688183c&X-Amz-SignedHeaders=host&partNumber=1&uploadId=2erWGHTh3ICqBLU_QvHfnygZ2tkMWbL0rEqpJdYohCKHUHnfwMjvoBIg0TI_KSGn4rSKxUxOyqSIzFUFSRSzixZeLeneaXJOw.Qx8
zLKSV5xV7HRQDj4RBesNve6cSoo&x-id=UploadPart (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:2426)')))
Push failed, retrying
Attempting to push to hub
Pushing split train to the Hub.
Creating parquet from Arrow format: 100%|██████████| 28/28 [00:02<00:00, 12.09ba/s]
Creating parquet from Arrow format: 100%|██████████| 28/28 [00:02<00:00, 11.51ba/s]
Creating parquet from Arrow format: 100%|██████████| 28/28 [00:02<00:00, 10.77ba/s]
Pushing dataset shards to the dataset hub:  20%|██        | 77/381 [1:32:50<6:06:34, 72.35s/it]
Exception: Error while uploading 'data/train-00077-of-00381-368b2327a9908aab.parquet' to the Hub., with stacktrace: <traceback object at 0x7fab45b27f80>, and type: <class 'RuntimeError'>, and
cause: HTTPSConnectionPool(host='s3.us-east-1.amazonaws.com', port=443): Max retries exceeded with url:
/lfs.huggingface.co/repos/3b/31/3b311464573d8d63b137fcd5b40af1e7a5b1306843c88e80372d0117157504e5/9462ff2c5e61283b53b091984a22de2f41a2f6e37b681171e2eca4a998f979cb?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-
Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIA4N7VTDGO27GPWFUO%2F20230629%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230629T070510Z&X-Amz-Expires=86400&X-Amz-Signature=9ab8487b93d443cd
21f05476405855d46051a0771b4986bbb20f770ded21b1a4&X-Amz-SignedHeaders=host&partNumber=1&uploadId=UiHX1B.DcoAO2QmIHpWpCuNPwhXU_o1dsTkTGPqZt1P51o9k0yz.EsFD9eKpQMwgAST3jOatRG78I_JWRBeLBDYYVNp8r0TpIdeSg
eUg8uwPZOCPw9y5mWOw8MWJrnBo&x-id=UploadPart (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:2426)')))
Push failed, retrying
Attempting to push to hub
Pushing split train to the Hub.
Pushing dataset shards to the dataset hub:   8%|█         | 29/381 [27:39<5:50:03, 59.67s/it]
Map:  36%|████      | 1000/2764 [00:35<00:34, 51.63 examples/s]
Map:  72%|███████   | 2000/2764 [00:40<00:15, 49.06 examples/s]
Map:  72%|███████   | 2000/2764 [00:55<00:15, 49.06 examples/s]
Map: 100%|██████████| 2764/2764 [00:56<00:00, 48.82 examples/s]
Pushing dataset shards to the dataset hub:   8%|█         | 30/381 [28:35<5:43:03, 58.64s/it]
Pushing dataset shards to the dataset hub:   8%|█         | 31/381 [29:40<5:52:18, 60.40s/it]
Pushing dataset shards to the dataset hub:   8%|█         | 32/381 [30:46<6:02:20, 62.29s/it]
Map:  36%|████
This is actually the issue that wastes the most time for me, and I need it fixed. Please advise on how I can go about it.
Notice how the progress goes from 77/381 back to 30/381.
If any shard is missing on the Hub, it will re-upload it. It looks like the 30th shard was missing on the Hub in your case.
It also means that the other files up to the 77th that were successfully uploaded won't be uploaded again.
cc @mariosasko who might know better
@lhoestq That can't be right. The 30th shard was successfully pushed earlier. I confirmed that at the time.
It somehow went back to 22 now.
Pushing dataset shards to the dataset hub:  20%|██        | 78/381 [1:16:47<5:43:43, 68.06s/it]
Creating parquet from Arrow format: 100%|██████████| 28/28 [00:02<00:00, 12.95ba/s]
Pushing dataset shards to the dataset hub:  21%|██        | 79/381 [1:18:16<6:15:34, 74.62s/it]
Creating parquet from Arrow format: 100%|██████████| 28/28 [00:02<00:00, 13.29ba/s]
Pushing dataset shards to the dataset hub:  21%|██        | 80/381 [1:19:39<6:25:33, 76.86s/it]
Creating parquet from Arrow format: 100%|██████████| 28/28 [00:02<00:00, 11.94ba/s]
Pushing dataset shards to the dataset hub:  21%|██        | 80/381 [1:37:18<6:06:06, 72.98s/it]
Exception: Error while uploading 'data/train-00080-of-00381-062438dd5e7ca2d7.parquet' to the Hub., with stacktrace: <traceback object at 0x7fab45ba0080>, and type: <class 'RuntimeError'>, and
cause: HTTPSConnectionPool(host='s3.us-east-1.amazonaws.com', port=443): Max retries exceeded with url:
/lfs.huggingface.co/repos/3b/31/3b311464573d8d63b137fcd5b40af1e7a5b1306843c88e80372d0117157504e5/c6b3b2de546aa432c14341a4f7691dd7518ac49dc2a5635b47937dd59007b93b?X-Amz-Algorithm=AWS4-HMAC-SHA256&
X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIA4N7VTDGO27GPWFUO%2F20230629%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230629T084450Z&X-Amz-Expires=86400&X-Amz-Signature=98986fc1300e
e3f47e9bc9d7f1ee8b303d5ed3e1959d9fa988cdc5c49c457054&X-Amz-SignedHeaders=host&partNumber=1&uploadId=AWFFr6YCiEl.uXo8.EP00v9KlT7z_atlfnuI.DA1zzDf3sq2OY5HabWAQ480nnajYvJdHYif3.YCJxTTmtATT3_pfQBjwTc
4AsIRPaip5blkRVINhe69WyPo_sreoHdv&x-id=UploadPart (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:2426)')))
Push failed, retrying
Attempting to push to hub
Pushing split train to the Hub.
Creating parquet from Arrow format: 100%|██████████| 28/28 [00:02<00:00, 13.46ba/s]
Creating parquet from Arrow format: 100%|██████████| 28/28 [00:02<00:00, 12.54ba/s]
Creating parquet from Arrow format: 100%|██████████| 28/28 [00:02<00:00, 13.08ba/s]
Creating parquet from Arrow format: 100%|██████████| 28/28 [00:02<00:00, 13.16ba/s]
Pushing dataset shards to the dataset hub:  22%|██        | 83/381 [1:40:31<6:00:54, 72.67s/it]
Exception: Error while uploading 'data/train-00083-of-00381-7f61e92530de6c6f.parquet' to the Hub., with stacktrace: <traceback object at 0x7fab45b27f80>, and type: <class 'RuntimeError'>, and
cause: HTTPSConnectionPool(host='s3.us-east-1.amazonaws.com', port=443): Max retries exceeded with url:
/lfs.huggingface.co/repos/3b/31/3b311464573d8d63b137fcd5b40af1e7a5b1306843c88e80372d0117157504e5/a2efe54ae3c5eaa5161fef804a3f633a333e9336560d879ab1dcc684ac5f298f?X-Amz-Algorithm=AWS4-HMAC-SHA256&
X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIA4N7VTDGO27GPWFUO%2F20230629%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230629T102743Z&X-Amz-Expires=86400&X-Amz-Signature=0b3d008d3e39
20efa9ecf18ef4d896b2d1d82e6e67f4ed33770e7e8896b738f6&X-Amz-SignedHeaders=host&partNumber=1&uploadId=1uab1rS4FApXg_6J7WIU6papbUY2Cm1W8cla15LeqUvbDyDm_3_BQzMkiOhqBt2odhoqTZPKO8uY0zQ3XWOXSeezPumliRw
CJbCnRDt__xowcCZp2.NZklf2kpuPJxB_&x-id=UploadPart (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:2426)')))
Push failed, retrying
Attempting to push to hub
Pushing split train to the Hub.
Pushing dataset shards to the dataset hub:   6%|█         | 22/381 [20:47<6:10:15, 61.88s/it]
Map:  36%|████      | 1000/2764 [00:22<00:40, 44.01 examples/s]
I see, maybe there's a bug that causes the fingerprint to not be deterministic, then. Sorry for the inconvenience, we'll investigate.
This is where I am uploading to https://huggingface.co/datasets/Antreas/TALI-large-2
According to the commits list, the files are being uploaded in order without duplicates. No file is uploaded twice.
Therefore it seems something else is blocking the upload from resuming mid-way.
For each shard we first load the media files (e.g. images, audio) into the Arrow data before uploading. Could it be this step that is hanging ? If you try to interrupt the program when it hangs at a shard that has already been uploaded, what does the stacktrace say ? It could help locate what part in the code is blocking it.
I'll do that when it hangs. Meanwhile, if the files are uploaded in order, why can't the process see that automatically on the local side? It re-maps all the shards, then I guess "pushes" them, sees that they already exist, and moves on to the next one. However, the processing time of that re-mapping is significant. Is there a way to fix this, i.e. to avoid the re-mapping step for shards that have already been pushed?
I agree, we should ideally check whether the file has already been uploaded before embedding the media files!
Can you point me to the relevant part of the code? I wouldn't mind taking care of this. :)
Sure! The for loop that iterates over the shards to upload and checks if the file has already been uploaded is here:
and the code that embeds the external files into Arrow is a few lines earlier:
I think one way to make it work would be to compute path_in_repo and check whether that file is already in the repository before calling map.
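The rough idea, as a hedged sketch (repo id, shard naming, and the surrounding loop are placeholders; the real shard names also carry a fingerprint suffix, which is why a non-deterministic fingerprint breaks resuming):
from huggingface_hub import HfApi

# List the files already in the dataset repo once, then skip shards whose
# target path is already present before embedding/uploading them.
api = HfApi()
existing_files = set(api.list_repo_files(repo_id="user/dataset-name", repo_type="dataset"))

num_shards = 381  # placeholder
for index in range(num_shards):
    path_in_repo = f"data/train-{index:05d}-of-{num_shards:05d}.parquet"
    if path_in_repo in existing_files:
        # Shard already on the Hub: skip the expensive embed + upload step
        continue
    shard = ds.shard(num_shards=num_shards, index=index, contiguous=True)
    # ... embed external files into the shard and upload it here ...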
@lhoestq This took a while due to vacations, but I now have a working draft at https://github.com/huggingface/datasets/pull/6056
If you could review and comment that'd be great!
This comment has the code that you can run to avoid rerunning the "embed external data" step.
Also, as mentioned in the comment, these bytes will be embedded automatically in Datasets 3.0 to, among other things, make push_to_hub faster.
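For reference, a minimal sketch of the embedding step itself, based on the map call visible in the tracebacks above (this is not the code from the linked comment; shard stands for one Dataset shard whose Image/Audio columns still point at local files):
from datasets.table import embed_table_storage

# Switching to the "arrow" format and mapping embed_table_storage inlines the raw
# media bytes into the Arrow data - the same step push_to_hub runs before uploading
shard = shard.with_format("arrow")
shard = shard.map(embed_table_storage, batched=True, batch_size=1000)
shard.to_parquet("train-00000-of-00001.parquet")  # placeholder output name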
I just tried the latest version of datasets with the push_to_hub function on a large dataset; things still seem to hang. Here is the context after forcibly killing the process with Ctrl+C:
Starting preparation and upload with arguments dataset_name: Antreas/TALI-big-2.0, data_percentage: 1.0, num_data_samples: None, max_shard_size: 10GB, num_workers: 1
Map: 100%|██████████| 2633/2633 [00:51<00:00, 50.84 examples/s]
Creating parquet from Arrow format: 100%|██████████| 27/27 [01:35<00:00, 3.53s/ba]
Map: 100%|██████████| 2633/2633 [01:03<00:00, 41.22 examples/s]
Creating parquet from Arrow format: 100%|██████████| 27/27 [01:32<00:00, 3.44s/ba]
Pushing dataset shards to the dataset hub:   0%|          | 1/400 [19:45<131:21:16, 1185.15s/it]
^CError in sys.excepthook:
Traceback (most recent call last):
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1699, in print
extend(render(renderable, render_options))
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1335, in render
yield from self.render(render_output, _options)
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1331, in render
for render_output in iter_render:
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/constrain.py", line 29, in __rich_console__
yield from console.render(self.renderable, child_options)
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1331, in render
for render_output in iter_render:
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/panel.py", line 220, in __rich_console__
lines = console.render_lines(renderable, child_options, style=style)
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1371, in render_lines
lines = list(
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/segment.py", line 292, in split_and_crop_lines
for segment in segments:
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1331, in render
for render_output in iter_render:
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/padding.py", line 97, in __rich_console__
lines = console.render_lines(
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1371, in render_lines
lines = list(
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/segment.py", line 292, in split_and_crop_lines
for segment in segments:
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1335, in render
yield from self.render(render_output, _options)
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1331, in render
for render_output in iter_render:
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/syntax.py", line 611, in __rich_console__
segments = Segments(self._get_syntax(console, options))
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/segment.py", line 668, in __init__
self.segments = list(segments)
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/syntax.py", line 639, in _get_syntax
text = self.highlight(processed_code, self.line_range)
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/syntax.py", line 470, in highlight
lexer = self.lexer
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/syntax.py", line 433, in lexer
return get_lexer_by_name(
File "/opt/conda/envs/main/lib/python3.10/site-packages/pygments/lexers/__init__.py", line 126, in get_lexer_by_name
return _lexer_cache[name](**options)
File "/opt/conda/envs/main/lib/python3.10/site-packages/pygments/lexer.py", line 641, in __call__
cls._tokens = cls.process_tokendef('', cls.get_tokendefs())
File "/opt/conda/envs/main/lib/python3.10/site-packages/pygments/lexer.py", line 580, in process_tokendef
cls._process_state(tokendefs, processed, state)
File "/opt/conda/envs/main/lib/python3.10/site-packages/pygments/lexer.py", line 543, in _process_state
tokens.extend(cls._process_state(unprocessed, processed,
File "/opt/conda/envs/main/lib/python3.10/site-packages/pygments/lexer.py", line 543, in _process_state
tokens.extend(cls._process_state(unprocessed, processed,
File "/opt/conda/envs/main/lib/python3.10/site-packages/pygments/lexer.py", line 559, in _process_state
rex = cls._process_regex(tdef[0], rflags, state)
File "/opt/conda/envs/main/lib/python3.10/site-packages/pygments/lexer.py", line 488, in _process_regex
return re.compile(regex, rflags).match
File "/opt/conda/envs/main/lib/python3.10/re.py", line 251, in compile
return _compile(pattern, flags)
File "/opt/conda/envs/main/lib/python3.10/re.py", line 303, in _compile
p = sre_compile.compile(pattern, flags)
File "/opt/conda/envs/main/lib/python3.10/sre_compile.py", line 792, in compile
code = _code(p, flags)
File "/opt/conda/envs/main/lib/python3.10/sre_compile.py", line 631, in _code
_compile(code, p.data, flags)
File "/opt/conda/envs/main/lib/python3.10/sre_compile.py", line 136, in _compile
charset, hascased = _optimize_charset(av, iscased, tolower, fixes)
File "/opt/conda/envs/main/lib/python3.10/sre_compile.py", line 328, in _optimize_charset
charmap[i] = 1
KeyboardInterrupt
Original exception was:
Traceback (most recent call last):
File "/root/TALI/tali/scripts/upload_dataset_from_disk_to_hf.py", line 73, in <module>
fire.Fire(main)
File "/opt/conda/envs/main/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/opt/conda/envs/main/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/opt/conda/envs/main/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/root/TALI/tali/scripts/upload_dataset_from_disk_to_hf.py", line 62, in main
dataset.push_to_hub(
File "/opt/conda/envs/main/lib/python3.10/site-packages/datasets/dataset_dict.py", line 1641, in push_to_hub
repo_id, split, uploaded_size, dataset_nbytes, _, _ = self[split]._push_parquet_shards_to_hub(
File "/opt/conda/envs/main/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 5307, in _push_parquet_shards_to_hub
_retry(
File "/opt/conda/envs/main/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 290, in _retry
return func(*func_args, **func_kwargs)
File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 828, in _inner
return fn(self, *args, **kwargs)
File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 3221, in upload_file
commit_info = self.create_commit(
File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 828, in _inner
return fn(self, *args, **kwargs)
File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 2695, in create_commit
upload_lfs_files(
File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/_commit_api.py", line 393, in upload_lfs_files
_wrapped_lfs_upload(filtered_actions[0])
File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/_commit_api.py", line 383, in _wrapped_lfs_upload
lfs_upload(operation=operation, lfs_batch_action=batch_action, token=token)
File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/lfs.py", line 223, in lfs_upload
_upload_multi_part(operation=operation, header=header, chunk_size=chunk_size, upload_url=upload_action["href"])
File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/lfs.py", line 319, in _upload_multi_part
else _upload_parts_iteratively(operation=operation, sorted_parts_urls=sorted_parts_urls, chunk_size=chunk_size)
File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/lfs.py", line 375, in _upload_parts_iteratively
part_upload_res = http_backoff("PUT", part_upload_url, data=fileobj_slice)
File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 258, in http_backoff
response = session.request(method=method, url=url, **kwargs)
File "/opt/conda/envs/main/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "/opt/conda/envs/main/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 63, in send
return super().send(request, *args, **kwargs)
File "/opt/conda/envs/main/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
resp = conn.urlopen(
File "/opt/conda/envs/main/lib/python3.10/site-packages/urllib3/connectionpool.py", line 714, in urlopen
httplib_response = self._make_request(
File "/opt/conda/envs/main/lib/python3.10/site-packages/urllib3/connectionpool.py", line 415, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/opt/conda/envs/main/lib/python3.10/site-packages/urllib3/connection.py", line 244, in request
super(HTTPConnection, self).request(method, url, body=body, headers=headers)
File "/opt/conda/envs/main/lib/python3.10/http/client.py", line 1283, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/opt/conda/envs/main/lib/python3.10/http/client.py", line 1329, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/opt/conda/envs/main/lib/python3.10/http/client.py", line 1278, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/opt/conda/envs/main/lib/python3.10/http/client.py", line 1077, in _send_output
self.send(chunk)
File "/opt/conda/envs/main/lib/python3.10/http/client.py", line 999, in send
self.sock.sendall(data)
File "/opt/conda/envs/main/lib/python3.10/ssl.py", line 1237, in sendall
v = self.send(byte_view[count:])
File "/opt/conda/envs/main/lib/python3.10/ssl.py", line 1206, in send
return self._sslobj.write(data)
KeyboardInterrupt
^C^C^C_
It seems to hang during the PUT request to upload the data. Can you check your network ?
Having this same issue now, with image/textual data.
Describe the bug
Once I have locally built a large dataset that I want to push to the Hub, I use the recommended approach of .push_to_hub to get the dataset onto the Hub, and after pushing a few shards, it consistently hangs. This has happened over 40 times over the past week, and despite my best efforts to catch it happening, kill the process, and restart, it has been extremely time-wasting -- so I came to you to report this and to seek help.
I already tried installing hf_transfer, but it doesn't support uploading from bytes, so I uninstalled it.
Reproduction
Logs
System info