Pushing a large dataset to the Hub consistently hangs

Describe the bug

After building a large dataset locally, I use the recommended .push_to_hub method to upload it to the Hub. After a few shards have been pushed, the upload consistently hangs. This has happened more than 40 times over the past week. I have tried to catch the hang as it happens, kill the process, and restart, but that wastes a great deal of time, so I am reporting it here and asking for help.

I already tried installing hf_transfer, but it doesn't support Byte file uploads so I uninstalled it.

Reproduction

import multiprocessing as mp
import pathlib
from math import ceil

import datasets
import numpy as np
from tqdm.auto import tqdm

from tali.data.data import select_subtitles_between_timestamps
from tali.utils import load_json

tali_dataset_dir = "/data/"

if __name__ == "__main__":
    full_dataset = datasets.load_dataset(
        "Antreas/TALI", num_proc=mp.cpu_count(), cache_dir=tali_dataset_dir
    )

    def data_generator(set_name, percentage: float = 1.0):
        dataset = full_dataset[set_name]

        for item in tqdm(dataset):
            video_list = item["youtube_content_video"]
            video_list = np.random.choice(
                video_list, int(ceil(len(video_list) * percentage))
            )
            if len(video_list) == 0:
                continue
            captions = item["youtube_subtitle_text"]
            captions = select_subtitles_between_timestamps(
                subtitle_dict=load_json(
                    captions.replace(
                        "/data/",
                        tali_dataset_dir,
                    )
                ),
                starting_timestamp=0,
                ending_timestamp=100000000,
            )

            for video_path in video_list:
                temp_path = video_path.replace("/data/", tali_dataset_dir)
                video_path_actual: pathlib.Path = pathlib.Path(temp_path)

                if video_path_actual.exists():
                    # read_bytes() opens and closes the file, so no handle is leaked
                    item["youtube_content_video"] = video_path_actual.read_bytes()
                    item["youtube_subtitle_text"] = captions
                    yield item

    train_generator = lambda: data_generator("train", percentage=0.1)
    val_generator = lambda: data_generator("val")
    test_generator = lambda: data_generator("test")

    train_data = datasets.Dataset.from_generator(
        train_generator,
        num_proc=mp.cpu_count(),
        writer_batch_size=5000,
        cache_dir=tali_dataset_dir,
    )

    val_data = datasets.Dataset.from_generator(
        val_generator,
        writer_batch_size=5000,
        num_proc=mp.cpu_count(),
        cache_dir=tali_dataset_dir,
    )

    test_data = datasets.Dataset.from_generator(
        test_generator,
        writer_batch_size=5000,
        num_proc=mp.cpu_count(),
        cache_dir=tali_dataset_dir,
    )

    dataset = datasets.DatasetDict(
        {
            "train": train_data,
            "val": val_data,
            "test": test_data,
        }
    )
    successful_completion = False
    while not successful_completion:
        try:
            dataset.push_to_hub(repo_id="Antreas/TALI-small", max_shard_size="5GB")
            successful_completion = True
        except Exception as e:
            print(e)
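
Note that the retry loop above only reacts to exceptions, so it cannot recover from an upload that hangs without raising. A minimal sketch of one possible workaround, assuming the push is moved into its own script (the name push_script.py is hypothetical): run it as a child process with a wall-clock timeout and start it again when the timeout expires. The logs below show "Resuming upload of the dataset shards." after a restart, so a restarted push should not begin from scratch. The helper name, timeout, and attempt count are illustrative, not recommendations.

```python
import subprocess
import sys


def run_with_watchdog(cmd, timeout_s, max_attempts=3):
    """Run cmd, killing and retrying it if it exceeds timeout_s seconds.

    Returns True once an attempt exits with status 0, otherwise False.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            # subprocess.run kills the child and raises TimeoutExpired on timeout.
            result = subprocess.run(cmd, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            print(f"attempt {attempt}: no exit after {timeout_s}s, killed")
            continue
        if result.returncode == 0:
            return True
        print(f"attempt {attempt}: exited with status {result.returncode}")
    return False


# Example call (push_script.py is a hypothetical script that only calls push_to_hub):
# ok = run_with_watchdog([sys.executable, "push_script.py"], timeout_s=3600)
```

A whole-process timeout is crude: it has to be longer than a healthy run between restarts, so it trades some wasted time for not having to watch the terminal.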

Logs

Pushing dataset shards to the dataset hub:  33%|██████████████████████████████████████▎                                                                            | 7/21 [24:33<49:06, 210.45s/it]
Error while uploading 'data/val-00007-of-00021-6b216a984af1a4c8.parquet' to the Hub.                                                                                                               
Pushing split train to the Hub.                                                                                                                                                                    
Resuming upload of the dataset shards.                                                                                                                                                             
Pushing dataset shards to the dataset hub: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 46/46 [42:10<00:00, 55.01s/it]
Pushing split val to the Hub.                                                                                                                                                                      
Resuming upload of the dataset shards.                                                                                                                                                             
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  1.55ba/s]
Upload 1 LFS files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:23<00:00, 23.51s/it]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00,  1.39ba/s]
Upload 1 LFS files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:30<00:00, 30.19s/it]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00,  1.28ba/s]
Upload 1 LFS files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:24<00:00, 24.08s/it]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00,  1.42ba/s]
Upload 1 LFS files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:23<00:00, 23.97s/it]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00,  1.49ba/s]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00,  1.54ba/s]
Upload 1 LFS files:   0%|                                                                                                                                                    | 0/1 [04:42<?, ?it/s]
Pushing dataset shards to the dataset hub:  52%|████████████████████████████████████████████████████████████▏                                                      | 11/21 [17:23<15:48, 94.82s/it]

That's where it got stuck.

System info

- huggingface_hub version: 0.15.1
- Platform: Linux-5.4.0-147-generic-x86_64-with-glibc2.35
- Python version: 3.10.11
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Token path ?: /root/.cache/huggingface/token
- Has saved token ?: True
- Who am I ?: Antreas
- Configured git credential helpers: store
- FastAI: N/A
- Tensorflow: N/A
- Torch: 2.1.0.dev20230606+cu121
- Jinja2: 3.1.2
- Graphviz: N/A
- Pydot: N/A
- Pillow: 9.5.0
- hf_transfer: N/A
- gradio: N/A
- numpy: 1.24.3
- ENDPOINT: https://huggingface.co
- HUGGINGFACE_HUB_CACHE: /root/.cache/huggingface/hub
- HUGGINGFACE_ASSETS_CACHE: /root/.cache/huggingface/assets
- HF_TOKEN_PATH: /root/.cache/huggingface/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False

Hi @AntreasAntoniou, sorry to hear you are facing this issue. To help debug it, could you tell me:

What is the total dataset size?
Is it always failing on the same shard or is the hanging problem happening randomly?
Were you able to save the dataset as parquet locally? This would help us determine if the problem comes from the upload or the file generation.

I'm cc-ing @lhoestq who might have some insights from a datasets perspective.

One trick that can also help is to check the traceback when you kill your Python process: it will show where in the code it was hanging.
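
A sketch of one way to get the same information without killing the process, using the standard library's faulthandler module. The function name enable_hang_diagnostics and the 10-minute interval are illustrative; the on-demand signal part assumes a Unix platform where SIGUSR1 exists.

```python
import faulthandler
import signal
import sys


def enable_hang_diagnostics(interval_s=600, file=sys.stderr):
    """Dump the stack of every thread periodically and on SIGUSR1.

    `file` must be backed by a real file descriptor (it needs fileno()).
    """
    # Every interval_s seconds, write the tracebacks of all threads to `file`.
    faulthandler.dump_traceback_later(interval_s, repeat=True, file=file)
    # Also dump on demand: run `kill -USR1 <pid>` from another shell (Unix only).
    if hasattr(signal, "SIGUSR1"):
        faulthandler.register(signal.SIGUSR1, file=file, all_threads=True)
```

Calling this once at the top of the upload script, before push_to_hub, would leave a record of where each thread was at every interval, including the upload worker threads that a Ctrl-C traceback does not show.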

Right. So I did the trick @lhoestq suggested. Here is where things seem to hang

Error while uploading 'data/train-00120-of-00195-466c2dbab2eb9989.parquet' to the Hub.                                                                                                     
Pushing split train to the Hub.                                                                                                                                                            
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00,  1.15s/ba]
Upload 1 LFS files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:52<00:00, 52.12s/it]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00,  1.08s/ba]
Upload 1 LFS files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:45<00:00, 45.54s/it]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00,  1.08s/ba]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00,  1.03s/ba]
Upload 1 LFS files:   0%|                                                                                                                                         | 0/1 [21:27:35<?, ?it/s]
Pushing dataset shards to the dataset hub:  63%|█████████████████████████████████████████████████████████████▎                                    | 122/195 [23:37:11<14:07:59, 696.98s/it]
^CError in sys.excepthook:                                                                                                                                                                 
Traceback (most recent call last):                                                                                                                                                         
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1699, in print                                                                                            
    extend(render(renderable, render_options))                                                                                                                                             
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1335, in render                                                                                           
    yield from self.render(render_output, _options)                                                                                                                                        
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1331, in render                                                                                           
    for render_output in iter_render:                                                                                                                                                      
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/constrain.py", line 29, in __rich_console__                                                                                 
    yield from console.render(self.renderable, child_options)                                                                                                                              
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1331, in render                                                                                           
    for render_output in iter_render:                                                                                                                                                      
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/panel.py", line 220, in __rich_console__                                                                                    
    lines = console.render_lines(renderable, child_options, style=style)                                                                                                                   
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1371, in render_lines                                                                                     
    lines = list(                                                                                                                                                                          
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/segment.py", line 292, in split_and_crop_lines                                                                              
    for segment in segments:                                                                                                                                                               
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1331, in render                                                                                           
    for render_output in iter_render:                                                                                                                                                      
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/padding.py", line 97, in __rich_console__                                                                                   
    lines = console.render_lines(                                                                                                                                                          
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1371, in render_lines                                                                                     
    lines = list(                                                                                                                                                                          
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/segment.py", line 292, in split_and_crop_lines                                                                              
    for segment in segments:                                                                                                                                                               
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1335, in render                                                                                           
    yield from self.render(render_output, _options)                                                                                                                                        
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1331, in render                                                                                           
    for render_output in iter_render:                                                                                                                                                      
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/syntax.py", line 611, in __rich_console__                                                                                   
    segments = Segments(self._get_syntax(console, options))                                                                                                                                
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/segment.py", line 668, in __init__                                                                                          
    self.segments = list(segments)                                                                                                                                                         
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/syntax.py", line 674, in _get_syntax                                                                                        
    lines: Union[List[Text], Lines] = text.split("\n", allow_blank=ends_on_nl)                                                                                                             
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/text.py", line 1042, in split                                                                                               
    lines = Lines(                                                                                                                                                                         
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/containers.py", line 70, in __init__                                                                                        
    self._lines: List["Text"] = list(lines)                                                                                                                                                
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/text.py", line 1043, in <genexpr>                                                                                           
    line for line in self.divide(flatten_spans()) if line.plain != separator                                                                                                               
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/text.py", line 385, in plain                                                    
    if len(self._text) != 1:                                                                                                                                                               
KeyboardInterrupt                                                                                                                                                                                                                                                                              

Original exception was:                                                                                                                                                                                                                                                                        
Traceback (most recent call last):                                                                                                                                                                                                          
  File "/opt/conda/envs/main/lib/python3.10/site-packages/tqdm/contrib/concurrent.py", line 51, in _executor_map                                                                                                                                                                               
    return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))                                                                                                                                                          
  File "/opt/conda/envs/main/lib/python3.10/site-packages/tqdm/std.py", line 1178, in __iter__                                                                                                                                                                                                 
    for obj in iterable:                                                                                                                                                                   
  File "/opt/conda/envs/main/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator                                                                                                                                                                                         
    yield _result_or_cancel(fs.pop())                                                                                                                                                      
  File "/opt/conda/envs/main/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel                                                                                                                                                                                       
    return fut.result(timeout)                                                                                                                                                             
  File "/opt/conda/envs/main/lib/python3.10/concurrent/futures/_base.py", line 453, in result                                                                                                                                                                                                  
    self._condition.wait(timeout)                                                                                                                                                                                                           
  File "/opt/conda/envs/main/lib/python3.10/threading.py", line 320, in wait                                                                                                                                                                                                                   
    waiter.acquire()                                                                                                                                                                                                                        
KeyboardInterrupt                                                                                                                                                                                                                                                                              

During handling of the above exception, another exception occurred:                                                                                                                                                                                                                            

Traceback (most recent call last):                                                                                                                                                                                                                                                             
  File "/TALI/tali/scripts/validate_dataset.py", line 127, in <module>                                                                                                            
    train_dataset.push_to_hub(repo_id="Antreas/TALI-base", max_shard_size="5GB")                                                                                                                                                                                                               
  File "/opt/conda/envs/main/lib/python3.10/site-packages/datasets/dataset_dict.py", line 1583, in push_to_hub                                                                                                                                                                                                                                                      
    repo_id, split, uploaded_size, dataset_nbytes, _, _ = self[split]._push_parquet_shards_to_hub(                                                                                                                                                                                             
  File "/opt/conda/envs/main/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 5275, in _push_parquet_shards_to_hub                                                                                                                                                                                                                                     
    _retry(                                                                                                                                                                                                                                                                                    
  File "/opt/conda/envs/main/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 282, in _retry                                                                                                                                                                                                                                                        
    return func(*func_args, **func_kwargs)                                                                                                                                                                                                                                                     
  File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn                                                                                                                                                                                                                                             
    return fn(*args, **kwargs)                                                                                                                 
  File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 826, in _inner                                                                                                                                                                                                                                                           
    return fn(self, *args, **kwargs)                                                                                                                                                                                                                                                           
  File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 3205, in upload_file                                                                                                                                                                                                                                                     
    commit_info = self.create_commit(                                  
  File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn                                                                                                                                                                                                                                             
    return fn(*args, **kwargs)                                                           
  File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 826, in _inner                                                                                                                                                                                                                                                           
    return fn(self, *args, **kwargs)                                   
  File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 2680, in create_commit                                                                                                                                                                                                                                                   
    upload_lfs_files(                                                                    
  File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn                                                                                                                                                                                                                                             
    return fn(*args, **kwargs)                                                           
  File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/_commit_api.py", line 353, in upload_lfs_files                                                                                                                                                                                                                                            
    thread_map(                                                                          
  File "/opt/conda/envs/main/lib/python3.10/site-packages/tqdm/contrib/concurrent.py", line 69, in thread_map                                                                                                                                                                                                                                                       
    return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)                                                                                                       
  File "/opt/conda/envs/main/lib/python3.10/site-packages/tqdm/contrib/concurrent.py", line 49, in _executor_map                                                                                                                                                                                                                                                    
    with PoolExecutor(max_workers=max_workers, initializer=tqdm_class.set_lock,                                                                                                   
  File "/opt/conda/envs/main/lib/python3.10/concurrent/futures/_base.py", line 649, in __exit__                                                                                                                                                                                                                                                                     
    self.shutdown(wait=True)                                                             
  File "/opt/conda/envs/main/lib/python3.10/concurrent/futures/thread.py", line 235, in shutdown                                                                                                                                                                                                                                                                    
    t.join()                                                                             
  File "/opt/conda/envs/main/lib/python3.10/threading.py", line 1096, in join                                                                                                     
    self._wait_for_tstate_lock()                                                         
  File "/opt/conda/envs/main/lib/python3.10/threading.py", line 1116, in _wait_for_tstate_lock                                                                                                                                                                                                                                                                      
    if lock.acquire(block, timeout):                                                     
KeyboardInterrupt

@Wauplin

What is the total dataset size?

There are three variants, and the random hanging happens on all three. The sizes are 2TB, 1TB, and 200GB.

Is it always failing on the same shard or is the hanging problem happening randomly?

It seems to be very much random, as restarting can help move past the previous hang, only to find a new one, or not.

Were you able to save the dataset as parquet locally? This would help us determine if the problem comes from the upload or the file generation.

Yes. The dataset seems to be locally stored as parquet.

Hmm, it looks like an issue with the tqdm lock. Maybe you can try updating tqdm?

I am using the latest version of tqdm

⬢ [Docker] ❯ pip install tqdm --upgrade
Requirement already satisfied: tqdm in /opt/conda/envs/main/lib/python3.10/site-packages (4.65.0)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

I tried to catch the hanging issue in action again:

Pushing dataset shards to the dataset hub:  65%|█████████████████████████████████████████████████████████████████▊                                   | 127/195 [2:28:02<1:19:15, 69.94s/it]                                               
Error while uploading 'data/train-00127-of-00195-3f8d036ade107c27.parquet' to the Hub.                                                                                                                                                    
Pushing split train to the Hub.                                                                                                                                                                                                           
Pushing dataset shards to the dataset hub:  64%|████████████████████████████████████████████████████████████████▏                                    | 124/195 [2:06:10<1:12:14, 61.05s/it]C^[^C^C^C                                      
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮                                                                                                                                      
│ /TALI/tali/scripts/validate_dataset.py:127 in <module>                                           │                                                                                                                                      
│                                                                                                  │                                                                                                                                      
│   124 │                                                                                          │                                                                                                                                      
│   125 │   while not succesful_competion:                                                         │                                                                                                                                      
│   126 │   │   try:                                                                               │                                                                                                                                      
│ ❱ 127 │   │   │   train_dataset.push_to_hub(repo_id="Antreas/TALI-base", max_shard_size="5GB")   │                                                                                                                                      
│   128 │   │   │   succesful_competion = True                                                     │                                                                                                                                      
│   129 │   │   except Exception as e:                                                             │                                                                                                                                      
│   130 │   │   │   print(e)                                                                       │                                                                                                                                      
│                                                                                                  │                                                                                                                                      
│ /opt/conda/envs/main/lib/python3.10/site-packages/datasets/dataset_dict.py:1583 in push_to_hub   │                                                                                                                                      
│                                                                                                  │                                                                                                                                      
│   1580 │   │   for split in self.keys():                                                         │                                                                                                                                      
│   1581 │   │   │   logger.warning(f"Pushing split {split} to the Hub.")                          │                                                                                                                                      
│   1582 │   │   │   # The split=key needs to be removed before merging                            │                                                                                                                                      
│ ❱ 1583 │   │   │   repo_id, split, uploaded_size, dataset_nbytes, _, _ = self[split]._push_parq  │                                                                                                                                      
│   1584 │   │   │   │   repo_id,                                                                  │                                                                                                                                      
│   1585 │   │   │   │   split=split,                                                              │                                                                                                                                      
│   1586 │   │   │   │   private=private,                                                          │                                                                                                                                      
│                                                                                                  │                                                                                                                                      
│ /opt/conda/envs/main/lib/python3.10/site-packages/datasets/arrow_dataset.py:5263 in              │                                                                                                                                      
│ _push_parquet_shards_to_hub                                                                      │                                                                                                                                      
│                                                                                                  │                                                                                                                                      
│   5260 │   │                                                                                     │                                                                                                                                      
│   5261 │   │   uploaded_size = 0                                                                 │                                                                                                                                      
│   5262 │   │   shards_path_in_repo = []                                                          │                                                                                                                                      
│ ❱ 5263 │   │   for index, shard in logging.tqdm(                                                 │                                                                                                                                      
│   5264 │   │   │   enumerate(itertools.chain([first_shard], shards_iter)),                       │                                                                                                                                      
│   5265 │   │   │   desc="Pushing dataset shards to the dataset hub",                             │                                                                                                                                      
│   5266 │   │   │   total=num_shards,                                                             │                                                                                                                                      
│                                                                                                  │                                                                                                                                      
│ /opt/conda/envs/main/lib/python3.10/site-packages/tqdm/std.py:1178 in __iter__                   │                                                                                                                                      
│                                                                                                  │                                                                                                                                      
│   1175 │   │   time = self._time                                                                 │                                                                                                                                      
│   1176 │   │                                                                                     │                                                                                                                                      
│   1177 │   │   try:                                                                              │
│ ❱ 1178 │   │   │   for obj in iterable:                                                          │
│   1179 │   │   │   │   yield obj                                                                 │
│   1180 │   │   │   │   # Update and possibly print the progressbar.                              │
│   1181 │   │   │   │   # Note: does not call self.update(1) for speed optimisation.              │
│                                                                                                  │
│ /opt/conda/envs/main/lib/python3.10/site-packages/datasets/arrow_dataset.py:5238 in              │
│ shards_with_embedded_external_files                                                              │
│                                                                                                  │
│   5235 │   │   │   │   for shard in shards:                                                      │
│   5236 │   │   │   │   │   format = shard.format                                                 │
│   5237 │   │   │   │   │   shard = shard.with_format("arrow")                                    │
│ ❱ 5238 │   │   │   │   │   shard = shard.map(                                                    │
│   5239 │   │   │   │   │   │   embed_table_storage,                                              │
│   5240 │   │   │   │   │   │   batched=True,                                                     │
│   5241 │   │   │   │   │   │   batch_size=1000,                                                  │
│                                                                                                  │
│ /opt/conda/envs/main/lib/python3.10/site-packages/datasets/arrow_dataset.py:578 in wrapper       │
│                                                                                                  │
│    575 │   │   else:                                                                             │
│    576 │   │   │   self: "Dataset" = kwargs.pop("self")                                          │
│    577 │   │   # apply actual function                                                           │
│ ❱  578 │   │   out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)                │                                         
│    579 │   │   datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [ou  │                                         
│    580 │   │   for dataset in datasets:                                                          │                                         
│    581 │   │   │   # Remove task templates if a column mapping of the template is no longer val  │                                         
│                                                                                                  │                                         
│ /opt/conda/envs/main/lib/python3.10/site-packages/datasets/arrow_dataset.py:543 in wrapper       │                                         
│                                                                                                  │                                         
│    540 │   │   │   "output_all_columns": self._output_all_columns,                               │                                         
│    541 │   │   }                                                                                 │                                         
│    542 │   │   # apply actual function                                                           │                                                                  
│ ❱  543 │   │   out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)                │                                                                  
│    544 │   │   datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [ou  │                                                                  
│    545 │   │   # re-apply format to the output                                                   │                                                                  
│    546 │   │   for dataset in datasets:                                                          │                                                                  
│                                                                                                  │                                                                  
│ /opt/conda/envs/main/lib/python3.10/site-packages/datasets/arrow_dataset.py:3073 in map          │                                                                  
│                                                                                                  │                                                                  
│   3070 │   │   │   │   │   leave=False,                                                          │                                                                  
│   3071 │   │   │   │   │   desc=desc or "Map",                                                   │                                                                  
│   3072 │   │   │   │   ) as pbar:                                                                │                                                                  
│ ❱ 3073 │   │   │   │   │   for rank, done, content in Dataset._map_single(**dataset_kwargs):     │                                                                  
│   3074 │   │   │   │   │   │   if done:                                                          │                                                                  
│   3075 │   │   │   │   │   │   │   shards_done += 1                                              │                                                                                                     
│   3076 │   │   │   │   │   │   │   logger.debug(f"Finished processing shard number {rank} of {n  │                                                                                                     
│                                                                                                  │                                                                                                     
│ /opt/conda/envs/main/lib/python3.10/site-packages/datasets/arrow_dataset.py:3464 in _map_single  │                                                                                                     
│                                                                                                  │                                                                                                     
│   3461 │   │   │   │   │   │   │   │   buf_writer, writer, tmp_file = init_buffer_and_writer()   │                                                                                                     
│   3462 │   │   │   │   │   │   │   │   stack.enter_context(writer)                               │                                                                                                     
│   3463 │   │   │   │   │   │   │   if isinstance(batch, pa.Table):                               │                                                                                                     
│ ❱ 3464 │   │   │   │   │   │   │   │   writer.write_table(batch)                                 │                                                                                                     
│   3465 │   │   │   │   │   │   │   else:                                                         │                                                                                                     
│   3466 │   │   │   │   │   │   │   │   writer.write_batch(batch)                                 │                                                                                                     
│   3467 │   │   │   │   │   │   num_examples_progress_update += num_examples_in_batch             │                                                                                                     
│                                                                                                  │                                                                                                     
│ /opt/conda/envs/main/lib/python3.10/site-packages/datasets/arrow_writer.py:567 in write_table    │                                                                                                     
│                                                                                                  │                                                                                                     
│   564 │   │   │   writer_batch_size = self.writer_batch_size                                     │                                                                                                     
│   565 │   │   if self.pa_writer is None:                                                         │                                                                                                     
│   566 │   │   │   self._build_writer(inferred_schema=pa_table.schema)                            │                                                                                                     
│ ❱ 567 │   │   pa_table = pa_table.combine_chunks()                                               │                                                                                                     
│   568 │   │   pa_table = table_cast(pa_table, self._schema)                                      │                                                                                                     
│   569 │   │   if self.embed_local_files:                                                         │                                                                                                     
│   570 │   │   │   pa_table = embed_table_storage(pa_table)                                       │                                                                                                     
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯                                                                                                     
KeyboardInterrupt

I'm on my phone so can't help that much. What I'd advise is to save_to_disk if it's not already done, and then upload the files/folder to the Hub separately. You can find what you need in the upload guide. It might not help find the exact issue for now, but at least it can unblock you.
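A minimal sketch of that workaround, assuming `huggingface_hub` is installed and the dataset has already been written locally with `save_to_disk`. The `upload_saved_dataset` helper and its parameters are hypothetical names for illustration; by default it calls `HfApi.upload_file` once per file, and it skips any path listed in `already_uploaded` so an interrupted run can be resumed.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Optional


def upload_saved_dataset(
    folder: str,
    repo_id: str,
    already_uploaded: Iterable[str] = (),
    upload_fn: Optional[Callable[..., object]] = None,
) -> List[str]:
    """Upload every file under `folder` to a dataset repo, one file at a time.

    Files whose relative path is in `already_uploaded` are skipped, so a run
    that was interrupted can be resumed without re-sending finished files.
    Returns the relative paths that were sent in this call.
    """
    if upload_fn is None:
        # Imported lazily so the helper can be exercised without the library.
        from huggingface_hub import HfApi

        upload_fn = HfApi().upload_file

    done = set(already_uploaded)
    sent: List[str] = []
    root = Path(folder)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        path_in_repo = path.relative_to(root).as_posix()
        if path_in_repo in done:
            continue
        upload_fn(
            path_or_fileobj=str(path),
            path_in_repo=path_in_repo,
            repo_id=repo_id,
            repo_type="dataset",
        )
        sent.append(path_in_repo)
    return sent
```

To resume, the list returned by `HfApi().list_repo_files(repo_id, repo_type="dataset")` can be passed as `already_uploaded`. Note that `save_to_disk` writes Arrow files, so the resulting repo layout differs from the parquet shards that `push_to_hub` produces.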

In your last stack trace, it was interrupted while embedding external content, which happens when your dataset is made of images or audio files that live on your disk. Is that the case?

Yeah, the dataset has images, audio, video and text.

It may be related to https://github.com/apache/arrow/issues/34455: are you using ArrayND features?

Also, what's your pyarrow version? Could you try updating to >= 12.0.1?

I was using pyarrow == 12.0.0

I am not explicitly using ArrayND features, unless the hub API automatically converts my files to such.

I have now updated to pyarrow == 12.0.1 and retrying
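A small sketch of how the version requirement above could be checked before starting a long upload. The helper names are hypothetical, and the parser is deliberately simple: it compares only the leading numeric components, so a suffix such as `.dev0` is ignored.

```python
from importlib import metadata
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '12.0.1' into (12, 0, 1); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_at_least(installed: str, minimum: str) -> bool:
    """Compare two dotted version strings by their numeric components."""
    return parse_version(installed) >= parse_version(minimum)


def check_pyarrow(minimum: str = "12.0.1") -> None:
    """Raise early if the installed pyarrow is older than `minimum`."""
    try:
        installed = metadata.version("pyarrow")
    except metadata.PackageNotFoundError:
        raise RuntimeError("pyarrow is not installed") from None
    if not is_at_least(installed, minimum):
        raise RuntimeError(f"pyarrow {installed} found, need >= {minimum}")
```

Calling `check_pyarrow()` at the top of the upload script fails fast instead of discovering the problem hours into a push.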

You can also try reducing max_shard_size. Parquet sometimes has a hard time working with data bigger than 2GB.

So, updating pyarrow seems to help. It can still throw errors here and there, but I can retry when that happens. It's better than hanging.
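The retry-on-error approach can be sketched as a bounded loop with exponential backoff, instead of an unbounded `while` loop. This is an illustrative helper, not part of `datasets`; the name `retry` and its parameters are assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn` until it succeeds or `max_attempts` is reached.

    The wait doubles after each failure, capped at `max_delay` seconds.
    The last exception is re-raised if every attempt fails.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as err:
            if attempt == max_attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            print(f"Attempt {attempt} failed ({err!r}); retrying in {delay:.0f}s")
            sleep(delay)
    raise RuntimeError("unreachable")  # the loop always returns or raises
```

It could wrap the call from the script above, e.g. `retry(lambda: train_dataset.push_to_hub(repo_id="Antreas/TALI-base", max_shard_size="5GB"))`. This only helps when the upload raises an exception; a true hang raises nothing, so the loop never gets a chance to retry.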

However, I am a bit confused about something. I have uploaded my datasets, but while earlier I could see all three sets, now I can only see 1. What's going on? https://huggingface.co/datasets/Antreas/TALI-base

I have seen this happen before as well, so I deleted and reuploaded, but this dataset is way too large for me to do this.

It's a bug on our side, I'll update the dataset viewer ;)

Thanks for reporting !

Apparently this happened because of bad modifications in the README.md split metadata.

I fixed them in this PR: https://huggingface.co/datasets/Antreas/TALI-base/discussions/1

@lhoestq It's a bit odd that when uploading a dataset one split at a time ("train", "val", "test"), the push_to_hub function overwrites the README and removes differently named splits from previous commits. That is, you push "val" and all is well. Then you push "test", and the "val" entry disappears from the README, while the data remains intact.

Also, I just found another related issue, one of the many that make things hang or fail when pushing to the Hub.

In the following code:

    train_generator = lambda: data_generator("train", percentage=1.0)
    val_generator = lambda: data_generator("val")
    test_generator = lambda: data_generator("test")

    train_data = datasets.Dataset.from_generator(
        train_generator,
        num_proc=mp.cpu_count(),
        writer_batch_size=5000,
        cache_dir=tali_dataset_dir,
    )

    val_data = datasets.Dataset.from_generator(
        val_generator,
        writer_batch_size=5000,
        num_proc=mp.cpu_count(),
        cache_dir=tali_dataset_dir,
    )

    test_data = datasets.Dataset.from_generator(
        test_generator,
        writer_batch_size=5000,
        num_proc=mp.cpu_count(),
        cache_dir=tali_dataset_dir,
    )

    print("Pushing TALI-large to hub")

    dataset = datasets.DatasetDict(
        {"train": train_data, "val": val_data, "test": test_data}
    )
    succesful_competion = False

    while not succesful_competion:
        try:
            dataset.push_to_hub(repo_id="Antreas/TALI-large", max_shard_size="2GB")
            succesful_competion = True
        except Exception as e:
            print(e)

Things keep failing in the push_to_hub step, at random places, with the following error:

  Pushing dataset shards to the dataset hub:   7%|██████████▋                                                                                                                                            | 67/950 [42:41<9:22:37, 38.23s/it]
Error while uploading 'data/train-00067-of-00950-a4d179ed5a593486.parquet' to the Hub.
Pushing split train to the Hub.
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.81ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:11<00:00, 11.20s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.48ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:15<00:00, 15.30s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.39ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:11<00:00, 11.52s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.47ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.39s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.26ba/s]
Upload 1 LFS files:   0%|                                                                                                                                                                                           | 0/1 [16:38<?, ?it/s]
Pushing dataset shards to the dataset hub:   7%|███████████▎                                                                                                                                           | 71/950 [44:37<9:12:28, 37.71s/it]
Error while uploading 'data/train-00071-of-00950-72bab6e5cb223aee.parquet' to the Hub.
Pushing split train to the Hub.
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.18ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.94s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.36ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.67s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.57ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.16s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.68ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:09<00:00,  9.63s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.36ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.67s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.37ba/s]
Upload 1 LFS files:   0%|                                                                                                                                                                                           | 0/1 [16:39<?, ?it/s]
Pushing dataset shards to the dataset hub:   8%|████████████                                                                                                                                           | 76/950 [46:21<8:53:08, 36.60s/it]
Error while uploading 'data/train-00076-of-00950-b90e4e3b433db179.parquet' to the Hub.
Pushing split train to the Hub.
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.21ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:25<00:00, 25.40s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.56ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.40s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.49ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:23<00:00, 23.53s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.27ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.25s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.42ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:11<00:00, 11.03s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.39ba/s]
Upload 1 LFS files:   0%|                                                                                                                                                                                           | 0/1 [16:39<?, ?it/s]
Pushing dataset shards to the dataset hub:   9%|████████████▊                                                                                                                                          | 81/950 [48:30<8:40:22, 35.93s/it]
Error while uploading 'data/train-00081-of-00950-84b0450a1df093a9.parquet' to the Hub.
Pushing split train to the Hub.
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.18ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:11<00:00, 11.65s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.92ba/s]
Upload 1 LFS files:   0%|                                                                                                                                                                                           | 0/1 [16:38<?, ?it/s]
Pushing dataset shards to the dataset hub:   9%|█████████████                                                                                                                                          | 82/950 [48:55<8:37:57, 35.80s/it]
Error while uploading 'data/train-00082-of-00950-0a1f52da35653e08.parquet' to the Hub.
Pushing split train to the Hub.
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.31ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:26<00:00, 26.29s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.42ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.57s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.64ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.35s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.64ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:11<00:00, 11.74s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.31ba/s]
Upload 1 LFS files:   0%|                                                                                                                                                                                           | 0/1 [16:40<?, ?it/s]
Pushing dataset shards to the dataset hub:   9%|█████████████▋                                                                                                                                         | 86/950 [50:48<8:30:25, 35.45s/it]
Error while uploading 'data/train-00086-of-00950-e1cc80dd17191b20.parquet' to the Hub.

I have a while loop that forces retries, but it seems that the progress itself is randomly getting lost as well. Any ideas on how to improve this? It has been blocking me for way too long.

Should I build the parquet manually and then push manually as well? If I do things manually, how can I ensure my dataset works properly with "stream=True"?

Thank you for your help and time.

> @lhoestq It's a bit odd that when uploading a dataset, one set at a time "train", "val", "test", the push_to_hub function overwrites the readme and removes differently named sets from previous commits. i.e., you push "val", all is well. Then you push "test", and the "val" entry disappears from the readme, while the data remain intact.

Hmm this shouldn't happen. What code did you run exactly ? Using which version of datasets ?

> I have a while loop that forces retries, but it seems that the progress itself is randomly getting lost as well. Any ideas on how to improve this? It has been blocking me for way too long.

Could you also print the cause of the error (e.__cause__) ? Or show the full stack trace when the error happens ? This would give more details about why it failed and would help investigate.
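A minimal sketch of how a retry loop could surface that information. `push_with_retries` and its `push` argument are hypothetical names: `push` stands in for whatever zero-argument callable wraps the `push_to_hub` call in your own script, and the backoff values are arbitrary.

```python
import time
import traceback


def push_with_retries(push, max_attempts=5, base_delay=1.0):
    """Call `push()` until it succeeds, logging the root cause of each failure.

    `push` is a hypothetical zero-argument callable, for example
    `lambda: dataset.push_to_hub("user/repo")`.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return push()
        except Exception as e:
            print(f"Attempt {attempt} failed: {e!r}")
            # The lower-level exception attached via `raise ... from ...`, if any.
            print(f"Cause: {e.__cause__!r}")
            # Full stack trace, including any chained exceptions.
            traceback.print_exc()
            if attempt == max_attempts:
                raise
            # Exponential backoff before the next attempt.
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Printing `e.__cause__` directly matters here because formatting a traceback object with `str()` only yields something like `<traceback object at 0x...>`, which carries no detail.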

> Should I build the parquet manually and then push manually as well? If I do things manually, how can I ensure my dataset works properly with "stream=True"?

Parquet is supported out of the box ^^

If you want to make sure it works as expected you can try locally first:

ds = load_dataset("path/to/local", streaming=True)

@lhoestq @AntreasAntoniou I transferred this issue to the datasets repository as the questions and answers are more related to this repo. Hope it can help other users find the bug and fixes more easily (like updating tqdm and pyarrow or setting a lower max_shard_size).

~For the initial "pushing large dataset consistently hangs"-issue, I still think it's best to try to save_to_disk first and then upload it manually/with a script (see upload_folder). It's not the most satisfying solution but at least it would confirm from where the problem comes from.~

EDIT: removed suggestion about saving to disk first (see https://github.com/huggingface/datasets/issues/5990#issuecomment-1607186914).

> @lhoestq @AntreasAntoniou I transferred this issue to the datasets repository as the questions and answers are more related to this repo. Hope it can help other users find the bug and fixes more easily (like updating tqdm (https://github.com/huggingface/datasets/issues/5990#issuecomment-1607120204) and pyarrow (https://github.com/huggingface/datasets/issues/5990#issuecomment-1607120278) or setting a lower max_shard_size (https://github.com/huggingface/datasets/issues/5990#issuecomment-1607120328)).

thanks :)

> For the initial "pushing large dataset consistently hangs"-issue, I still think it's best to try to save_to_disk first and then upload it manually/with a script (see upload_folder). It's not the most satisfying solution but at least it would confirm from where the problem comes from.

As I've already said in other discussions, I would not recommend pushing files saved with save_to_disk to the Hub but save to parquet shards and upload them instead. The Hub does not support datasets saved with save_to_disk, which is meant for disk only.
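A rough sketch of what "save to parquet shards and upload them" could look like. The helper names, the `data/` layout, the shard naming scheme and the default shard count are assumptions made for illustration; `Dataset.shard`, `Dataset.to_parquet` and `HfApi.upload_folder` are existing calls. Unlike `push_to_hub`, this sketch does not embed external media files into the shards.

```python
from pathlib import Path


def shard_filename(split, index, num_shards):
    """Build a shard name such as 'train-00005-of-00023.parquet' (assumed scheme)."""
    return f"{split}-{index:05d}-of-{num_shards:05d}.parquet"


def save_parquet_shards(dataset, out_dir, split="train", num_shards=8):
    """Write `dataset` (a `datasets.Dataset`) to `out_dir/data/` as parquet shards."""
    data_dir = Path(out_dir) / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    for index in range(num_shards):
        # contiguous=True keeps the original row order across shards.
        shard = dataset.shard(num_shards=num_shards, index=index, contiguous=True)
        shard.to_parquet(str(data_dir / shard_filename(split, index, num_shards)))
    return data_dir


def upload_shards(out_dir, repo_id):
    """Upload the whole folder to a dataset repository on the Hub."""
    # Imported lazily so the helpers above work without huggingface_hub installed.
    from huggingface_hub import HfApi

    HfApi().upload_folder(folder_path=str(out_dir), repo_id=repo_id, repo_type="dataset")
```

Writing the shards to disk first also makes it possible to check them locally with `load_dataset(..., streaming=True)` before anything is uploaded.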

> As I've already said in other discussions, I would not recommend pushing files saved with save_to_disk to the Hub but save to parquet shards and upload them instead. The Hub does not support datasets saved with save_to_disk, which is meant for disk only.

Well noted, thanks. That part was not clear to me :)

Sorry for not replying in a few days, I was on leave. :)

So, here is more information about the error that causes some of the delay:

Pushing Antreas/TALI-tiny to hub
Attempting to push to hub
Pushing split train to the Hub.
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:24<00:00,  4.06s/ba]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:24<00:00,  4.15s/ba]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:26<00:00,  4.45s/ba]
/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/lfs.py:310: UserWarning: hf_transfer is enabled but does not support uploading from bytes or BinaryIO, falling back to regular upload
  warnings.warn(
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:25<00:00,  4.26s/ba]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:27<00:00,  4.58s/ba]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:24<00:00,  4.10s/ba]
Pushing dataset shards to the dataset hub:  22%|████████████████████████▎                                                                                       | 5/23 [52:23<3:08:37, 628.74s/it]
Exception: Error while uploading 'data/train-00005-of-00023-e224d901fd65e062.parquet' to the Hub., with stacktrace: <traceback object at 0x7f745458d0c0>, and type: <class 'RuntimeError'>, and 
cause: HTTPSConnectionPool(host='s3.us-east-1.amazonaws.com', port=443): Max retries exceeded with url: 
/lfs.huggingface.co/repos/7c/d3/7cd385d9324302dc13e3986331d72d9be6fa0174c63dcfe0e08cd474f7f1e8b7/3415166ae28c0beccbbc692f38742b8dea2c197f5c805321104e888d21d7eb90?X-Amz-Algorithm=AWS4-HMAC-SHA256
&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIA4N7VTDGO27GPWFUO%2F20230627%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230627T003349Z&X-Amz-Expires=86400&X-Amz-Signature=5a12ff96f2
91f644134170992a6628e5f3c4e7b2e7fc3e940b4378fe11ae5390&X-Amz-SignedHeaders=host&partNumber=1&uploadId=JSsK8r63XSF.VlKQx3Vf8OW4DEVp5YIIY7LPnuapNIegsxs5EHgM1p4u0.Nn6_wlPlQnvxm8HKMxZhczKE9KB74t0etB
oLcxqBIvsgey3uXBTZMAEGwU6y7CDUADiEIO&x-id=UploadPart (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:2426)')))
Push failed, retrying
Attempting to push to hub
Pushing split train to the Hub.

One issue is that the upload does not resume from the chunk it failed on. It often resumes from a much earlier chunk, e.g. if it failed on chunk 192/250, it will continue from, say, 53/250, and this behaviour appears almost random.

Are you using a proxy of some sort ?

I am using a kubernetes cluster built into a university VPN.

So, other than the random connection drops here and there, any idea why the progress does not continue where it left off?

Pushing split train to the Hub.
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:02<00:00, 10.79ba/s]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:02<00:00, 13.65ba/s]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:02<00:00, 13.39ba/s]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:02<00:00, 13.04ba/s]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:02<00:00, 13.52ba/s]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:02<00:00, 12.28ba/s]
Pushing dataset shards to the dataset hub:  20%|██████████████████████                                                                                          | 75/381 [1:34:39<6:26:11, 75.72s/it]
Exception: Error while uploading 'data/train-00075-of-00381-1614bc251b778766.parquet' to the Hub., with stacktrace: <traceback object at 0x7fab6d9a4980>, and type: <class 'RuntimeError'>, and 
cause: HTTPSConnectionPool(host='s3.us-east-1.amazonaws.com', port=443): Max retries exceeded with url: 
/lfs.huggingface.co/repos/3b/31/3b311464573d8d63b137fcd5b40af1e7a5b1306843c88e80372d0117157504e5/ed8dae933fb79ae1ef5fb1f698f5125d3e1c02977ac69438631f152bb3bfdd1e?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-
Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIA4N7VTDGO27GPWFUO%2F20230629%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230629T053004Z&X-Amz-Expires=86400&X-Amz-Signature=da2b26270edfd6d0
d069c015a5a432031107a8664c3f0917717e5e40c688183c&X-Amz-SignedHeaders=host&partNumber=1&uploadId=2erWGHTh3ICqBLU_QvHfnygZ2tkMWbL0rEqpJdYohCKHUHnfwMjvoBIg0TI_KSGn4rSKxUxOyqSIzFUFSRSzixZeLeneaXJOw.Qx8
zLKSV5xV7HRQDj4RBesNve6cSoo&x-id=UploadPart (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:2426)')))
Push failed, retrying
Attempting to push to hub
Pushing split train to the Hub.
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:02<00:00, 12.09ba/s]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:02<00:00, 11.51ba/s]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:02<00:00, 10.77ba/s]
Pushing dataset shards to the dataset hub:  20%|██████████████████████▋                                                                                         | 77/381 [1:32:50<6:06:34, 72.35s/it]
Exception: Error while uploading 'data/train-00077-of-00381-368b2327a9908aab.parquet' to the Hub., with stacktrace: <traceback object at 0x7fab45b27f80>, and type: <class 'RuntimeError'>, and 
cause: HTTPSConnectionPool(host='s3.us-east-1.amazonaws.com', port=443): Max retries exceeded with url: 
/lfs.huggingface.co/repos/3b/31/3b311464573d8d63b137fcd5b40af1e7a5b1306843c88e80372d0117157504e5/9462ff2c5e61283b53b091984a22de2f41a2f6e37b681171e2eca4a998f979cb?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-
Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIA4N7VTDGO27GPWFUO%2F20230629%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230629T070510Z&X-Amz-Expires=86400&X-Amz-Signature=9ab8487b93d443cd
21f05476405855d46051a0771b4986bbb20f770ded21b1a4&X-Amz-SignedHeaders=host&partNumber=1&uploadId=UiHX1B.DcoAO2QmIHpWpCuNPwhXU_o1dsTkTGPqZt1P51o9k0yz.EsFD9eKpQMwgAST3jOatRG78I_JWRBeLBDYYVNp8r0TpIdeSg
eUg8uwPZOCPw9y5mWOw8MWJrnBo&x-id=UploadPart (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:2426)')))
Push failed, retrying
Attempting to push to hub
Pushing split train to the Hub.
Pushing dataset shards to the dataset hub:   8%|████████▋                                                                                                         | 29/381 [27:39<5:50:03, 59.67s/it]
Map: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2764/2764 [00:56<00:00, 48.82 examples/s]
Pushing dataset shards to the dataset hub:   8%|████████▉                                                                                                         | 30/381 [28:35<5:43:03, 58.64s/it]
Pushing dataset shards to the dataset hub:   8%|█████████▎                                                                                                        | 31/381 [29:40<5:52:18, 60.40s/it]
Pushing dataset shards to the dataset hub:   8%|█████████▌                                                                                                        | 32/381 [30:46<6:02:20, 62.29s/it]
Map:  36%|███████████████████████████████████████████████████▎

This is actually the issue that wastes the most time for me, and I need it fixed. Please advise on how I can go about it.

Notice how the progress goes from 77/381 to 30/381.

If any shard is missing on the Hub, it will be re-uploaded. It looks like the 30th shard was missing on the Hub in your case.

It also means that the other files up to the 77th that were successfully uploaded won't be uploaded again.
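The resume behaviour described above can be sketched as a simple set difference. This is an illustration and not the library's actual code; the function name and the shard paths in the usage note are made up, and the list of files on the Hub could come from something like `HfApi().list_repo_files(repo_id, repo_type="dataset")`.

```python
def shards_to_upload(expected_paths, files_on_hub):
    """Return the expected shard paths that are not yet on the Hub, in order."""
    existing = set(files_on_hub)
    return [path for path in expected_paths if path not in existing]
```

For instance, with expected shards `a`, `b`, `c` and only `a` and `c` already in the repository, only `b` is returned. Note that the comparison is on the full file name, and the shard names in the logs above end in a hash-like suffix. If that suffix differed between runs, a shard that was already uploaded would no longer match and would be uploaded again.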

cc @mariosasko who might know better

@lhoestq That can't be right. The 30th shard was successfully pushed earlier. I confirmed that at the time.

It somehow went back to 22 now.

Pushing dataset shards to the dataset hub:  20%|██████████████████████▉                                                                                         | 78/381 [1:16:47<5:43:43, 68.06s/it]
Creating parquet from Arrow format: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:02<00:00, 12.95ba/s]
Pushing dataset shards to the dataset hub:  21%|███████████████████████▏                                                                                        | 79/381 [1:18:16<6:15:34, 74.62s/it]
Creating parquet from Arrow format: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:02<00:00, 13.29ba/s]
Pushing dataset shards to the dataset hub:  21%|███████████████████████▌                                                                                        | 80/381 [1:19:39<6:25:33, 76.86s/it]
Creating parquet from Arrow format: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:02<00:00, 11.94ba/s]
Pushing dataset shards to the dataset hub:  21%|███████████████████████▌                                                                                        | 80/381 [1:37:18<6:06:06, 72.98s/it]
Exception: Error while uploading 'data/train-00080-of-00381-062438dd5e7ca2d7.parquet' to the Hub., with stacktrace: <traceback object at 0x7fab45ba0080>, and type: <class 'RuntimeError'>, and 
cause: HTTPSConnectionPool(host='s3.us-east-1.amazonaws.com', port=443): Max retries exceeded with url: 
/lfs.huggingface.co/repos/3b/31/3b311464573d8d63b137fcd5b40af1e7a5b1306843c88e80372d0117157504e5/c6b3b2de546aa432c14341a4f7691dd7518ac49dc2a5635b47937dd59007b93b?X-Amz-Algorithm=AWS4-HMAC-SHA256&
X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIA4N7VTDGO27GPWFUO%2F20230629%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230629T084450Z&X-Amz-Expires=86400&X-Amz-Signature=98986fc1300e
e3f47e9bc9d7f1ee8b303d5ed3e1959d9fa988cdc5c49c457054&X-Amz-SignedHeaders=host&partNumber=1&uploadId=AWFFr6YCiEl.uXo8.EP00v9KlT7z_atlfnuI.DA1zzDf3sq2OY5HabWAQ480nnajYvJdHYif3.YCJxTTmtATT3_pfQBjwTc
4AsIRPaip5blkRVINhe69WyPo_sreoHdv&x-id=UploadPart (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:2426)')))
Push failed, retrying
Attempting to push to hub
Pushing split train to the Hub.
Creating parquet from Arrow format: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:02<00:00, 13.46ba/s]
Creating parquet from Arrow format: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:02<00:00, 12.54ba/s]
Creating parquet from Arrow format: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:02<00:00, 13.08ba/s]
Creating parquet from Arrow format: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:02<00:00, 13.16ba/s]
Pushing dataset shards to the dataset hub:  22%|███████████████████████▉                                                                                      | 83/381 [1:40:31<6:00:54, 72.67s/it]
Exception: Error while uploading 'data/train-00083-of-00381-7f61e92530de6c6f.parquet' to the Hub., with stacktrace: <traceback object at 0x7fab45b27f80>, and type: <class 'RuntimeError'>, and 
cause: HTTPSConnectionPool(host='s3.us-east-1.amazonaws.com', port=443): Max retries exceeded with url: 
/lfs.huggingface.co/repos/3b/31/3b311464573d8d63b137fcd5b40af1e7a5b1306843c88e80372d0117157504e5/a2efe54ae3c5eaa5161fef804a3f633a333e9336560d879ab1dcc684ac5f298f?X-Amz-Algorithm=AWS4-HMAC-SHA256&
X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIA4N7VTDGO27GPWFUO%2F20230629%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230629T102743Z&X-Amz-Expires=86400&X-Amz-Signature=0b3d008d3e39
20efa9ecf18ef4d896b2d1d82e6e67f4ed33770e7e8896b738f6&X-Amz-SignedHeaders=host&partNumber=1&uploadId=1uab1rS4FApXg_6J7WIU6papbUY2Cm1W8cla15LeqUvbDyDm_3_BQzMkiOhqBt2odhoqTZPKO8uY0zQ3XWOXSeezPumliRw
CJbCnRDt__xowcCZp2.NZklf2kpuPJxB_&x-id=UploadPart (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:2426)')))
Push failed, retrying
Attempting to push to hub
Pushing split train to the Hub.
Pushing dataset shards to the dataset hub:   6%|██████▍                                                                                                         | 22/381 [20:47<6:10:15, 61.88s/it]
Map:  36%|████████████████████████████████████████████████████                                                                                            | 1000/2764 [00:22<00:40, 44.01 examples/s

I see, maybe there's a bug that would cause the fingerprint to not be deterministic then. Sorry for the inconvenience, we'll investigate

According to the commits list, the files are being uploaded in order without duplicates. No file is uploaded twice.

Therefore it seems there is something else that is blocking the resuming progress mid-way somehow.

For each shard we first load the media files (e.g. images, audio) into the Arrow data before uploading. Could it be this step that is hanging ? If you try to interrupt the program when it hangs at a shard that has already been uploaded, what does the stacktrace say ? It could help locate what part in the code is blocking it.

I'll do that when it hangs. Meanwhile, if the files are uploaded in order, why can't the process automatically see that locally? It's remapping all the shards, and then I guess 'pushes' them, but sees that they exist and moves to the next. However, the processing time of remapping is significant. Is there a way to fix this? To avoid this remapping process for shards that have already been pushed.

I agree we should ideally check if the file has been uploaded before embedding the media files indeed !

Can you point me to the relevant part of the code? I wouldn't mind taking care of this. :)

Sure! The for loop that iterates over the shards to upload and checks whether each file has already been uploaded is here:

https://github.com/huggingface/datasets/blob/e9aee64766aaddfda60a735cfc93345aed64bdcf/src/datasets/arrow_dataset.py#L5283-L5291

and the code that applies the external files embedding to arrow is a few lines earlier:

https://github.com/huggingface/datasets/blob/e9aee64766aaddfda60a735cfc93345aed64bdcf/src/datasets/arrow_dataset.py#L5254-L5267

I think one way to make it work would be to call path_in_repo and check if the file is in the repository before calling map
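One way to picture that reordering, as a sketch only: every argument below is a hypothetical stand-in and none of this is the code in arrow_dataset.py. It also assumes the repository path of a shard can be computed before the media files are embedded, which may not hold if the path depends on a fingerprint of the embedded shard.

```python
def push_shards(shards, path_in_repo, files_on_hub, embed, upload):
    """Upload only the missing shards, embedding media just before each upload.

    All arguments are hypothetical stand-ins:
      shards       -- iterable of (index, shard) pairs
      path_in_repo -- function (index, shard) -> repository path of the shard
      files_on_hub -- collection of paths already present in the repository
      embed        -- the expensive step that loads media bytes into a shard
      upload       -- function (path, embedded_shard) that performs the upload
    """
    existing = set(files_on_hub)
    uploaded = []
    for index, shard in shards:
        path = path_in_repo(index, shard)
        if path in existing:
            # Already on the Hub: skip before paying for the embedding step.
            continue
        upload(path, embed(shard))
        uploaded.append(path)
    return uploaded
```

With this ordering, a restart after a failure only spends time on `embed` for shards that still need to be uploaded.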

@lhoestq This took a while due to vacations, but I now have a working draft at https://github.com/huggingface/datasets/pull/6056

If you could review and comment that'd be great!

This comment has the code that you can run to avoid rerunning the "embed external data" step.

Also, as mentioned in the comment, these bytes will be embedded automatically in Datasets 3.0 to, among other things, make push_to_hub faster.

I just tried the latest version of datasets with push_to_hub on a large dataset, and things still seem to hang. Here is the output after forcibly killing the process with Ctrl+C:

Starting preparation and upload with arguments dataset_name: Antreas/TALI-big-2.0, data_percentage: 1.0, num_data_samples: None, max_shard_size: 10GB, num_workers: 1                 
Map: 100%|________________________________________________________________________________________________________________________________| 2633/2633 [00:51<00:00, 50.84 examples/s]              
Creating parquet from Arrow format: 100%|____________________________________________________________________________________________________________| 27/27 [01:35<00:00,  3.53s/ba] 
Map: 100%|________________________________________________________________________________________________________________________________| 2633/2633 [01:03<00:00, 41.22 examples/s]                                                                                          
Creating parquet from Arrow format: 100%|____________________________________________________________________________________________________________| 27/27 [01:32<00:00,  3.44s/ba]                         
Pushing dataset shards to the dataset hub:   0%|_                                                                                              | 1/400 [19:45<131:21:16, 1185.15s/it]                                                                                          
^CError in sys.excepthook:                                                                                                                                                                                    
Traceback (most recent call last):                                                                                                                                                                                                                                             
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1699, in print                                                                                                                                                                                 
    extend(render(renderable, render_options))                                                                                                                                                                
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1335, in render                                                                                                              
    yield from self.render(render_output, _options)                                                                                                                                                           
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1331, in render                                                                                                              
    for render_output in iter_render:                                                                                                                                                                                                                                          
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/constrain.py", line 29, in __rich_console__                                                                                                    
    yield from console.render(self.renderable, child_options)                                                                                                                                                                                                                  
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1331, in render                                                                                                              
    for render_output in iter_render:                                                                                                                                                                                                                                          
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/panel.py", line 220, in __rich_console__                                                                                                       
    lines = console.render_lines(renderable, child_options, style=style)                                                                                                                                                                                                       
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1371, in render_lines                                                                                                        
    lines = list(                                                                                                                                                                                                                                                              
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/segment.py", line 292, in split_and_crop_lines                                                                         
    for segment in segments:                                                                                                                                                                                                                                                   
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1331, in render                                                                                                              
    for render_output in iter_render:                                                                                                                                                                                                                                          
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/padding.py", line 97, in __rich_console__                                                                                                      
    lines = console.render_lines(                                                                                                                                                                                                                                              
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1371, in render_lines                                                                                                        
    lines = list(                                                                                                                                                                                                                                                              
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/segment.py", line 292, in split_and_crop_lines                                                                                                                                                                  
    for segment in segments:                                                                                                                                                                                                                                                   
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1335, in render                                                                                                                                    
    yield from self.render(render_output, _options)                                                                                                                                                                                                                            
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1331, in render                                                                                                                                    
    for render_output in iter_render:                                                                                                                                                                                                                                          
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/syntax.py", line 611, in __rich_console__                                                                                                                                                                                                                         
    segments = Segments(self._get_syntax(console, options))                                                                                                                                                                                                                    
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/segment.py", line 668, in __init__                                                                                                                                                                                                                                
    self.segments = list(segments)                                                                                                                                                                                                                                             
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/syntax.py", line 639, in _get_syntax                                                                                                                                                                                                                              
    text = self.highlight(processed_code, self.line_range)                                                                                                                                                                                                                     
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/syntax.py", line 470, in highlight                                                                                                                                   
    lexer = self.lexer                                                                                                                                                                                                                                                         
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/syntax.py", line 433, in lexer                                                                                                                                                                                                                                    
    return get_lexer_by_name(                                                                                                                                                                                                                                                  
  File "/opt/conda/envs/main/lib/python3.10/site-packages/pygments/lexers/__init__.py", line 126, in get_lexer_by_name                                                                                                                                                                                                           
    return _lexer_cache[name](**options)                                                                                                                                                                                                                                       
  File "/opt/conda/envs/main/lib/python3.10/site-packages/pygments/lexer.py", line 641, in __call__                                                                                                           
    cls._tokens = cls.process_tokendef('', cls.get_tokendefs())                                                                        
  File "/opt/conda/envs/main/lib/python3.10/site-packages/pygments/lexer.py", line 580, in process_tokendef                                                                                                                                                                                                                      
    cls._process_state(tokendefs, processed, state)                                                                                    
  File "/opt/conda/envs/main/lib/python3.10/site-packages/pygments/lexer.py", line 543, in _process_state                                                                                                                                                                                                                        
    tokens.extend(cls._process_state(unprocessed, processed,                                                                           
  File "/opt/conda/envs/main/lib/python3.10/site-packages/pygments/lexer.py", line 543, in _process_state                                                                                                                                                                                                                        
    tokens.extend(cls._process_state(unprocessed, processed,                                                                                                                                       
  File "/opt/conda/envs/main/lib/python3.10/site-packages/pygments/lexer.py", line 559, in _process_state                                                                                                                                                                                                                        
    rex = cls._process_regex(tdef[0], rflags, state)                                                                                                                                               
  File "/opt/conda/envs/main/lib/python3.10/site-packages/pygments/lexer.py", line 488, in _process_regex                                                                                                                                                                                                                        
    return re.compile(regex, rflags).match                                                                                                                                                         
  File "/opt/conda/envs/main/lib/python3.10/re.py", line 251, in compile                                                                                                                                                                                                                                                         
    return _compile(pattern, flags)                                                                                                    
  File "/opt/conda/envs/main/lib/python3.10/re.py", line 303, in _compile                                                                                                                                                                                                                                                        
    p = sre_compile.compile(pattern, flags)                                                                                                                                                        
  File "/opt/conda/envs/main/lib/python3.10/sre_compile.py", line 792, in compile                                                                                                                                                                                                                                                
    code = _code(p, flags)                                                                                                                                                                         
  File "/opt/conda/envs/main/lib/python3.10/sre_compile.py", line 631, in _code                                                                                                                                                                                                                                                  
    _compile(code, p.data, flags)                                                                                                                                                                                                                                              
  File "/opt/conda/envs/main/lib/python3.10/sre_compile.py", line 136, in _compile                                                                                                                                                                                                                                               
    charset, hascased = _optimize_charset(av, iscased, tolower, fixes)                                                                                                                                                                                                                                                           
  File "/opt/conda/envs/main/lib/python3.10/sre_compile.py", line 328, in _optimize_charset                                                                                                                                                                                                                                      
    charmap[i] = 1                                                                                                                                                                                                                                                                                                               
KeyboardInterrupt                                                                                                                                                                                                                                                                                                                

Original exception was:                                                                                                                                                                                                                                                                                                          
Traceback (most recent call last):                                                                                                                                                                 
  File "/root/TALI/tali/scripts/upload_dataset_from_disk_to_hf.py", line 73, in <module>                                                                                                                                                                                                                                         
    fire.Fire(main)                                                                                                                                                                                
  File "/opt/conda/envs/main/lib/python3.10/site-packages/fire/core.py", line 141, in Fire                                                                                                                                                                                                                                       
    component_trace = _Fire(component, args, parsed_flag_args, context, name)                                                                                                                      
  File "/opt/conda/envs/main/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire                                                                                                                                                                                                                                      
    component, remaining_args = _CallAndUpdateTrace(                                                                                                                                               
  File "/opt/conda/envs/main/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace                                                                                                                                                                                                                        
    component = fn(*varargs, **kwargs)                                                                                                                                                             
  File "/root/TALI/tali/scripts/upload_dataset_from_disk_to_hf.py", line 62, in main                                                                                                                                                                                                                                             
    dataset.push_to_hub(                                                                                                                                                                           
  File "/opt/conda/envs/main/lib/python3.10/site-packages/datasets/dataset_dict.py", line 1641, in push_to_hub                                                                                                                                                                                                                   
    repo_id, split, uploaded_size, dataset_nbytes, _, _ = self[split]._push_parquet_shards_to_hub(                                                                                                 
  File "/opt/conda/envs/main/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 5307, in _push_parquet_shards_to_hub                                                                                                                                                                                                  
    _retry(                                                                                                                                                                                        
  File "/opt/conda/envs/main/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 290, in _retry                                                                                       
    return func(*func_args, **func_kwargs)                                                                                                                                                         
  File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn                                                                                                             
    return fn(*args, **kwargs)                                                                                                                                                                     
  File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 828, in _inner                                                                                                                           
    return fn(self, *args, **kwargs)                                                                                                                                                               
  File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 3221, in upload_file                                                                                                                     
    commit_info = self.create_commit(                                                                                                                                                              
  File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn                                                                                                             
    return fn(*args, **kwargs)                                                                                                                                                                     
  File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 828, in _inner                                                                                                                           
    return fn(self, *args, **kwargs)                                                                                                                                                               
  File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 2695, in create_commit                                                                                                                   
    upload_lfs_files(                                                                                                                                                                              
  File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn                                                                                                             
    return fn(*args, **kwargs)                                                                                                                                                                     
  File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/_commit_api.py", line 393, in upload_lfs_files                                                                           
    _wrapped_lfs_upload(filtered_actions[0])                                                                                                                                                       
  File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/_commit_api.py", line 383, in _wrapped_lfs_upload                                                                                                                                                                                                      
    lfs_upload(operation=operation, lfs_batch_action=batch_action, token=token)                                                                                                                    
  File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/lfs.py", line 223, in lfs_upload                                                                                         
    _upload_multi_part(operation=operation, header=header, chunk_size=chunk_size, upload_url=upload_action["href"])                                                                                
  File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/lfs.py", line 319, in _upload_multi_part                                                                                 
    else _upload_parts_iteratively(operation=operation, sorted_parts_urls=sorted_parts_urls, chunk_size=chunk_size)                                                                                
  File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/lfs.py", line 375, in _upload_parts_iteratively                                                                          
    part_upload_res = http_backoff("PUT", part_upload_url, data=fileobj_slice)                                                                                                                     
  File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 258, in http_backoff                                                                               
    response = session.request(method=method, url=url, **kwargs)                                                                                                                                   
  File "/opt/conda/envs/main/lib/python3.10/site-packages/requests/sessions.py", line 589, in request                                                                                              
    resp = self.send(prep, **send_kwargs)                                                                                                                                                          
  File "/opt/conda/envs/main/lib/python3.10/site-packages/requests/sessions.py", line 703, in send                                                                                                 
    r = adapter.send(request, **kwargs)                                                                                                                                                            
  File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 63, in send                                                                                        
    return super().send(request, *args, **kwargs)                                                                                                                                                  
  File "/opt/conda/envs/main/lib/python3.10/site-packages/requests/adapters.py", line 486, in send                                                                                                 
    resp = conn.urlopen(                                                                                                                                                                                                            
  File "/opt/conda/envs/main/lib/python3.10/site-packages/urllib3/connectionpool.py", line 714, in urlopen                                                                                         
    httplib_response = self._make_request(                                                                                                                                                         
  File "/opt/conda/envs/main/lib/python3.10/site-packages/urllib3/connectionpool.py", line 415, in _make_request                                                                                                                    
    conn.request(method, url, **httplib_request_kw)                                                                                                                                                                                 
  File "/opt/conda/envs/main/lib/python3.10/site-packages/urllib3/connection.py", line 244, in request                                                                                                                              
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)                                                                                                                                                    
  File "/opt/conda/envs/main/lib/python3.10/http/client.py", line 1283, in request                                                                                                                                                  
    self._send_request(method, url, body, headers, encode_chunked)                                                                                                                                                                  
  File "/opt/conda/envs/main/lib/python3.10/http/client.py", line 1329, in _send_request                                                                                                                                            
    self.endheaders(body, encode_chunked=encode_chunked)                                                                                                                                                                            
  File "/opt/conda/envs/main/lib/python3.10/http/client.py", line 1278, in endheaders                                                                                                                                               
    self._send_output(message_body, encode_chunked=encode_chunked)                                                                                                                                                                  
  File "/opt/conda/envs/main/lib/python3.10/http/client.py", line 1077, in _send_output                                                                                                                                             
    self.send(chunk)                                                                                                                                                                                                                
  File "/opt/conda/envs/main/lib/python3.10/http/client.py", line 999, in send                                                                                                                                                      
    self.sock.sendall(data)                                                                                                                                                                                                         
  File "/opt/conda/envs/main/lib/python3.10/ssl.py", line 1237, in sendall                                                                                                                                                          
    v = self.send(byte_view[count:])                                                                                                                                                                                                
  File "/opt/conda/envs/main/lib/python3.10/ssl.py", line 1206, in send                                                                                                                                                             
    return self._sslobj.write(data)                                                                                                                                                                                                 
KeyboardInterrupt                                                                                                                                                                                                                   
^C^C^C_

It seems to hang during the PUT request that uploads the data. Can you check your network?
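To confirm where it is stuck without pressing Ctrl-C, one option is the standard library's `faulthandler`, which can dump every thread's stack on a timer. Below is a minimal sketch, not something `datasets` provides: the `hang_watchdog` helper name, the 600-second interval, and the repo name in the usage comment are all assumptions made for illustration.

```python
import contextlib
import faulthandler
import sys


@contextlib.contextmanager
def hang_watchdog(seconds, file=None):
    """Dump all thread stacks to `file` each time `seconds` elapse inside the block."""
    # faulthandler writes straight to the file descriptor, so `file` must have fileno().
    target = file if file is not None else sys.stderr
    faulthandler.dump_traceback_later(seconds, repeat=True, file=target)
    try:
        yield
    finally:
        # Always disarm the timer, even if the wrapped call raises.
        faulthandler.cancel_dump_traceback_later()


# Hypothetical usage around the upload call from the traceback above
# (repo name and interval are placeholders):
#
# with hang_watchdog(600):
#     dataset.push_to_hub("user/repo")
```

If successive dumps keep ending in the same `ssl.py` `send` frame, the process is most likely blocked writing to the socket, which would point at the connection rather than at the dataset code.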

I'm having the same issue now, with image/text data.
