huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Error while downloading the xtreme udpos dataset #5594

Closed simran-khanuja closed 1 year ago

simran-khanuja commented 1 year ago

Describe the bug

Hi,

I am facing an error while downloading the xtreme udpos dataset using load_dataset. I have datasets 2.10.1 installed.

Downloading data:  16%|██████████████▏                                                                          | 56.9M/355M [03:11<16:43, 297kB/s]
Generating train split:   0%|                                                                                      | 0/6075 [00:00<?, ? examples/s]
Traceback (most recent call last):
  File "/home/skhanuja/miniconda3/envs/multilingual_ft/lib/python3.10/site-packages/datasets/builder.py", line 1608, in _prepare_split_single
    for key, record in generator:
  File "/home/skhanuja/.cache/huggingface/modules/datasets_modules/datasets/xtreme/29f5d57a48779f37ccb75cb8708d1095448aad0713b425bdc1ff9a4a128a56e4/xtreme.py", line 732, in _generate_examples
    yield from UdposParser.generate_examples(config=self.config, filepath=filepath, **kwargs)
  File "/home/skhanuja/.cache/huggingface/modules/datasets_modules/datasets/xtreme/29f5d57a48779f37ccb75cb8708d1095448aad0713b425bdc1ff9a4a128a56e4/xtreme.py", line 921, in generate_examples
    for path, file in filepath:
  File "/home/skhanuja/miniconda3/envs/multilingual_ft/lib/python3.10/site-packages/datasets/download/download_manager.py", line 158, in __iter__
    yield from self.generator(*self.args, **self.kwargs)
  File "/home/skhanuja/miniconda3/envs/multilingual_ft/lib/python3.10/site-packages/datasets/download/download_manager.py", line 211, in _iter_from_path
    yield from cls._iter_tar(f)
  File "/home/skhanuja/miniconda3/envs/multilingual_ft/lib/python3.10/site-packages/datasets/download/download_manager.py", line 167, in _iter_tar
    for tarinfo in stream:
  File "/home/skhanuja/miniconda3/envs/multilingual_ft/lib/python3.10/tarfile.py", line 2475, in __iter__
    tarinfo = self.next()
  File "/home/skhanuja/miniconda3/envs/multilingual_ft/lib/python3.10/tarfile.py", line 2344, in next
    raise ReadError("unexpected end of data")
tarfile.ReadError: unexpected end of data

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/skhanuja/Optimal-Resource-Allocation-for-Multilingual-Finetuning/src/train_al.py", line 855, in <module>
    main()
  File "/home/skhanuja/Optimal-Resource-Allocation-for-Multilingual-Finetuning/src/train_al.py", line 487, in main
    train_dataset = load_dataset(dataset_name, source_language, split="train", cache_dir=args.cache_dir, download_mode="force_redownload")
  File "/home/skhanuja/miniconda3/envs/multilingual_ft/lib/python3.10/site-packages/datasets/load.py", line 1782, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/skhanuja/miniconda3/envs/multilingual_ft/lib/python3.10/site-packages/datasets/builder.py", line 872, in download_and_prepare
    self._download_and_prepare(
  File "/home/skhanuja/miniconda3/envs/multilingual_ft/lib/python3.10/site-packages/datasets/builder.py", line 1649, in _download_and_prepare
    super()._download_and_prepare(
  File "/home/skhanuja/miniconda3/envs/multilingual_ft/lib/python3.10/site-packages/datasets/builder.py", line 967, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/skhanuja/miniconda3/envs/multilingual_ft/lib/python3.10/site-packages/datasets/builder.py", line 1488, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/home/skhanuja/miniconda3/envs/multilingual_ft/lib/python3.10/site-packages/datasets/builder.py", line 1644, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

Steps to reproduce the bug

train_dataset = load_dataset('xtreme', 'udpos.English', split="train", cache_dir=args.cache_dir, download_mode="force_redownload")

Expected behavior

Download the udpos dataset

Environment info

mariosasko commented 1 year ago

Hi! I cannot reproduce this error on my machine.

The raised error could mean that one of the downloaded files is corrupted. To verify this is not the case, you can run load_dataset as follows:

train_dataset = load_dataset('xtreme', 'udpos.English', split="train", cache_dir=args.cache_dir, download_mode="force_redownload", verification_mode="all_checks")
simran-khanuja commented 1 year ago

Hi! Apologies for the delayed response! I tried the above and it doesn't solve the issue. Actually, the dataset gets downloaded most of the time, but sometimes this error occurs (at random, afaik). Is it possible that there is a server issue for this particular dataset? I am able to download other datasets using the same code on the same machine with no issues :( This is the error I get now:

Downloading data:  16%|███████████████▌                                                                                   | 55.9M/355M [04:45<25:25, 196kB/s]
Traceback (most recent call last):
  File "/home/skhanuja/Optimal-Resource-Allocation-for-Multilingual-Finetuning/src/train_al.py", line 1107, in <module>
    main()
  File "/home/skhanuja/Optimal-Resource-Allocation-for-Multilingual-Finetuning/src/train_al.py", line 439, in main
    en_dataset = load_dataset("xtreme", "udpos.English", split="train", download_mode="force_redownload", verification_mode="all_checks")
  File "/home/skhanuja/miniconda3/envs/multilingual_ft/lib/python3.10/site-packages/datasets/load.py", line 1782, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/skhanuja/miniconda3/envs/multilingual_ft/lib/python3.10/site-packages/datasets/builder.py", line 872, in download_and_prepare
    self._download_and_prepare(
  File "/home/skhanuja/miniconda3/envs/multilingual_ft/lib/python3.10/site-packages/datasets/builder.py", line 1649, in _download_and_prepare
    super()._download_and_prepare(
  File "/home/skhanuja/miniconda3/envs/multilingual_ft/lib/python3.10/site-packages/datasets/builder.py", line 949, in _download_and_prepare
    verify_checksums(
  File "/home/skhanuja/miniconda3/envs/multilingual_ft/lib/python3.10/site-packages/datasets/utils/info_utils.py", line 62, in verify_checksums
    raise NonMatchingChecksumError(
datasets.utils.info_utils.NonMatchingChecksumError: Checksums didn't match for dataset source files:
['https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-3105/ud-treebanks-v2.5.tgz']
Set `verification_mode='no_checks'` to skip checksums verification and ignore this error
mariosasko commented 1 year ago

If this happens randomly, then this means the data file from the error message is not always downloaded correctly.

The only solution in this scenario is to download the dataset again by passing download_mode="force_redownload" to the load_dataset call.
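For reference, a minimal sketch of that suggestion as a retry loop (the wrapper itself is hypothetical and not part of the library; it simply re-calls load_dataset with a forced re-download when generation fails):

from datasets import load_dataset
from datasets.builder import DatasetGenerationError

train_dataset = None
for attempt in range(3):  # the corruption appears to happen at random, so retry a few times
    try:
        train_dataset = load_dataset(
            "xtreme",
            "udpos.English",
            split="train",
            download_mode="force_redownload",  # discard the (possibly corrupted) cached files
        )
        break
    except DatasetGenerationError:
        if attempt == 2:
            raise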

RuntimeRacer commented 11 months ago

Wow. So I effectively have to re-download a 1 TB dataset because of this, just because 3% of its parts are broken?

Why is this downloader library so sh*t and so badly documented? I found almost nothing on the net; at least I finally found this issue describing the problem. No words to express how disappointed I am by this dataset tool provided by Huggingface, which I sadly have to use because HF is the only place where the dataset I plan to work with is hosted...

I mean... a checksum check after download... or noticing that a part hit the timeout... and re-downloading it if it doesn't match... that's the content of every junior developer training session.

I added verification_mode="all_checks", and it really did compute checksums for 4096 parts of ~350 MB each... but then did nothing with them and still tried to extract, hitting the error again.

EDIT: Apparently it can be fixed with a little manual help: just delete the broken parts and their associated files from ~/.cache/huggingface/datasets/downloads.
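A rough Python sketch of that manual cleanup step, assuming the default cache location (the script, its variable names, and the choice to only probe tar archives are illustrative, not part of datasets):

import tarfile
from pathlib import Path

downloads = Path.home() / ".cache/huggingface/datasets/downloads"

for path in downloads.iterdir():
    # Skip the .lock/.json/.py sidecars and subdirectories; only probe the downloaded payloads.
    if not path.is_file() or path.suffix in {".lock", ".json", ".py"}:
        continue
    if not tarfile.is_tarfile(path):
        continue
    try:
        with tarfile.open(path) as tar:
            for _ in tar:  # walking the members raises ReadError on a truncated archive
                pass
    except tarfile.ReadError as err:
        print(f"BROKEN: {path} ({err}) -- delete it together with its .json/.lock sidecars")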

jaggzh commented 10 months ago

I'm getting it too, although just retrying fixed it for me. Still, the dataset is too large to have re-downloaded the whole thing when it's probably just one file with an issue. It would be good to know whether there's a way people could manually examine the files (first for sizes, then possibly checksums)... going to the web or elsewhere to compare and correct them by hand, if ever needed.

jaggzh commented 10 months ago

Okay, no, it got further but it is repeatedly giving me:


result["audio"] = {"path": path, "bytes": file.read()}
^^^^^^^^^^^
File "/usr/lib/python3.11/tarfile.py", line 687, in read
raise ReadError("unexpected end of data")
tarfile.ReadError: unexpected end of data

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/jaggz/src/transformers/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py", line 625, in <module>
main()
File "/home/jaggz/src/transformers/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py", line 360, in main
raw_datasets["train"] = load_dataset(
^^^^^^^^^^^^^
File "/home/jaggz/venvs/pynow/lib/python3.11/site-packages/datasets/load.py", line 2153, in load_dataset
builder_instance.download_and_prepare(
File "/home/jaggz/venvs/pynow/lib/python3.11/site-packages/datasets/builder.py", line 954, in download_and_prepare
self._download_and_prepare(
File "/home/jaggz/venvs/pynow/lib/python3.11/site-packages/datasets/builder.py", line 1717, in _download_and_prepare
super()._download_and_prepare(
File "/home/jaggz/venvs/pynow/lib/python3.11/site-packages/datasets/builder.py", line 1049, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/home/jaggz/venvs/pynow/lib/python3.11/site-packages/datasets/builder.py", line 1555, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "/home/jaggz/venvs/pynow/lib/python3.11/site-packages/datasets/builder.py", line 1712, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the datase
jaggzh commented 10 months ago

@RuntimeRacer

EDIT: Apparently it can be fixed with a little manual help: just delete the broken parts and their associated files from ~/.cache/huggingface/datasets/downloads.

How do you know the broken parts? Mine's consistently erroring and.. yeah, really this thing should be able to check the files (but where's that even done)...

2023-11-02 00:14:09.846055: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
/home/j/src/transformers/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py:299: FutureWarning: The use_auth_token argument is deprecated and will be removed in v4.34. Please use token instead.
  warnings.warn(
11/02/2023 00:14:37 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: False, 16-bits training: True
11/02/2023 00:14:37 - INFO - __main__ - Training/evaluation parameters Seq2SeqTrainingArguments(
  _n_gpu=1,
  adafactor=False,
  adam_beta1=0.9,
  adam_beta2=0.999,
  ...
  logging_dir=./whisper-tiny-en/runs/Nov02_00-14-28_jsys,
  ...
  run_name=./whisper-tiny-en,
  ...
  weight_decay=0.0,
)
11/02/2023 00:14:37 - INFO - __main__ - Training/evaluation parameters Seq2SeqTrainingArguments(
  _n_gpu=1,
  adafactor=False,
  ...
  logging_dir=./whisper-tiny-en/runs/Nov02_00-14-28_jsys,
  ...
  weight_decay=0.0,
)

Downloading data files: 0%| | 0/5 [00:00<?, ?it/s] Downloading data files: 100%|██████████| 5/5 [00:00<00:00, 2426.42it/s]

Extracting data files: 0%| | 0/5 [00:00<?, ?it/s] Extracting data files: 100%|██████████| 5/5 [00:00<00:00, 421.16it/s]

Downloading data files: 0%| | 0/5 [00:00<?, ?it/s] Downloading data files: 100%|██████████| 5/5 [00:00<00:00, 18707.87it/s]

Extracting data files: 0%| | 0/5 [00:00<?, ?it/s] Extracting data files: 100%|██████████| 5/5 [00:00<00:00, 3754.97it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Reading metadata...: 0it [00:00, ?it/s] ... Reading metadata...: 948736it [00:23, 40632.92it/s]

Generating train split: 1 examples [00:23, 23.37s/ examples] ... Generating train split: 948736 examples [08:28, 1866.15 examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Reading metadata...: 0it [00:00, ?it/s]

Reading metadata...: 16089it [00:00, 157411.88it/s] Reading metadata...: 16354it [00:00, 158233.27it/s]

Generating validation split: 1 examples [00:00, 7.60 examples/s] Generating validation split: 16354 examples [00:14, 1154.77 examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Reading metadata...: 0it [00:00, ?it/s] Reading metadata...: 16354it [00:00, 194855.03it/s]

Generating test split: 1 examples [00:00, 4.53 examples/s] Generating test split: 16354 examples [00:07, 2105.43 examples/s]

Generating other split: 0 examples [00:00, ? examples/s]

Reading metadata...: 0it [00:00, ?it/s] Reading metadata...: 290846it [00:01, 235823.90it/s]

Generating other split: 1 examples [00:01, 1.27s/ examples] ... Generating other split: 290846 examples [02:12, 2196.96 examples/s]

Generating invalidated split: 0 examples [00:00, ? examples/s]

Reading metadata...: 252599it [00:01, 241965.85it/s]

Generating invalidated split: 1 examples [00:01, 1.08s/ examples] ... Generating invalidated split: 60130 examples [00:34, 1764.14 examples/s]
Traceback (most recent call last):
  File "/home/j/venvs/pycur/lib/python3.11/site-packages/datasets/builder.py", line 1676, in _prepare_split_single
    for key, record in generator:
  File "/home/j/.cache/huggingface/modules/datasets_modules/datasets/mozilla-foundation--common_voice_11_0/3f27acf10f303eac5b6fbbbe02495aeddb46ecffdb0a2fe3507fcfbf89094631/common_voice_11_0.py", line 195, in _generate_examples
    result["audio"] = {"path": path, "bytes": file.read()}
                                              ^^^^^^^^^^^
  File "/usr/lib/python3.11/tarfile.py", line 687, in read
    raise ReadError("unexpected end of data")
tarfile.ReadError: unexpected end of data

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/j/src/transformers/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py", line 625, in <module>
    main()
  File "/home/j/src/transformers/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py", line 360, in main
    raw_datasets["train"] = load_dataset(
                            ^^^^^^^^^^^^^
  File "/home/j/venvs/pycur/lib/python3.11/site-packages/datasets/load.py", line 2153, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/j/venvs/pycur/lib/python3.11/site-packages/datasets/builder.py", line 954, in download_and_prepare
    self._download_and_prepare(
  File "/home/j/venvs/pycur/lib/python3.11/site-packages/datasets/builder.py", line 1717, in _download_and_prepare
    super()._download_and_prepare(
  File "/home/j/venvs/pycur/lib/python3.11/site-packages/datasets/builder.py", line 1049, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/j/venvs/pycur/lib/python3.11/site-packages/datasets/builder.py", line 1555, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/home/j/venvs/pycur/lib/python3.11/site-packages/datasets/builder.py", line 1712, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

RuntimeRacer commented 10 months ago

@jaggzh Hi, I actually came up with a fix for this. It wasn't that easy to solve since there were a lot of hidden pitfalls in the code, and it's quite hacky, but I was able to download the full dataset.

I just didn't create a PR for it yet since I was too lazy to create a fork and change my local repo's origin. 😅 Let me try to do this tonight, I'll give you a ping once it's up.

EDIT: And no, what I wrote above about adding a param to the download config does NOT solve it apparently. A code fix is required here.

RuntimeRacer commented 10 months ago

@jaggzh PR is up: https://github.com/huggingface/datasets/pull/6380

🤞 on approval for merge to the main repo.

jaggzh commented 10 months ago

@mariosasko Can you re-open this? We really need some better diagnostics output, at the least, to locate which files are contributing, some checksum output, etc. I can't even tell if this is a mozilla...py issue or huggingface datasets or ....

jaggzh commented 10 months ago

@RuntimeRacer Beautiful, thank you so much. I patched with your PR and am re-running now. (I'm running this script: https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py) Okay, actually it failed, so now I'm re-running with verification_mode='all_checks' added to the load_dataset() call. Wish me luck. (Note: it's generating checksums; I don't see an option that handles anything between basic_checks and all_checks -- something that checks downloaded files' lengths would be a good common fix, I'd think; corruption is rarer nowadays than a short file (although maybe your patch helps prevent that in the first place). :}

jaggzh commented 10 months ago

@RuntimeRacer No luck. Sigh. [Edit: My tmux copy didn't get some data. That was weird. I'm adding in the initial part of the output:]

Downloading data files: 100%|██████████| 5/5 [00:00<00:00, 2190.69it/s]
Computing checksums: 100%|██████████| 41/41 [11:39<00:00, 17.05s/it]
Extracting data files: 100%|██████████| 5/5 [00:00<00:00, 12.37it/s]
Downloading data files: 100%|██████████| 5/5 [00:00<00:00, 107.64it/s]
Extracting data files: 100%|██████████| 5/5 [00:00<00:00, 3149.82it/s]
Reading metadata...: 948736it [00:03, 243227.36it/s]
...
...
Reading metadata...: 252599it [00:01, 249267.71it/s]
Generating invalidated split: 60130 examples [00:31, 1916.33 examples/s]
Traceback (most recent call last):
  File "/home/j/src/py/datasets/src/datasets/builder.py", line 1676, in _prepare_split_single
    for key, record in generator:
  File "/home/j/.cache/huggingface/modules/datasets_modules/datasets/mozilla-foundation--common_voice_11_0/3f27acf10f303eac5b6fbbbe02495aeddb46ecffdb0a2fe3507fcfbf89094631/common_voice_11_0.py", line 195, in _generate_examples
    result["audio"] = {"path": path, "bytes": file.read()}
                                              ^^^^^^^^^^^
  File "/usr/lib/python3.11/tarfile.py", line 687, in read
    raise ReadError("unexpected end of data")
tarfile.ReadError: unexpected end of data

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/j/src/transformers/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py", line 627, in <module>
    main()
  File "/home/j/src/transformers/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py", line 360, in main
    raw_datasets["train"] = load_dataset(
                            ^^^^^^^^^^^^^
  File "/home/j/src/py/datasets/src/datasets/load.py", line 2153, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/j/src/py/datasets/src/datasets/builder.py", line 954, in download_and_prepare
    self._download_and_prepare(
  File "/home/j/src/py/datasets/src/datasets/builder.py", line 1717, in _download_and_prepare
    super()._download_and_prepare(
  File "/home/j/src/py/datasets/src/datasets/builder.py", line 1049, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/j/src/py/datasets/src/datasets/builder.py", line 1555, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/home/j/src/py/datasets/src/datasets/builder.py", line 1712
mariosasko commented 10 months ago

I'm unable to reproduce this error. Based on https://github.com/psf/requests/issues/4956, newer releases of urllib3 check the returned content length by default, so perhaps updating requests and urllib3 to the latest versions (pip install -U requests urllib3) and loading the dataset with datasets.load_dataset("xtreme", "udpos.English", download_config=datasets.DownloadConfig(resume_download=True)) (re-run when it fails to resume the download) can fix the issue.
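A sketch of that suggestion as code (the retry loop and the broad except clause are illustrative; in practice you may want to catch only connection-related errors):

import datasets

download_config = datasets.DownloadConfig(resume_download=True)

for attempt in range(10):
    try:
        ds = datasets.load_dataset(
            "xtreme",
            "udpos.English",
            download_config=download_config,
        )
        break
    except Exception as err:  # e.g. a dropped connection partway through a file
        print(f"attempt {attempt} failed ({err}); re-running to resume the download")
else:
    raise RuntimeError("download kept failing after 10 attempts")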

RuntimeRacer commented 10 months ago

@jaggzh I think you will need to re-download the whole dataset with my patched code. Files which have already been downloaded and marked as complete by the broken downloader won't be detected as corrupted even on a re-run (I described that in the PR). I also had to download reazonspeech, which is over 1TB, twice. 🙈 For the re-download, you need to manually delete the dataset files from your local machine's huggingface download cache.

@mariosasko Not sure how you tested it, but it's not an issue in requests or urllib. The problem is the huggingface downloader, which I think spawns a nested download thread for the actual download. The issue I had with the reazonspeech dataset (https://huggingface.co/datasets/reazon-research/reazonspeech/tree/main) was basically that it started downloading a part, but sometimes the connection would 'starve': it would only trickle in a few kilobytes and eventually stop receiving any data at all. Sometimes it would even recover during the download and finish properly. However, if it did not recover, the request would hit the really generous default timeout (which is 100 seconds, I think), and the exception thrown by the failure inside urllib isn't captured or handled by the upper-level downloader code of the datasets library. datasets even has a retry mechanism, which will continue interrupted downloads if they have the .incomplete suffix; that suffix isn't cleared if, for example, the user sends a manual CTRL+C to the python process. But if it runs into the edge case I described above (TL;DR: the connection starves after minutes, plus a timeout exception that isn't captured), the cache downloader will consider the download successful and remove the .incomplete suffix anyway, leaving the archive file in a corrupted state.

Honestly, I spent hours trying to figure out what was even going on and why the retry mechanics of the cache downloader didn't work at all. But it is indeed an issue caused by the download process itself not receiving any info about the actual content size or the archive's file size on disk, and thus having no direct control when something fails at the request level.

IMHO, this requires a major refactor of the way this part of the downloader works. Still, I was able to quick-fix it by adding some synthetic exception handling and explicit retry handling in the code, as done in my PR.

jaggzh commented 10 months ago

@RuntimeRacer Ugh. It took a day. I'm seeing if I can get some debug code in here to examine the files myself. (I'm not sure why checksum tests would fail, so, yeah, I think you're right -- this stuff needs some work. Going through ipdb right now to try to get some idea of what's going on in the code).

mariosasko commented 10 months ago

@RuntimeRacer Data can only be appended to the .incomplete files if load_dataset is called with download_config=DownloadConfig(resume_download=True).

Where exactly does this exception happen (in the code)? The error stack trace would help a lot.

RuntimeRacer commented 10 months ago

@mariosasko I do not have a trace of this exception nor do I know which type it is. I am honestly not even sure if an exception is thrown, or the process just aborts without error.

@RuntimeRacer Data can only be appended to the .incomplete files if load_dataset is called with download_config=DownloadConfig(resume_download=True).

Well, I think I gave a very clear explanation of the issue in the PR I shared and in the description above, but maybe I wasn't precise enough. Let me try to explain once more:

What you mention here is the "normal" case, where the process is aborted. In that case, there will be files with the .incomplete suffix, which the cache downloader can resume downloading. That is correct.

BUT: What I am talking about the whole time is an edge case: if the download step crashes or times out internally, the cache downloader will NOT be aware of this and REMOVES the .incomplete suffix. It does NOT know that the file is incomplete; once the http_get function returns, it removes the .incomplete suffix in any case. And the problem is that http_get returns without failure even if the download failed. This is still a problem even with the latest urllib and requests libraries.
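To make the described edge case concrete, here is a minimal illustration of the kind of check being argued for: only treat a download as complete once the bytes on disk match the size the server advertised. The function name and structure are invented for this example; it is not the datasets downloader's actual code:

import os
import requests

def fetch_with_size_check(url: str, dest: str, chunk_size: int = 1 << 20) -> None:
    """Write to dest + '.incomplete' and only rename once the size matches."""
    tmp = dest + ".incomplete"
    with requests.get(url, stream=True, timeout=100) as resp:
        resp.raise_for_status()
        expected = int(resp.headers.get("content-length", -1))
        with open(tmp, "wb") as f:
            for chunk in resp.iter_content(chunk_size):
                f.write(chunk)
    # A starved connection can make requests return without an error but with fewer bytes.
    if expected > 0 and os.path.getsize(tmp) != expected:
        raise OSError(f"incomplete download: got {os.path.getsize(tmp)} of {expected} bytes")
    os.replace(tmp, dest)  # only now is the file considered complete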

mariosasko commented 10 months ago

@RuntimeRacer Updating urllib3 and requests to the latest versions fixes the issue explained in this blog post.

However, the issue explained above seems more similar to this one. To address it, we can reduce the default timeout to 10 seconds (btw, this was the initial value, but it was causing problems for some users) and expose a config variable so that users can easily control it. Additionally, we can re-run http_get similarly to https://github.com/huggingface/huggingface_hub/pull/1766 when the connection/timeout error happens to make the logic even more robust. Would this work for you? The last part is what you did in the PR, right?

@jaggzh From all the datasets mentioned in this issue, xtreme is the only one that stores the data file checksums in the metadata. So, the checksum check has no effect when enabled for the rest of the datasets.

jaggzh commented 10 months ago

(I don't have any .incomplete files, just the extraction errors.) I was going through the code to try to relate filenames to the hex/hash files, but realized I might not need to. So instead I coded up a script in bash to examine the tar files for validity (I had an issue with bash subshells not adding to my array, so I had cgpt recode it in Perl).

#!/usr/bin/perl
use strict;
use warnings;

# Initialize the array to store tar files
my @tars;

# Open the current directory
opendir(my $dh, '.') or die "Cannot open directory: $!";

# Read files in the current directory
while (my $f = readdir($dh)) {
    # Skip files ending with lock, json, or py
    next if $f =~ /\.(lock|json|py)$/;

    # Use the `file` command to determine the type of file
    my $ft = `file "$f"`;

    # If it's a tar archive, add it to the list
    if ($ft =~ /tar archive/) {
        push @tars, $f;
    }
}

closedir($dh);

print "Final Tars count: " . scalar(@tars) . "\n";

# Iterate over the tar files and check them
foreach my $i (0 .. $#tars) {
    my $f = $tars[$i];
    printf '%d/%d ', $i+1, scalar(@tars);

    # Use `ls -lgG` to list the files, similar to the original bash script
    system("ls -lgG '$f'");

    # Check the integrity of the tar file
    my $errfn = "/tmp/$f.tarerr";
    if (system("tar tf '$f' > /dev/null 2> '$errfn'") != 0) {
        print "  BAD $f\n";
        print "  ERR: ";
        system("cat '$errfn'");
    }

    # Remove the error file if it exists
    unlink $errfn if -e $errfn;
}

This found one hash file that errored during tar extraction, and one small tmp* file that was also supposedly a tar and was erroring. I removed those two and re-ran the data load... it grabbed just what it needed and I'm on my way. Yay!

So... is there a way for the datasets api to get file sizes? That would be a very easy and fast test, leaving checksum slowdowns for extra-messed-up situations.

RuntimeRacer commented 10 months ago

@RuntimeRacer Updating urllib3 and requests to the latest versions fixes the issue explained in this blog post.

However, the issue explained above seems more similar to this one. To address it, we can reduce the default timeout to 10 seconds (btw, this was the initial value, but it was causing problems for some users) and expose a config variable so that users can easily control it. Additionally, we can re-run http_get similarly to huggingface/huggingface_hub#1766 when the connection/timeout error happens to make the logic even more robust. Would this work for you? The last part is what you did in the PR, right?

@jaggzh From all the datasets mentioned in this issue, xtreme is the only one that stores the data file checksums in the metadata. So, the checksum check has no effect when enabled for the rest of the datasets.

@mariosasko Well, if you look at my commit date, you will see that I ran into this problem as recently as October. The blog post you mention and the corresponding update in the pull request for urllib were from July: https://github.com/psf/requests/issues/4956#issuecomment-1648632935

But yeah, the issue on StackOverflow you mentioned does look like the source of what I was running into. I experimented with timeouts, but changing them unfortunately didn't resolve the starving-connection issue. However, https://github.com/huggingface/huggingface_hub/pull/1766 looks like it could work; it's very similar to my change. So yes, I think that would probably fix it.

Also, I can confirm that the checksum option did not work for reazonspeech either. So maybe it's a double edge case that only occurs for some datasets. 🤷‍♂️

jaggzh commented 10 months ago

Also, the hf urls to files -- while I can't see a way of getting a listing from the hf site side -- do include the file size in the http header response. So we do have a quick way of just verifying lengths for resume. (This message may not be interesting to you all).

First, a json clip (mozilla-foundation___common_voice_11_0/en/11.0.0/3f27acf10f303eac5b6fbbbe02495aeddb46ecffdb0a2fe3507fcfbf89094631/dataset_info.json):

~/.cache/huggingface/datasets/downloads$ ls -lgG b45f82cb87bab2c35361857fcd46042ab658b42c37dc9a455248c2866c9b8f40* | cut -c 14-
2110853120 Nov  1 16:28 b45f82cb87bab2c35361857fcd46042ab658b42c37dc9a455248c2866c9b8f40
148 Nov  1 16:28 b45f82cb87bab2c35361857fcd46042ab658b42c37dc9a455248c2866c9b8f40.json
0 Nov  1 16:07 b45f82cb87bab2c35361857fcd46042ab658b42c37dc9a455248c2866c9b8f40.lock
$ curl -I -L https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0/resolve/main/audio/en/invalidated/en_invalidated_3.tar
HTTP/2 302 
content-type: text/plain; charset=utf-8
content-length: 1215
location: https://cdn-lfs.huggingface.co/repos/00/ce/00ce867b4ae70bd23a10b60c32a8626d87b2666fc088ad03f86b94788faff554/984086fc250badece2992e8be4d7c4430f7c1208fb8bf37dc7c4aecdc803b220?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27en_invalidated_3.tar%3B+filename%3D%22en_invalidated_3.tar%22%3B&response-content-type=application%2Fx-tar&Expires=1699389040&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTY5OTM4OTA0MH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5odWdnaW5nZmFjZS5jby9yZXBvcy8wMC9jZS8wMGNlODY3YjRhZTcwYmQyM2ExMGI2MGMzMmE4NjI2ZDg3YjI2NjZmYzA4OGFkMDNmODZiOTQ3ODhmYWZmNTU0Lzk4NDA4NmZjMjUwYmFkZWNlMjk5MmU4YmU0ZDdjNDQzMGY3YzEyMDhmYjhiZjM3ZGM3YzRhZWNkYzgwM2IyMjA%7EcmVzcG9uc2UtY29udGVudC1kaXNwb3NpdGlvbj0qJnJlc3BvbnNlLWNvbnRlbnQtdHlwZT0qIn1dfQ__&Signature=WYc32e75PqbKSAv3KTpG86ooFT6oOyDDQpCt1i2B8gVS10J3qvpZlDmxaBgnGlCCl7SRiAvhIQctgwooNtWbUeDqK3T4bAo0-OOrGCuVi-%7EKWUBcoHce7nHWpl%7Ex9ubHS%7EFoYcGB2SCEqh5fIgGjNV-VKRX6TSXkRto5bclQq4VCJKHufDsJ114A1V4Qu%7EYiRIWKG4Gi93Xv4OFhyWY0uqykvP5c0x02F%7ELX0m3WbW-eXBk6Fw2xnV1XLrEkdR-9Ax2vHqMYIIw6yV0wWEc1hxE393P9mMG1TNDj%7EXDuCoOaA7LbrwBCxai%7Ew2MopdPamTXyOia5-FnSqEdsV29v4Q__&Key-Pair-Id=KVTP0A1DKRTAX
date: Sat, 04 Nov 2023 20:30:40 GMT
x-powered-by: huggingface-moon
x-request-id: Root=1-6546a9f0-5e7f729d09bdb38e35649a7e
access-control-allow-origin: https://huggingface.co
vary: Origin, Accept
access-control-expose-headers: X-Repo-Commit,X-Request-Id,X-Error-Code,X-Error-Message,ETag,Link,Accept-Ranges,Content-Range
x-repo-commit: 23b4059922516c140711b91831aa3393a22e9b80
accept-ranges: bytes
x-linked-size: 2110853120
x-linked-etag: "984086fc250badece2992e8be4d7c4430f7c1208fb8bf37dc7c4aecdc803b220"
x-cache: Miss from cloudfront
via: 1.1 f31a6426ebd75ce4393909b12f5cbdcc.cloudfront.net (CloudFront)
x-amz-cf-pop: LAX53-P4
x-amz-cf-id: BcYMFcHVcxPome2IjAvx0ZU90G41QlNI_HEHDGDqCQaEPvrOsnsGXw==

HTTP/2 200 
content-type: application/x-tar
content-length: 2110853120
date: Sat, 04 Nov 2023 20:19:35 GMT
last-modified: Fri, 18 Nov 2022 15:08:22 GMT
etag: "acac28988e2f7e73b68e865179fbd008"
x-amz-storage-class: INTELLIGENT_TIERING
x-amz-version-id: LgTuOcd9FGN4JnAXp26O.1v2VW42GPtF
content-disposition: attachment; filename*=UTF-8''en_invalidated_3.tar; filename="en_invalidated_3.tar";
accept-ranges: bytes
server: AmazonS3
x-cache: Hit from cloudfront
via: 1.1 d07c8167eda81d307ca96358727f505e.cloudfront.net (CloudFront)
x-amz-cf-pop: LAX50-P5
x-amz-cf-id: 6oNZg_V8U1M_JXsMHQAPuRmDfxbY2BnMUWcVH0nz3VnfEZCzF5lgkQ==
age: 666
cache-control: public, max-age=604800, immutable, s-maxage=604800
vary: Origin
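
Building on that, a rough sketch of such a length check over the downloads cache. It assumes the <hash>.json sidecar files record the original url (as in the cache layout shown above) and that the files are reachable without authentication; gated datasets would additionally need an Authorization header:

import json
from pathlib import Path

import requests

downloads = Path.home() / ".cache/huggingface/datasets/downloads"

for meta in downloads.glob("*.json"):
    data_file = meta.with_suffix("")  # the payload sits next to its <hash>.json sidecar
    url = json.loads(meta.read_text()).get("url")
    if not url or not data_file.is_file():
        continue
    # HEAD follows the redirect to the CDN, which reports the real content-length.
    head = requests.head(url, allow_redirects=True, timeout=30)
    expected = int(head.headers.get("content-length", -1))
    actual = data_file.stat().st_size
    if expected > 0 and expected != actual:
        print(f"SIZE MISMATCH: {data_file} is {actual} bytes, server reports {expected}")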