Closed: simran-khanuja closed this issue 1 year ago.
Hi! I cannot reproduce this error on my machine.
The raised error could mean that one of the downloaded files is corrupted. To verify this is not the case, you can run load_dataset as follows:
train_dataset = load_dataset('xtreme', 'udpos.English', split="train", cache_dir=args.cache_dir, download_mode="force_redownload", verification_mode="all_checks")
Hi! Apologies for the delayed response! I tried the above and it doesn't solve the issue. Actually, the dataset gets downloaded most times, but sometimes this error occurs (at random, as far as I can tell). Is it possible that there is a server issue for this particular dataset? I am able to download other datasets using the same code on the same machine with no issues :( This is the error I get now:
Downloading data: 16%|███████████████▌ | 55.9M/355M [04:45<25:25, 196kB/s]
Traceback (most recent call last):
File "/home/skhanuja/Optimal-Resource-Allocation-for-Multilingual-Finetuning/src/train_al.py", line 1107, in <module>
main()
File "/home/skhanuja/Optimal-Resource-Allocation-for-Multilingual-Finetuning/src/train_al.py", line 439, in main
en_dataset = load_dataset("xtreme", "udpos.English", split="train", download_mode="force_redownload", verification_mode="all_checks")
File "/home/skhanuja/miniconda3/envs/multilingual_ft/lib/python3.10/site-packages/datasets/load.py", line 1782, in load_dataset
builder_instance.download_and_prepare(
File "/home/skhanuja/miniconda3/envs/multilingual_ft/lib/python3.10/site-packages/datasets/builder.py", line 872, in download_and_prepare
self._download_and_prepare(
File "/home/skhanuja/miniconda3/envs/multilingual_ft/lib/python3.10/site-packages/datasets/builder.py", line 1649, in _download_and_prepare
super()._download_and_prepare(
File "/home/skhanuja/miniconda3/envs/multilingual_ft/lib/python3.10/site-packages/datasets/builder.py", line 949, in _download_and_prepare
verify_checksums(
File "/home/skhanuja/miniconda3/envs/multilingual_ft/lib/python3.10/site-packages/datasets/utils/info_utils.py", line 62, in verify_checksums
raise NonMatchingChecksumError(
datasets.utils.info_utils.NonMatchingChecksumError: Checksums didn't match for dataset source files:
['https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-3105/ud-treebanks-v2.5.tgz']
Set `verification_mode='no_checks'` to skip checksums verification and ignore this error
If this happens randomly, then this means the data file from the error message is not always downloaded correctly. The only solution in this scenario is to download the dataset again by passing download_mode="force_redownload" to the load_dataset call.
Wow. So I effectively have to re-download a 1 TB dataset because 3% of its parts are broken?
Why is this downloader library so poorly built and badly documented? I found almost nothing online about the problem until I finally landed on this issue. I can't express how disappointed I am by this dataset tool from Hugging Face, which I sadly have to use because HF is the only place where the dataset I plan to work with is hosted...
I mean... verifying checksums after download, or detecting a timed-out part, and re-downloading on mismatch... that's the content of every junior developer training session.
I added verification_mode="all_checks". It really did calculate checksums for 4096 parts of ~350 MB each... but then did nothing with the result and still tried to extract, hitting the error again.
EDIT: Apparently it can be fixed with a little help: just delete the broken parts and their associated files from ~/.cache/huggingface/datasets/downloads
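To find which parts are broken without guessing, each cached archive can be test-read with the standard library. A sketch under the assumption that the corrupted downloads are plain tar files; the helper name and scan logic are mine:

```python
import os
import tarfile

def find_truncated_tars(directory):
    """Return paths of tar archives in `directory` that cannot be read
    to the end (i.e. are likely truncated downloads)."""
    bad = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path) or name.endswith((".lock", ".json", ".py")):
            continue
        if not tarfile.is_tarfile(path):
            continue
        try:
            with tarfile.open(path) as tar:
                for member in tar:
                    if member.isfile():
                        # reading the data is what surfaces truncation
                        tar.extractfile(member).read()
        except tarfile.TarError:
            bad.append(path)
    return bad

# Hypothetical usage: delete what it finds, then re-run load_dataset.
# for p in find_truncated_tars(os.path.expanduser(
#         "~/.cache/huggingface/datasets/downloads")):
#     os.remove(p)
```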
I'm getting it too, although just retrying fixed it for me. Nevertheless, the dataset is too large to re-download in full when it's probably just one file with an issue. It would be good to know if there's a way people could manually examine the files (first for sizes, then possibly checksums), going to the web or elsewhere to compare and correct them by hand if ever needed.
Okay, no, it got further but it is repeatedly giving me:
result["audio"] = {"path": path, "bytes": file.read()}
^^^^^^^^^^^
File "/usr/lib/python3.11/tarfile.py", line 687, in read
raise ReadError("unexpected end of data")
tarfile.ReadError: unexpected end of data
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/jaggz/src/transformers/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py", line 625, in <module>
main()
File "/home/jaggz/src/transformers/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py", line 360, in main
raw_datasets["train"] = load_dataset(
^^^^^^^^^^^^^
File "/home/jaggz/venvs/pynow/lib/python3.11/site-packages/datasets/load.py", line 2153, in load_dataset
builder_instance.download_and_prepare(
File "/home/jaggz/venvs/pynow/lib/python3.11/site-packages/datasets/builder.py", line 954, in download_and_prepare
self._download_and_prepare(
File "/home/jaggz/venvs/pynow/lib/python3.11/site-packages/datasets/builder.py", line 1717, in _download_and_prepare
super()._download_and_prepare(
File "/home/jaggz/venvs/pynow/lib/python3.11/site-packages/datasets/builder.py", line 1049, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/home/jaggz/venvs/pynow/lib/python3.11/site-packages/datasets/builder.py", line 1555, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "/home/jaggz/venvs/pynow/lib/python3.11/site-packages/datasets/builder.py", line 1712, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset
@RuntimeRacer
EDIT: Apparently it can be fixed with a little help: just delete the broken parts and associated files from ~/.cache/huggingface/datasets/downloads
How do you know which parts are broken? Mine errors consistently, and... yeah, this tool really should be able to check the files itself (but where is that even done?)...
2023-11-02 00:14:09.846055: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
/home/j/src/transformers/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py:299: FutureWarning: The use_auth_token argument is deprecated and will be removed in v4.34. Please use token instead.
warnings.warn(
11/02/2023 00:14:37 - WARNING - main - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: False, 16-bits training: True
11/02/2023 00:14:37 - INFO - main - Training/evaluation parameters Seq2SeqTrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
...
logging_dir=./whisper-tiny-en/runs/Nov02_00-14-28_jsys,
...
run_name=./whisper-tiny-en,
...
weight_decay=0.0,
)
11/02/2023 00:14:37 - INFO - main - Training/evaluation parameters Seq2SeqTrainingArguments(
_n_gpu=1,
adafactor=False,
...
logging_dir=./whisper-tiny-en/runs/Nov02_00-14-28_jsys,
...
weight_decay=0.0,
)
Downloading data files: 100%|██████████| 5/5 [00:00<00:00, 2426.42it/s]
Extracting data files: 100%|██████████| 5/5 [00:00<00:00, 421.16it/s]
Downloading data files: 100%|██████████| 5/5 [00:00<00:00, 18707.87it/s]
Extracting data files: 100%|██████████| 5/5 [00:00<00:00, 3754.97it/s]
Generating train split: 0 examples [00:00, ? examples/s]
Reading metadata...: 948736it [00:23, 40632.92it/s]
Generating train split: 948736 examples [08:28, 1866.15 examples/s]
Generating validation split: 0 examples [00:00, ? examples/s]
Reading metadata...: 16354it [00:00, 158233.27it/s]
Generating validation split: 16354 examples [00:14, 1154.77 examples/s]
Generating test split: 0 examples [00:00, ? examples/s]
Reading metadata...: 16354it [00:00, 194855.03it/s]
Generating test split: 16354 examples [00:07, 2105.43 examples/s]
Generating other split: 0 examples [00:00, ? examples/s]
Reading metadata...: 290846it [00:01, 235823.90it/s]
Generating other split: 290846 examples [02:12, 2196.96 examples/s]
Generating invalidated split: 0 examples [00:00, ? examples/s]
Reading metadata...: 252599it [00:01, 241965.85it/s]
Generating invalidated split: 60130 examples [00:34, 1764.14 examples/s]
Traceback (most recent call last):
File "/home/j/venvs/pycur/lib/python3.11/site-packages/datasets/builder.py", line 1676, in _prepare_split_single
for key, record in generator:
File "/home/j/.cache/huggingface/modules/datasets_modules/datasets/mozilla-foundation--common_voice_11_0/3f27acf10f303eac5b6fbbbe02495aeddb46ecffdb0a2fe3507fcfbf89094631/common_voice_11_0.py", line 195, in _generate_examples
result["audio"] = {"path": path, "bytes": file.read()}
^^^^^^^^^^^
File "/usr/lib/python3.11/tarfile.py", line 687, in read
raise ReadError("unexpected end of data")
tarfile.ReadError: unexpected end of data
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/j/src/transformers/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py", line 625, in
@jaggzh Hi, I actually came up with a fix for this. It wasn't easy to solve since there were a lot of hidden pitfalls in the code, and it's quite hacky, but I was able to download the full dataset.
I just didn't create a PR for it yet since I was too lazy to create a fork and change my local repo's origin. 😅 Let me try to do this tonight; I'll give you a ping once it's up.
EDIT: And no, what I wrote above about adding a param to the download config does NOT solve it. A code fix is required here.
@jaggzh PR is up: https://github.com/huggingface/datasets/pull/6380
🤞 on approval for merge to the main repo.
@mariosasko Can you re-open this? We really need some better diagnostics output, at the least, to locate which files are contributing, some checksum output, etc. I can't even tell if this is a mozilla...py issue or huggingface datasets or ....
@RuntimeRacer Beautiful, thank you so much. I patched with your PR and am re-running now. (I'm running this script: https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py) Okay, actually it failed; so now I'm re-running with verification_mode='all_checks' added to the load_dataset() call. Wish me luck. (Note: it's generating checksums; I don't see an option between basic_checks and all_checks. Something that checks downloaded files' lengths would be a good common fix, I'd think; corruption is rarer nowadays than a short file, although maybe your patch prevents that in the first place.) :}
@RuntimeRacer No luck. Sigh. [Edit: My tmux copy didn't get some data. That was weird. I'm adding in the initial part of the output:]
Downloading data files: 100%|██████████| 5/5 [00:00<00:00, 2190.69it/s]
Computing checksums: 100%|██████████| 41/41 [11:39<00:00, 17.05s/it]
Extracting data files: 100%|██████████| 5/5 [00:00<00:00, 12.37it/s]
Downloading data files: 100%|██████████| 5/5 [00:00<00:00, 107.64it/s]
Extracting data files: 100%|██████████| 5/5 [00:00<00:00, 3149.82it/s]
Reading metadata...: 948736it [00:03, 243227.36it/s]
...
...
Reading metadata...: 252599it [00:01, 249267.71it/s]
Generating invalidated split: 60130 examples [00:31, 1916.33 examples/s]
Traceback (most recent call last):
File "/home/j/src/py/datasets/src/datasets/builder.py", line 1676, in _prepare_split_single
for key, record in generator:
File "/home/j/.cache/huggingface/modules/datasets_modules/datasets/mozilla-foundation--common_voice_11_0/3f27acf10f303eac5b6fbbbe02495aeddb46ecffdb0a2fe3507fcfbf89094631/common_voice_11_0.py", line 195, in _generate_examples
result["audio"] = {"path": path, "bytes": file.read()}
^^^^^^^^^^^
File "/usr/lib/python3.11/tarfile.py", line 687, in read
raise ReadError("unexpected end of data")
tarfile.ReadError: unexpected end of data
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/j/src/transformers/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py", line 627, in <module>
main()
File "/home/j/src/transformers/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py", line 360, in main
raw_datasets["train"] = load_dataset(
^^^^^^^^^^^^^
File "/home/j/src/py/datasets/src/datasets/load.py", line 2153, in load_dataset
builder_instance.download_and_prepare(
File "/home/j/src/py/datasets/src/datasets/builder.py", line 954, in download_and_prepare
self._download_and_prepare(
File "/home/j/src/py/datasets/src/datasets/builder.py", line 1717, in _download_and_prepare
super()._download_and_prepare(
File "/home/j/src/py/datasets/src/datasets/builder.py", line 1049, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/home/j/src/py/datasets/src/datasets/builder.py", line 1555, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "/home/j/src/py/datasets/src/datasets/builder.py", line 1712
I'm unable to reproduce this error. Based on https://github.com/psf/requests/issues/4956, newer releases of urllib3 check the returned content length by default, so perhaps updating requests and urllib3 to the latest versions (pip install -U requests urllib3) and loading the dataset with datasets.load_dataset("xtreme", "udpos.English", download_config=datasets.DownloadConfig(resume_download=True)) (re-running when it fails, to resume the download) can fix the issue.
@jaggzh I think you will need to re-download the whole dataset with my patched code. Files which have already been downloaded and marked as complete by the broken downloader won't be detected even on re-run (I described that in the PR). I also had to download reazonspeech, which is over 1TB, twice. 🙈 For re-download, you need to manually delete the dataset files from your local machine's huggingface download cache.
@mariosasko Not sure how you tested it, but it's not an issue in requests or urllib. The problem is the huggingface downloader, which I think spawns a nested download thread for the actual download.
The issue I had with the reazonspeech dataset (https://huggingface.co/datasets/reazon-research/reazonspeech/tree/main) was basically that it started downloading a part, but sometimes the connection would 'starve', continuing with only a few kilobytes and eventually receiving no data at all.
Sometimes it would even recover during the download and finish properly.
However, if it did not recover, the request would hit the really generous default timeout (which is 100 seconds, I think); the exception thrown by the failure inside urllib isn't captured or handled by the upper-level downloader code of the datasets library.
datasets even has a retry mechanism, which continues interrupted downloads if they have the .incomplete suffix; the suffix isn't cleared if, for example, a manual CTRL+C is sent by the user to the Python process.
But: if it runs into the edge case I described above (TL;DR: the connection starves after minutes, plus a timeout exception that isn't captured), the cache downloader considers the download successful and removes the .incomplete suffix anyway, leaving the archive file in a corrupted state.
Honestly, I spent hours trying to figure out what was even going on and why the retry mechanics of the cache downloader didn't work at all. But it is indeed an issue caused by the download process itself not receiving any info about the actual content size versus the archive's size on disk, and thus having no direct control when something fails at the request level.
IMHO, this requires a major refactor of the way this part of the downloader works. Yet I was able to quick-fix it by adding some synthetic exception handling and explicit retry handling in the code, as done in my PR.
@RuntimeRacer Ugh. It took a day. I'm seeing if I can get some debug code in here to examine the files myself. (I'm not sure why checksum tests would fail, so, yeah, I think you're right -- this stuff needs some work. Going through ipdb right now to try to get some idea of what's going on in the code).
@RuntimeRacer Data can only be appended to the .incomplete files if load_dataset is called with download_config=DownloadConfig(resume_download=True).
Where exactly does this exception happen (in the code)? The error stack trace would help a lot.
@mariosasko I do not have a trace of this exception nor do I know which type it is. I am honestly not even sure if an exception is thrown, or the process just aborts without error.
@RuntimeRacer Data can only be appended to the .incomplete files if load_dataset is called with download_config=DownloadConfig(resume_download=True).
Well, I think I gave a very clear explanation of the issue in the PR I shared and in the description above, but maybe I wasn't precise enough. Let me try to explain once more:
What you mention here is the "normal" case, when the process is aborted. In that case there will be files with the .incomplete suffix, which the cache downloader can continue to download. That is correct.
BUT: what I am talking about the whole time is an edge case: if the download step crashes or times out internally, the cache downloader will NOT be aware of this and REMOVES the .incomplete suffix.
It does NOT know that the file is incomplete when the http_get function returns; it removes the .incomplete suffix in any case once http_get returns.
But the problem is that http_get returns without failure even if the download failed.
And this is still a problem even with the latest urllib and requests libraries.
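The fix described here boils down to checking the bytes on disk against the size the server advertised before the .incomplete suffix is dropped. A minimal sketch of that idea; verify_complete_download is a hypothetical helper, not the actual datasets code:

```python
import os

def verify_complete_download(path, expected_size):
    """Return True only if the file at `path` exists and matches the
    size the server advertised (e.g. via the Content-Length header)."""
    return (
        expected_size is not None
        and os.path.isfile(path)
        and os.path.getsize(path) == expected_size
    )

# Sketch of how a downloader could use it: keep the .incomplete suffix
# (and retry) unless the size check passes.
# if verify_complete_download(tmp_path, expected_size):
#     os.rename(tmp_path, final_path)
# else:
#     retry_download(tmp_path)  # hypothetical retry hook
```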
@RuntimeRacer Updating urllib3 and requests to the latest versions fixes the issue explained in this blog post.
However, the issue explained above seems more similar to this one. To address it, we can reduce the default timeout to 10 seconds (btw, this was the initial value, but it was causing problems for some users) and expose a config variable so that users can easily control it. Additionally, we can re-run http_get similarly to https://github.com/huggingface/huggingface_hub/pull/1766 when the connection/timeout error happens to make the logic even more robust. Would this work for you? The last part is what you did in the PR, right?
@jaggzh Of all the datasets mentioned in this issue, xtreme is the only one that stores the data file checksums in its metadata. So the checksum check has no effect when enabled for the rest of the datasets.
(I don't have any .incomplete files, just the extraction errors.) I was going through the code trying to relate filenames to the hex/hash files, but realized I might not need to. So instead I wrote a bash script to examine the tar files for validity (I had an issue with bash subshells not appending to my array, so I had cgpt recode it in Perl).
#!/usr/bin/perl
use strict;
use warnings;
# Initialize the array to store tar files
my @tars;
# Open the current directory
opendir(my $dh, '.') or die "Cannot open directory: $!";
# Read files in the current directory
while (my $f = readdir($dh)) {
# Skip files ending with lock, json, or py
next if $f =~ /\.(lock|json|py)$/;
# Use the `file` command to determine the type of file
my $ft = `file "$f"`;
# If it's a tar archive, add it to the list
if ($ft =~ /tar archive/) {
push @tars, $f;
}
}
closedir($dh);
print "Final Tars count: " . scalar(@tars) . "\n";
# Iterate over the tar files and check them
foreach my $i (0 .. $#tars) {
my $f = $tars[$i];
printf '%d/%d ', $i+1, scalar(@tars);
# Use `ls -lgG` to list the files, similar to the original bash script
system("ls -lgG '$f'");
# Check the integrity of the tar file
my $errfn = "/tmp/$f.tarerr";
if (system("tar tf '$f' > /dev/null 2> '$errfn'") != 0) {
print " BAD $f\n";
print " ERR: ";
system("cat '$errfn'");
}
# Remove the error file if it exists
unlink $errfn if -e $errfn;
}
This found one hash file that errored during tar listing, and one small tmp* file that was supposedly a tar and also errored. I removed those two and re-ran the data load; it grabbed just what it needed and I'm on my way. Yay!
So... is there a way for the datasets API to get file sizes? That would be a very easy and fast test, leaving checksum slowdowns for extra-messed-up situations.
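There may be: dataset_info.json records a num_bytes for every download URL in its download_checksums section, so the cached files can be size-checked locally. A sketch, assuming the cache filename is the SHA-256 hex digest of the URL (the real datasets cache may append an etag-derived suffix, so treat the naming as an assumption):

```python
import hashlib
import json
import os

def size_mismatches(dataset_info_path, downloads_dir):
    """Compare each num_bytes in dataset_info.json's download_checksums
    against the size of the corresponding cached file. Returns a list of
    (url, expected_bytes, actual_bytes_or_None) for every mismatch."""
    with open(dataset_info_path) as f:
        info = json.load(f)
    bad = []
    for url, meta in info.get("download_checksums", {}).items():
        # Naming assumption: cache file = sha256(url) hex digest.
        cached = os.path.join(
            downloads_dir, hashlib.sha256(url.encode("utf-8")).hexdigest())
        actual = os.path.getsize(cached) if os.path.isfile(cached) else None
        if actual != meta.get("num_bytes"):
            bad.append((url, meta.get("num_bytes"), actual))
    return bad
```

This checks only lengths, which is fast; a truncated download shows up immediately without any checksum pass.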
@mariosasko Well, if you look at my commit date, you will see that I still ran into this problem in October. The blog post you mention and the update in the pull request for urllib are from July: https://github.com/psf/requests/issues/4956#issuecomment-1648632935
But yeah, the StackOverflow issue you mentioned does look like the source of what I was running into. I experimented with timeouts, but changing them unfortunately didn't resolve the starving-connection issue. However, https://github.com/huggingface/huggingface_hub/pull/1766 looks like it could work; it's very similar to my change. So yes, I think this would probably fix it.
Also, I can confirm the checksum option did not work for reazonspeech either. So maybe it's a double edge case that only occurs for some datasets. 🤷♂️
Also, the HF file URLs, while I can't see a way of getting a directory listing from the HF site, do include the file size in the HTTP response headers. So we do have a quick way of verifying lengths for resume. (This message may not be interesting to you all.)
First, a json clip (mozilla-foundation___common_voice_11_0/en/11.0.0/3f27acf10f303eac5b6fbbbe02495aeddb46ecffdb0a2fe3507fcfbf89094631/dataset_info.json):
{
"builder_name" : "common_voice_11_0",
...
"config_name" : "en",
"dataset_name" : "common_voice_11_0",
"dataset_size" : 1680793952,
...
"download_checksums" : {
...
"https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0/resolve/main/audio/en/invalidated/en_invalidated_3.tar" : {
"checksum" : null,
"num_bytes" : 2110853120
},
...
~/.cache/huggingface/datasets/downloads$ ls -lgG b45f82cb87bab2c35361857fcd46042ab658b42c37dc9a455248c2866c9b8f40* | cut -c 14-
2110853120 Nov 1 16:28 b45f82cb87bab2c35361857fcd46042ab658b42c37dc9a455248c2866c9b8f40
148 Nov 1 16:28 b45f82cb87bab2c35361857fcd46042ab658b42c37dc9a455248c2866c9b8f40.json
0 Nov 1 16:07 b45f82cb87bab2c35361857fcd46042ab658b42c37dc9a455248c2866c9b8f40.lock
$ curl -I -L https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0/resolve/main/audio/en/invalidated/en_invalidated_3.tar
HTTP/2 302
content-type: text/plain; charset=utf-8
content-length: 1215
location: https://cdn-lfs.huggingface.co/repos/00/ce/00ce867b4ae70bd23a10b60c32a8626d87b2666fc088ad03f86b94788faff554/984086fc250badece2992e8be4d7c4430f7c1208fb8bf37dc7c4aecdc803b220?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27en_invalidated_3.tar%3B+filename%3D%22en_invalidated_3.tar%22%3B&response-content-type=application%2Fx-tar&Expires=1699389040&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTY5OTM4OTA0MH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5odWdnaW5nZmFjZS5jby9yZXBvcy8wMC9jZS8wMGNlODY3YjRhZTcwYmQyM2ExMGI2MGMzMmE4NjI2ZDg3YjI2NjZmYzA4OGFkMDNmODZiOTQ3ODhmYWZmNTU0Lzk4NDA4NmZjMjUwYmFkZWNlMjk5MmU4YmU0ZDdjNDQzMGY3YzEyMDhmYjhiZjM3ZGM3YzRhZWNkYzgwM2IyMjA%7EcmVzcG9uc2UtY29udGVudC1kaXNwb3NpdGlvbj0qJnJlc3BvbnNlLWNvbnRlbnQtdHlwZT0qIn1dfQ__&Signature=WYc32e75PqbKSAv3KTpG86ooFT6oOyDDQpCt1i2B8gVS10J3qvpZlDmxaBgnGlCCl7SRiAvhIQctgwooNtWbUeDqK3T4bAo0-OOrGCuVi-%7EKWUBcoHce7nHWpl%7Ex9ubHS%7EFoYcGB2SCEqh5fIgGjNV-VKRX6TSXkRto5bclQq4VCJKHufDsJ114A1V4Qu%7EYiRIWKG4Gi93Xv4OFhyWY0uqykvP5c0x02F%7ELX0m3WbW-eXBk6Fw2xnV1XLrEkdR-9Ax2vHqMYIIw6yV0wWEc1hxE393P9mMG1TNDj%7EXDuCoOaA7LbrwBCxai%7Ew2MopdPamTXyOia5-FnSqEdsV29v4Q__&Key-Pair-Id=KVTP0A1DKRTAX
date: Sat, 04 Nov 2023 20:30:40 GMT
x-powered-by: huggingface-moon
x-request-id: Root=1-6546a9f0-5e7f729d09bdb38e35649a7e
access-control-allow-origin: https://huggingface.co
vary: Origin, Accept
access-control-expose-headers: X-Repo-Commit,X-Request-Id,X-Error-Code,X-Error-Message,ETag,Link,Accept-Ranges,Content-Range
x-repo-commit: 23b4059922516c140711b91831aa3393a22e9b80
accept-ranges: bytes
x-linked-size: 2110853120
x-linked-etag: "984086fc250badece2992e8be4d7c4430f7c1208fb8bf37dc7c4aecdc803b220"
x-cache: Miss from cloudfront
via: 1.1 f31a6426ebd75ce4393909b12f5cbdcc.cloudfront.net (CloudFront)
x-amz-cf-pop: LAX53-P4
x-amz-cf-id: BcYMFcHVcxPome2IjAvx0ZU90G41QlNI_HEHDGDqCQaEPvrOsnsGXw==
HTTP/2 200
content-type: application/x-tar
content-length: 2110853120
date: Sat, 04 Nov 2023 20:19:35 GMT
last-modified: Fri, 18 Nov 2022 15:08:22 GMT
etag: "acac28988e2f7e73b68e865179fbd008"
x-amz-storage-class: INTELLIGENT_TIERING
x-amz-version-id: LgTuOcd9FGN4JnAXp26O.1v2VW42GPtF
content-disposition: attachment; filename*=UTF-8''en_invalidated_3.tar; filename="en_invalidated_3.tar";
accept-ranges: bytes
server: AmazonS3
x-cache: Hit from cloudfront
via: 1.1 d07c8167eda81d307ca96358727f505e.cloudfront.net (CloudFront)
x-amz-cf-pop: LAX50-P5
x-amz-cf-id: 6oNZg_V8U1M_JXsMHQAPuRmDfxbY2BnMUWcVH0nz3VnfEZCzF5lgkQ==
age: 666
cache-control: public, max-age=604800, immutable, s-maxage=604800
vary: Origin
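Note that picking the right number out of those headers takes a little care: the 302 response's content-length (1215) is just the redirect body, while x-linked-size on the redirect, or content-length on the final 200, carries the real file size (2110853120). A small helper sketch; the function name is mine:

```python
def expected_size_from_headers(headers):
    """Pick the advertised file size out of a dict of lower-cased
    response headers, preferring x-linked-size (seen on Hub LFS
    redirects) over content-length. Returns an int, or None if
    neither header is present and parseable."""
    for key in ("x-linked-size", "content-length"):
        value = headers.get(key)
        if value is not None:
            try:
                return int(value)
            except ValueError:
                pass
    return None
```

When following the redirect manually, using x-linked-size from the first response avoids mistaking the 1215-byte redirect body for the archive size.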
Describe the bug
Hi,
I am facing an error while downloading the xtreme udpos dataset using load_dataset. I have datasets 2.10.1 installed.
Steps to reproduce the bug
Expected behavior
Download the udpos dataset
Environment info
datasets version: 2.10.1