Closed forrestbao closed 4 years ago
Thanks for reporting. Could you try downloading the file manually and running a small Python script to extract it, to see if you get the same error?
path = '/path/to/file.zip'
with tf.io.gfile.open(path, 'rb') as fobj:
    z = zipfile.ZipFile(fobj)
    for member in z.infolist():
        extract_file = z.open(member)
        print(member.filename)
I got the same error as the original poster, and the same error when extracting the file with the suggested Python script (tried Python 3.6 and 3.7). Just running unzip on the file gives me the data without a problem.
Copying the stacktrace here for reference.
Downloading and preparing dataset scientific_papers (4.20 GiB) to /home/forrest/tensorflow_datasets/scientific_papers/arxiv/1.1.0...
Extraction completed...: 0%| | 0/2 [00:07<?, ? file/s]
Dl Size...: 0 MiB [00:07, ? MiB/s]
Dl Completed...: 0 url [00:07, ? url/s] | 0/2 [00:00<?, ? file/s]
Traceback (most recent call last):
File "/home/forrest/.local/lib/python3.6/site-packages/tensorflow_datasets/core/download/extractor.py", line 92, in _sync_extract
_copy(handle, path and os.path.join(to_path_tmp, path) or to_path_tmp)
File "/home/forrest/.local/lib/python3.6/site-packages/tensorflow_datasets/core/download/extractor.py", line 111, in _copy
data = src_file.read(io.DEFAULT_BUFFER_SIZE)
File "/usr/lib/python3.6/zipfile.py", line 872, in read
data = self._read1(n)
File "/usr/lib/python3.6/zipfile.py", line 962, in _read1
self._update_crc(data)
File "/usr/lib/python3.6/zipfile.py", line 890, in _update_crc
raise BadZipFile("Bad CRC-32 for file %r" % self.name)
zipfile.BadZipFile: Bad CRC-32 for file 'arxiv-release/train.txt'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "tfds_download.py", line 5, in <module>
_ = tfds.load(dataset)
File "/home/forrest/.local/lib/python3.6/site-packages/tensorflow_datasets/core/api_utils.py", line 52, in disallow_positional_args_dec
return fn(*args, **kwargs)
File "/home/forrest/.local/lib/python3.6/site-packages/tensorflow_datasets/core/registered.py", line 302, in load
dbuilder.download_and_prepare(**download_and_prepare_kwargs)
File "/home/forrest/.local/lib/python3.6/site-packages/tensorflow_datasets/core/api_utils.py", line 52, in disallow_positional_args_dec
return fn(*args, **kwargs)
File "/home/forrest/.local/lib/python3.6/site-packages/tensorflow_datasets/core/dataset_builder.py", line 322, in download_and_prepare
download_config=download_config)
File "/home/forrest/.local/lib/python3.6/site-packages/tensorflow_datasets/core/dataset_builder.py", line 969, in _download_and_prepare
max_examples_per_split=download_config.max_examples_per_split,
File "/home/forrest/.local/lib/python3.6/site-packages/tensorflow_datasets/core/dataset_builder.py", line 825, in _download_and_prepare
for split_generator in self._split_generators(dl_manager):
File "/home/forrest/.local/lib/python3.6/site-packages/tensorflow_datasets/summarization/scientific_papers.py", line 108, in _split_generators
dl_paths = dl_manager.download_and_extract(_URLS)
File "/home/forrest/.local/lib/python3.6/site-packages/tensorflow_datasets/core/download/download_manager.py", line 374, in download_and_extract
return _map_promise(self._download_extract, url_or_urls)
File "/home/forrest/.local/lib/python3.6/site-packages/tensorflow_datasets/core/download/download_manager.py", line 415, in _map_promise
res = utils.map_nested(_wait_on_promise, all_promises)
File "/home/forrest/.local/lib/python3.6/site-packages/tensorflow_datasets/core/utils/py_utils.py", line 145, in map_nested
for k, v in data_struct.items()
File "/home/forrest/.local/lib/python3.6/site-packages/tensorflow_datasets/core/utils/py_utils.py", line 145, in <dictcomp>
for k, v in data_struct.items()
File "/home/forrest/.local/lib/python3.6/site-packages/tensorflow_datasets/core/utils/py_utils.py", line 159, in map_nested
return function(data_struct)
File "/home/forrest/.local/lib/python3.6/site-packages/tensorflow_datasets/core/download/download_manager.py", line 399, in _wait_on_promise
return p.get()
File "/home/forrest/.local/lib/python3.6/site-packages/promise/promise.py", line 512, in get
return self._target_settled_value(_raise=True)
File "/home/forrest/.local/lib/python3.6/site-packages/promise/promise.py", line 516, in _target_settled_value
return self._target()._settled_value(_raise)
File "/home/forrest/.local/lib/python3.6/site-packages/promise/promise.py", line 226, in _settled_value
reraise(type(raise_val), raise_val, self._traceback)
File "/home/forrest/.local/lib/python3.6/site-packages/six.py", line 696, in reraise
raise value
File "/home/forrest/.local/lib/python3.6/site-packages/promise/promise.py", line 844, in handle_future_result
resolve(future.result())
File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
return self.__get_result()
File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/forrest/.local/lib/python3.6/site-packages/tensorflow_datasets/core/download/extractor.py", line 96, in _sync_extract
raise ExtractError(msg)
tensorflow_datasets.core.download.extractor.ExtractError: Error while extracting /home/forrest/tensorflow_datasets/downloads/ucexport_download_id_1K2kDBTNXS2ikx9xKmi2_Lls0d7kG6NhLh-6Qt4tNUbCaluqBYTWYYmp4CluxakNVBI4 to /home/forrest/tensorflow_datasets/downloads/extracted/ZIP.ucexport_download_id_1K2kDBTNXS2ikx9xKmi2_Lls0d7kG6NhLh-6Qt4tNUbCaluqBYTWYYmp4CluxakNVBI4 (file: arxiv-release/train.txt) : Bad CRC-32 for file 'arxiv-release/train.txt'
@Conchylicultor I manually downloaded the data and ran the script above:

path = '/path/to/file.zip'
with tf.io.gfile.open(path, 'rb') as fobj:
    z = zipfile.ZipFile(fobj)
    for member in z.infolist():
        extract_file = z.open(member)
        print(member.filename)

I got

AttributeError: module 'tensorflow_core._api.v2.io.gfile' has no attribute 'open'

so I checked and replaced tf.io.gfile.open with tf.io.gfile.GFile, and it worked. After running the script on the manually downloaded scientific_papers data, the output is:

pubmed-release/
pubmed-release/train.txt
__MACOSX/
__MACOSX/pubmed-release/
__MACOSX/pubmed-release/._train.txt
pubmed-release/vocab
__MACOSX/pubmed-release/._vocab
pubmed-release/test.txt
__MACOSX/pubmed-release/._test.txt
pubmed-release/val.txt
__MACOSX/pubmed-release/._val.txt
__MACOSX/._pubmed-release
This means the script runs successfully, but I still get the Bad CRC-32 error when using tfds.load("scientific_papers"). So is there anything we can change in the iter_zip function in core/extractor.py? If so, I would like to work on it.
You should use tf.io.gfile.GFile instead of open.
Yes, I am using tf.io.gfile.GFile, but I get the Bad CRC-32 error when I don't download the data manually and instead use tfds.load().
@Conchylicultor I think there is nothing wrong with the code itself; the train.txt file in the archive is corrupted. When I download the data and delete train.txt from it, everything runs fine.
@Eshan-Agarwal if you understand the bug and know a way to fix it, then please send a PR if you have time. I'm not sure I understand what the bug is.
Okay I will try.
@Conchylicultor I found that in the scientific_papers dataset zip file, train.txt is corrupted, which is why the Bad CRC-32 error is raised, but I can still extract the data with the WinRAR app. So I think there are two options to resolve this issue: one is to raise a more informative exception saying the data is corrupted; the other is that I can upload it to Drive and update scientific_papers.py and the checksums file. What should I do?
It seems someone got the same issue with the same url: https://stackoverflow.com/questions/56204755/zipfile-extractall-raising-badzipfile-bad-crc-32-for-file-error
I'm wondering if this only happens for a specific version of Python/zip lib because it seems extraction worked before.
the other is that I can upload it to Drive and update

We are not allowed to redistribute datasets from others ourselves. Thank you for contacting the original authors. Hopefully they will update their file.
@Conchylicultor Thanks for helping; I hope they update the data soon.
@Conchylicultor They updated the data. I will send you a PR today with the updated links and checksums.
Thank you for fixing this.
@forrestbao May I ask how you successfully downloaded big_patent using tfds.load? I tried ds = tfds.load('big_patent'), but it threw an error as follows:

ValueError: The dataset you're trying to generate is using Apache Beam. Beam datasets are usually very large and should be generated separately. Please have a look at https://www.tensorflow.org/datasets/beam_datasets#generating_a_beam_dataset for instructions.

I tried using Apache Beam but I got OOM on my server. #2146
Thanks!
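For reference, Beam datasets like big_patent can be generated locally by passing Beam pipeline options through tfds.download.DownloadConfig, per the linked TFDS Beam guide. This is only a configuration sketch: the DirectRunner and worker count are assumptions, and generation still needs a machine with enough memory and disk for the full dataset.

```python
import apache_beam as beam
import tensorflow_datasets as tfds

# Run the generation pipeline locally with Beam's DirectRunner.
beam_options = beam.options.pipeline_options.PipelineOptions(
    flags=['--runner=DirectRunner', '--direct_num_workers=4'])

builder = tfds.builder('big_patent')
builder.download_and_prepare(
    download_config=tfds.download.DownloadConfig(beam_options=beam_options))
```

For very large datasets the guide recommends a distributed runner (e.g. Dataflow) instead, which avoids the single-machine OOM.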
@jieralice13 I think the problem has been fixed now.
Short description
I have been using the tfds.load function to download and split many other summarization datasets, billsum, cnn_dailymail, and big_patent, without any problem. When trying to do so with scientific_papers, I ran into a Bad CRC-32 error in the extraction step. However, I could manually unzip the file without any CRC-32 error. This happens to both arxiv and pubmed under scientific_papers. This might be a bug with Python 3.6's zip extractor.
Environment information
tensorflow-datasets/tfds-nightly version: tensorflow-datasets 1.3.2
tensorflow/tensorflow-gpu/tf-nightly/tf-nightly-gpu version: tensorflow 2.0.0
Reproduction instructions
Link to logs traceback.log
Expected behavior There shouldn't be errors, but messages that the dataset has been successfully extracted, shuffled, and split.