tensorflow / datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
https://www.tensorflow.org/datasets
Apache License 2.0

Bad CRC-32 when extracting scientific_papers #1404

Closed forrestbao closed 4 years ago

forrestbao commented 4 years ago

Short description I have been using the tfds.load function to download and split several other summarization datasets (billsum, cnn_dailymail, and big_patent) without any problem. When trying to do the same with scientific_papers, I ran into a Bad CRC-32 error during the extraction step. However, I can manually unzip the file without any CRC-32 error. This happens for both the arxiv and pubmed configs of scientific_papers. This might be a bug in Python 3.6's zip extractor.

Environment information

Reproduction instructions

import tensorflow_datasets as tfds
_ = tfds.load("scientific_papers")

Link to logs traceback.log

Expected behavior There shouldn't be errors but messages that the dataset has been successfully extracted, shuffled and split.

Conchylicultor commented 4 years ago

Thanks for reporting. Could you try downloading the file manually and running a small Python script to extract it, to see if you get the same error?

import zipfile
import tensorflow as tf

path = '/path/to/file.zip'
with tf.io.gfile.open(path, 'rb') as fobj:
  z = zipfile.ZipFile(fobj)
  for member in z.infolist():
    extract_file = z.open(member)
    print(member.filename)

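(For anyone reproducing this, here is a stdlib-only sketch of the same check — plain open stands in for tf.io.gfile, and each member is read to EOF, which is what triggers zipfile's CRC-32 verification:)

```python
import zipfile

def find_bad_members(path):
    """Read every member of a zip to EOF; return names failing CRC-32.

    zipfile only compares the stored CRC-32 once a member has been read
    to the end, so each member is drained in chunks.
    """
    bad = []
    with open(path, 'rb') as fobj:
        z = zipfile.ZipFile(fobj)
        for member in z.infolist():
            try:
                with z.open(member) as f:
                    while f.read(1 << 16):
                        pass
            except zipfile.BadZipFile:
                bad.append(member.filename)
    return bad
```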
Alex-Fabbri commented 4 years ago

I got the same error as the original poster, and the same error when trying to extract the file with Python as suggested (I tried Python 3.6 and 3.7). Just running unzip on the file gives me the data without a problem.

Conchylicultor commented 4 years ago

Copying the stacktrace here for reference.

Downloading and preparing dataset scientific_papers (4.20 GiB) to /home/forrest/tensorflow_datasets/scientific_papers/arxiv/1.1.0...
Extraction completed...:   0%|                                                     | 0/2 [00:07<?, ? file/s]
Dl Size...: 0 MiB [00:07, ? MiB/s]
Dl Completed...: 0 url [00:07, ? url/s]                                            | 0/2 [00:00<?, ? file/s]
Traceback (most recent call last):
  File "/home/forrest/.local/lib/python3.6/site-packages/tensorflow_datasets/core/download/extractor.py", line 92, in _sync_extract
    _copy(handle, path and os.path.join(to_path_tmp, path) or to_path_tmp)
  File "/home/forrest/.local/lib/python3.6/site-packages/tensorflow_datasets/core/download/extractor.py", line 111, in _copy
    data = src_file.read(io.DEFAULT_BUFFER_SIZE)
  File "/usr/lib/python3.6/zipfile.py", line 872, in read
    data = self._read1(n)
  File "/usr/lib/python3.6/zipfile.py", line 962, in _read1
    self._update_crc(data)
  File "/usr/lib/python3.6/zipfile.py", line 890, in _update_crc
    raise BadZipFile("Bad CRC-32 for file %r" % self.name)
zipfile.BadZipFile: Bad CRC-32 for file 'arxiv-release/train.txt'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tfds_download.py", line 5, in <module>
    _ = tfds.load(dataset)
  File "/home/forrest/.local/lib/python3.6/site-packages/tensorflow_datasets/core/api_utils.py", line 52, in disallow_positional_args_dec
    return fn(*args, **kwargs)
  File "/home/forrest/.local/lib/python3.6/site-packages/tensorflow_datasets/core/registered.py", line 302, in load
    dbuilder.download_and_prepare(**download_and_prepare_kwargs)
  File "/home/forrest/.local/lib/python3.6/site-packages/tensorflow_datasets/core/api_utils.py", line 52, in disallow_positional_args_dec
    return fn(*args, **kwargs)
  File "/home/forrest/.local/lib/python3.6/site-packages/tensorflow_datasets/core/dataset_builder.py", line 322, in download_and_prepare
    download_config=download_config)
  File "/home/forrest/.local/lib/python3.6/site-packages/tensorflow_datasets/core/dataset_builder.py", line 969, in _download_and_prepare
    max_examples_per_split=download_config.max_examples_per_split,
  File "/home/forrest/.local/lib/python3.6/site-packages/tensorflow_datasets/core/dataset_builder.py", line 825, in _download_and_prepare
    for split_generator in self._split_generators(dl_manager):
  File "/home/forrest/.local/lib/python3.6/site-packages/tensorflow_datasets/summarization/scientific_papers.py", line 108, in _split_generators
    dl_paths = dl_manager.download_and_extract(_URLS)
  File "/home/forrest/.local/lib/python3.6/site-packages/tensorflow_datasets/core/download/download_manager.py", line 374, in download_and_extract
    return _map_promise(self._download_extract, url_or_urls)
  File "/home/forrest/.local/lib/python3.6/site-packages/tensorflow_datasets/core/download/download_manager.py", line 415, in _map_promise
    res = utils.map_nested(_wait_on_promise, all_promises)
  File "/home/forrest/.local/lib/python3.6/site-packages/tensorflow_datasets/core/utils/py_utils.py", line 145, in map_nested
    for k, v in data_struct.items()
  File "/home/forrest/.local/lib/python3.6/site-packages/tensorflow_datasets/core/utils/py_utils.py", line 145, in <dictcomp>
    for k, v in data_struct.items()
  File "/home/forrest/.local/lib/python3.6/site-packages/tensorflow_datasets/core/utils/py_utils.py", line 159, in map_nested
    return function(data_struct)
  File "/home/forrest/.local/lib/python3.6/site-packages/tensorflow_datasets/core/download/download_manager.py", line 399, in _wait_on_promise
    return p.get()
  File "/home/forrest/.local/lib/python3.6/site-packages/promise/promise.py", line 512, in get
    return self._target_settled_value(_raise=True)
  File "/home/forrest/.local/lib/python3.6/site-packages/promise/promise.py", line 516, in _target_settled_value
    return self._target()._settled_value(_raise)
  File "/home/forrest/.local/lib/python3.6/site-packages/promise/promise.py", line 226, in _settled_value
    reraise(type(raise_val), raise_val, self._traceback)
  File "/home/forrest/.local/lib/python3.6/site-packages/six.py", line 696, in reraise
    raise value
  File "/home/forrest/.local/lib/python3.6/site-packages/promise/promise.py", line 844, in handle_future_result
    resolve(future.result())
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/forrest/.local/lib/python3.6/site-packages/tensorflow_datasets/core/download/extractor.py", line 96, in _sync_extract
    raise ExtractError(msg)
tensorflow_datasets.core.download.extractor.ExtractError: Error while extracting /home/forrest/tensorflow_datasets/downloads/ucexport_download_id_1K2kDBTNXS2ikx9xKmi2_Lls0d7kG6NhLh-6Qt4tNUbCaluqBYTWYYmp4CluxakNVBI4 to /home/forrest/tensorflow_datasets/downloads/extracted/ZIP.ucexport_download_id_1K2kDBTNXS2ikx9xKmi2_Lls0d7kG6NhLh-6Qt4tNUbCaluqBYTWYYmp4CluxakNVBI4 (file: arxiv-release/train.txt) : Bad CRC-32 for file 'arxiv-release/train.txt'
Eshan-Agarwal commented 4 years ago

@Conchylicultor I manually downloaded the data and ran the above script:

path = '/path/to/file.zip'
with tf.io.gfile.open(path, 'rb') as fobj:
  z = zipfile.ZipFile(fobj)
  for member in z.infolist():
    extract_file = z.open(member)
    print(member.filename)

I got AttributeError: module 'tensorflow_core._api.v2.io.gfile' has no attribute 'open', so I checked and replaced tf.io.gfile.open with tf.io.gfile.GFile and it worked.

After running the above script on the manually downloaded scientific_papers data, the output is:

pubmed-release/
pubmed-release/train.txt
__MACOSX/
__MACOSX/pubmed-release/
__MACOSX/pubmed-release/._train.txt
pubmed-release/vocab
__MACOSX/pubmed-release/._vocab
pubmed-release/test.txt
__MACOSX/pubmed-release/._test.txt
pubmed-release/val.txt
__MACOSX/pubmed-release/._val.txt
__MACOSX/._pubmed-release

This means the script runs successfully, but I still get the Bad CRC-32 error when using tfds.load("scientific_papers"). Is there anything we can change in the iter_zip function in core/extractor.py?

If so, I'd like to work on it.
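(One workaround along the lines of the iter_zip suggestion is to disable CRC verification for a single member. A sketch, relying on CPython's private ZipExtFile._expected_crc attribute — an implementation detail, not a public API, so it may break across Python versions:)

```python
import zipfile

def open_member_skip_crc(zf, name):
    """Open a zip member with CRC-32 verification disabled.

    CPython's ZipExtFile skips its end-of-file CRC comparison when the
    private _expected_crc attribute is None. This trades integrity
    checking for the ability to read a member whose stored CRC is wrong.
    """
    f = zf.open(name)
    f._expected_crc = None  # private attribute; CPython-specific
    return f
```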

Conchylicultor commented 4 years ago

You should use tf.io.gfile.GFile instead of open

Eshan-Agarwal commented 4 years ago

You should use tf.io.gfile.GFile instead of open

Yes, I am using tf.io.gfile.GFile, but I get the Bad CRC-32 error when the data is downloaded via tfds.load() rather than manually.

Eshan-Agarwal commented 4 years ago

@Conchylicultor I think nothing else is wrong; the train.txt files in the archives are simply corrupted. When I download the data and delete train.txt from the archive, everything runs fine.

Conchylicultor commented 4 years ago

@Eshan-Agarwal if you understand the bug and know a way to fix it, then please send a PR if you have time. I'm not sure I understand what the bug is.

Eshan-Agarwal commented 4 years ago

Okay I will try.

Eshan-Agarwal commented 4 years ago

@Conchylicultor I found that train.txt inside the scientific_papers zip file is corrupted, which is why the Bad CRC-32 error is raised, although I can still extract the data with the WinRAR app. So I think there are two options to resolve this issue: one is to raise a more informative exception saying the data is corrupted; the other is that I upload the data to Drive and update scientific_papers.py and the checksums file. What should I do?
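(On the checksums side: TFDS records a size and a SHA-256 digest per downloaded URL, so updating the checksums file means recomputing the digest of the replacement archive. A minimal sketch — the helper name is illustrative:)

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading in chunks
    so large archives do not need to fit in memory."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()
```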

Conchylicultor commented 4 years ago

It seems someone got the same issue with the same url: https://stackoverflow.com/questions/56204755/zipfile-extractall-raising-badzipfile-bad-crc-32-for-file-error

I'm wondering if this only happens with a specific version of Python or the zip library, because it seems extraction worked before.

Regarding uploading it to Drive and updating the links: we are not allowed to redistribute other people's datasets ourselves. Could you contact the original authors instead? Hopefully they will update their file.

Eshan-Agarwal commented 4 years ago

@Conchylicultor Thanks for helping; I hope they update the data soon.

Eshan-Agarwal commented 4 years ago

@Conchylicultor They updated the data. I'll send you a PR today with updated links and checksums.

Conchylicultor commented 4 years ago

Thank you for fixing this.

jieralice13 commented 4 years ago

@forrestbao May I ask how you successfully downloaded big_patent using tfds.load? I tried ds = tfds.load('big_patent'), but it threw out an error as follows:

ValueError: The dataset you're trying to generate is using Apache Beam. Beam datasets are usually very large and should be generated separately. Please have a look at https://www.tensorflow.org/datasets/beam_datasets#generating_a_beam_dataset for instructions.

I tried using Apache Beam but got OOM on my server. #2146

Thanks!

forrestbao commented 3 years ago

@jieralice13 I think the problem has been fixed now.