tensorflow / datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
https://www.tensorflow.org/datasets
Apache License 2.0

Unknown error appears when I use the UCF101 dataset. Perhaps some bug exists in the file extractor.py. #2539

Open Acebee opened 4 years ago

Acebee commented 4 years ago

Short description: When I try to use the UCF101 dataset, the program reports something like this:

tensorflow.python.framework.errors_impl.OutOfRangeError: E:\tfdsdata\datasets\ucf\downloads\thumos14_files_UCF101_videosxm55JXkGdBSDxwckqpN5c7GNr_LXm9dTyoJdpxR_aas.zip; Unknown error

Environment information

Reproduction instructions

mnist_train = tfds.load(name="ucf101", data_dir="E:\\tfdsdata\\datasets\\ucf")

or just reproduce the problem like this:

# something.zip refers to any zip file
import zipfile

import tensorflow.compat.v2 as tf

# Open the archive through tf.io.gfile ('rb', as extractor.py does) and hand it to zipfile.
with tf.io.gfile.GFile('E:\\tfdsdata\\datasets\\ucf\\downloads\\something.zip', 'rb') as f_obj:
    z = zipfile.ZipFile(f_obj)

Link to logs

Expected behavior: I looked into the extractor.py file and found the reason. It seems that when zipfile.ZipFile() tries to read a file that is wrapped by tf.io.gfile.GFile, it throws an exception.

@contextlib.contextmanager
def _open_or_pass(path_or_fobj):
  if isinstance(path_or_fobj, six.string_types):
    with tf.io.gfile.GFile(path_or_fobj, 'rb') as f_obj:
      yield f_obj
  else:
    yield path_or_fobj
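
As an illustration only (not the TFDS implementation), one way to avoid handing zipfile the GFile object is to read the whole archive through GFile first and wrap the bytes in io.BytesIO; the function name open_zip_from_gfile below is just an example:

import io
import zipfile

import tensorflow as tf


def open_zip_from_gfile(path):
  """Open a zip archive without passing the GFile object itself to zipfile."""
  with tf.io.gfile.GFile(path, 'rb') as f_obj:
    data = f_obj.read()  # read the whole archive into memory
  # zipfile now gets a plain, seekable in-memory buffer instead of a GFile.
  return zipfile.ZipFile(io.BytesIO(data))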

I managed to work around this problem by not using the wrapped file object, something like this:

...
def iter_zip(arch_f):
  """Iterate over zip archive."""
  with _open_or_pass(arch_f) as fobj:
    ########
    z = zipfile.ZipFile(fobj)  # change this line
    ########
    for member in z.infolist():
      extract_file = z.open(member)
      if member.is_dir():  # Filter directories  # pytype: disable=attribute-error
        continue
      path = _normpath(member.filename)
      if not path:
        continue
      yield [path, extract_file]

The patched version passes the path to zipfile.ZipFile directly instead of the wrapped file object:

def iter_zip(arch_f):
  """Iterate over zip archive."""
  with _open_or_pass(arch_f) as fobj:
    ########
    z = zipfile.ZipFile(arch_f)  # pass the original path, not the GFile-wrapped fobj
    ########
    for member in z.infolist():
      extract_file = z.open(member)
      if member.is_dir():  # Filter directories  # pytype: disable=attribute-error
        continue
      path = _normpath(member.filename)
      if not path:
        continue
      yield [path, extract_file]
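
For completeness, a hypothetical usage of the patched iter_zip (assuming the internal helpers _open_or_pass and _normpath are available, and using the placeholder archive from the reproduction above):

for path, extract_file in iter_zip('E:\\tfdsdata\\datasets\\ucf\\downloads\\something.zip'):
  print(path, len(extract_file.read()))  # list each member and its size in bytes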


vijayphoenix commented 3 years ago

Thanks for reporting. It seems that using tf.io.gfile with Python's zipfile results in corruption of the data (for some reason, on Windows only).

Related https://github.com/tensorflow/tensorflow/issues/32975
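
One illustrative way to check for this on an affected machine (not an official diagnostic) is to compare the bytes returned by tf.io.gfile with those returned by the built-in open() for the same archive; the path below is just the placeholder file from the reproduction above:

import tensorflow as tf

path = 'E:\\tfdsdata\\datasets\\ucf\\downloads\\something.zip'  # any local zip file

with tf.io.gfile.GFile(path, 'rb') as f:
  gfile_bytes = f.read()   # bytes as seen through tf.io.gfile
with open(path, 'rb') as f:
  plain_bytes = f.read()   # bytes as seen through the built-in open()

print('identical:', gfile_bytes == plain_bytes)  # False would indicate corruption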

parasol4791 commented 3 years ago

I am actually seeing the same issue on Ubuntu 20.04, with the 'cats_vs_dogs' dataset. Fine-tuning EfficientNet B4, I get similar errors. A separate investigation showed that the 'corrupted' file names change on every run.
Epoch 1/20
 75/234 [========>.....................] - ETA: 43s - loss: 0.3529 - accuracy: 0.9261Corrupt JPEG data: 99 extraneous bytes before marker 0xd9
119/234 [==============>...............] - ETA: 31s - loss: 0.3379 - accuracy: 0.9352Corrupt JPEG data: 65 extraneous bytes before marker 0xd9
205/234 [=========================>....] - ETA: 7s - loss: 0.3345 - accuracy: 0.9424Corrupt JPEG data: 2226 extraneous bytes before marker 0xd9
215/234 [==========================>...] - ETA: 5s - loss: 0.3345 - accuracy: 0.9428Corrupt JPEG data: 239 extraneous bytes before marker 0xd9
227/234 [============================>.] - ETA: 1s - loss: 0.3343 - accuracy: 0.9434Corrupt JPEG data: 1153 extraneous bytes before marker 0xd9
229/234 [============================>.] - ETA: 1s - loss: 0.3343 - accuracy: 0.9435Corrupt JPEG data: 228 extraneous bytes before marker 0xd9
234/234 [==============================] - ETA: 0s - loss: 0.3342 - accuracy: 0.9437Corrupt JPEG data: 65 extraneous bytes before marker 0xd9