tensorflow / datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
https://www.tensorflow.org/datasets
Apache License 2.0

Corrupt files in the `dogs_vs_cats` dataset #2188

Closed tomerh2001 closed 1 year ago

tomerh2001 commented 4 years ago

Short description: I encountered this bug during my TensorFlow certification exam. When working with images from the dataset, the message "Corrupt JPEG data: 228 extraneous bytes before marker 0xd9" is printed again and again, and a single pass over the data takes forever. I couldn't complete my exam because of it.

Environment information

Reproduction instructions A very simple way to reproduce the bug:

import tensorflow_datasets as tfds

dataset_name = 'cats_vs_dogs'
dataset, info = tfds.load(name=dataset_name,
                          split=tfds.Split.TRAIN,
                          with_info=True)

for i in dataset:
    print(i)

Expected behavior: I expect to be able to iterate over all the images without getting errors and without a single pass taking forever to complete.

vijayphoenix commented 4 years ago

@tomergt45 I am unable to reproduce the bug.

As far as I can see, all corrupt images are removed already.

https://github.com/tensorflow/datasets/blob/921c0f86b8eeba863ce0af6523f34ac75d3d7529/tensorflow_datasets/image_classification/cats_vs_dogs.py#L104

Also, printing is an I/O operation, so it will take a lot of time to print the arrays for 20000+ images.

tomerh2001 commented 4 years ago

The problem happens when you just iterate over the data (without printing):

>>> import tensorflow_datasets as tfds
>>> tfds.__version__
'3.1.0'
>>> dataset = tfds.load('cats_vs_dogs')
>>> dataset = [i for i in dataset['train']]
Corrupt JPEG data: 214 extraneous bytes before marker 0xd9
Corrupt JPEG data: 228 extraneous bytes before marker 0xd9
Corrupt JPEG data: 396 extraneous bytes before marker 0xd9
Corrupt JPEG data: 65 extraneous bytes before marker 0xd9
Corrupt JPEG data: 1403 extraneous bytes before marker 0xd9
Corrupt JPEG data: 128 extraneous bytes before marker 0xd9
Warning: unknown JFIF revision number 0.00
Corrupt JPEG data: 252 extraneous bytes before marker 0xd9
Corrupt JPEG data: 162 extraneous bytes before marker 0xd9
Corrupt JPEG data: 1153 extraneous bytes before marker 0xd9
Corrupt JPEG data: 2226 extraneous bytes before marker 0xd9
Corrupt JPEG data: 99 extraneous bytes before marker 0xd9
Corrupt JPEG data: 239 extraneous bytes before marker 0xd9

It also happens when you try to fit a model with this data.

Eshan-Agarwal commented 4 years ago

@tomergt45 I am not able to reproduce the error. @vijayphoenix is correct, all 1738 corrupt images were skipped, see this colab.

tomerh2001 commented 4 years ago

@Eshan-Agarwal It still happens for me. I tried updating tfds to version 3.2.0, but I still get the same messages. Any idea why, or how I can fix it? :/

Full example:

>>> import tensorflow_datasets as tfds
2020-07-23 18:13:21.705749: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
>>> tfds.__version__
'3.2.0'
>>> dataset = tfds.load('cats_vs_dogs')
2020-07-23 18:13:46.057890: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-07-23 18:13:46.157089: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.665GHz coreCount: 68 deviceMemorySize: 11.00GiB deviceMemoryBandwidth: 573.69GiB/s
2020-07-23 18:13:46.168293: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-07-23 18:13:46.224450: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-07-23 18:13:46.267127: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-07-23 18:13:46.289066: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-07-23 18:13:46.338939: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-07-23 18:13:46.368177: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-07-23 18:13:46.444874: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-07-23 18:13:46.451455: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-07-23 18:13:46.458140: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-07-23 18:13:46.490376: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x23027dcb440 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-07-23 18:13:46.497384: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-07-23 18:13:46.503795: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.665GHz coreCount: 68 deviceMemorySize: 11.00GiB deviceMemoryBandwidth: 573.69GiB/s
2020-07-23 18:13:46.514591: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-07-23 18:13:46.519846: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-07-23 18:13:46.525979: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-07-23 18:13:46.532067: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-07-23 18:13:46.539066: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-07-23 18:13:46.545074: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-07-23 18:13:46.551015: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-07-23 18:13:46.556295: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-07-23 18:13:48.573072: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-23 18:13:48.578039: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]      0
2020-07-23 18:13:48.582122: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0:   N
2020-07-23 18:13:48.586650: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8589 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
2020-07-23 18:13:48.602605: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x230518534a0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-07-23 18:13:48.609512: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce RTX 2080 Ti, Compute Capability 7.5
>>> dataset = [i for i in dataset['train']]
Corrupt JPEG data: 214 extraneous bytes before marker 0xd9
Corrupt JPEG data: 228 extraneous bytes before marker 0xd9
Corrupt JPEG data: 396 extraneous bytes before marker 0xd9
Corrupt JPEG data: 65 extraneous bytes before marker 0xd9
Corrupt JPEG data: 1403 extraneous bytes before marker 0xd9
Corrupt JPEG data: 128 extraneous bytes before marker 0xd9
Warning: unknown JFIF revision number 0.00
Corrupt JPEG data: 252 extraneous bytes before marker 0xd9
Corrupt JPEG data: 162 extraneous bytes before marker 0xd9
Corrupt JPEG data: 1153 extraneous bytes before marker 0xd9
Corrupt JPEG data: 2226 extraneous bytes before marker 0xd9
Corrupt JPEG data: 99 extraneous bytes before marker 0xd9
Corrupt JPEG data: 239 extraneous bytes before marker 0xd9

vijayphoenix commented 4 years ago

However, I can reproduce this issue on Windows.

AswathKiruba commented 4 years ago

I encountered this bug during my TensorFlow certification exam yesterday.

Conchylicultor commented 4 years ago

It is possible that the code to auto-detect corrupted images does not work on Windows: https://github.com/tensorflow/datasets/blob/921c0f86b8eeba863ce0af6523f34ac75d3d7529/tensorflow_datasets/image_classification/cats_vs_dogs.py#L94

Or maybe there are additional images that are corrupted on Windows but work on Linux?

Unfortunately, I do not have access to any Windows computer, so I can't really debug this. If someone wants to help us investigate this, that would be great.
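
For readers who want to test the header-detection idea outside of TFDS, here is a minimal stdlib sketch of such a check. It assumes the detection looks for the bytes "JFIF" near the start of each file; the helper names `looks_like_jfif` and `find_non_jfif` are hypothetical and not part of any library. A header check like this would not by itself flag the "extraneous bytes" warnings, since those are reported by the JPEG decoder while reading the image data.

```python
from pathlib import Path


def looks_like_jfif(path, probe_bytes=10):
    """Return True if b'JFIF' occurs in the first `probe_bytes` bytes of the file."""
    with open(path, "rb") as fobj:
        return b"JFIF" in fobj.read(probe_bytes)


def find_non_jfif(folder):
    """List files directly under `folder` whose header lacks the JFIF tag."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and not looks_like_jfif(p)
    )
```

Running `find_non_jfif` on the same folder on Windows and on Linux and comparing the two lists would be one way to see whether the header check itself behaves differently across platforms.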

tomerh2001 commented 4 years ago

@Conchylicultor I tried checking it out, I added some print calls in each function of the CatsVsDogs class, and when running this code:

import tensorflow_datasets as tfds
dataset = tfds.load('cats_vs_dogs')
dataset = [i for i in dataset['train']]

Only the print in the _info function was called. I may be missing something, but perhaps the line you referenced earlier isn't being executed?

Edit: I'd like to point out I am not very familiar with how the TensorFlow Datasets API is structured.

Conchylicultor commented 4 years ago

@tomergt45 Thanks for looking into this. The generation code is executed only once, the first time the dataset is generated; afterward, the generated files are reused. To force the generation to run again, you can delete the existing generated files (in ~/tensorflow_datasets/cats_vs_dogs/).
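
The deletion step can be sketched in a few lines of stdlib Python. This assumes the default data directory mentioned above (~/tensorflow_datasets); if a different `data_dir` was passed to `tfds.load`, the same path would need to be passed here. The helper name `remove_generated_dataset` is hypothetical.

```python
import shutil
from pathlib import Path


def remove_generated_dataset(name, data_dir=Path.home() / "tensorflow_datasets"):
    """Delete the generated files for dataset `name`; return True if anything was removed."""
    target = Path(data_dir) / name
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False
```

After `remove_generated_dataset("cats_vs_dogs")`, the next `tfds.load('cats_vs_dogs')` should run the generation code again, so print calls added there become visible. Only the named dataset directory is removed; anything else under `data_dir` is left untouched.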

tomerh2001 commented 4 years ago

After investigating a bit, I managed to get the names of the corrupted images that were not skipped using this code:

import tensorflow_datasets as tfds
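import py  # third-party `py` package; captures stderr at the file-descriptor level

dataset = tfds.load('cats_vs_dogs')

capture = py.io.StdCaptureFD()

corrupt_images = []
for x in dataset['train']:
    # Read whatever was written to stdout/stderr since the previous iteration
    _, err = capture.readouterr()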
    if err:
        corrupt_images.append(x['image/filename'].numpy().decode())

which gave me the following output:

'PetImages\\Dog\\10880.jpg',
'PetImages\\Dog\\164.jpg',
'PetImages\\Cat\\11279.jpg',
'PetImages\\Dog\\11124.jpg',
'PetImages\\Dog\\621.jpg',
'PetImages\\Cat\\497.jpg',
'PetImages\\Cat\\8051.jpg',
'PetImages\\Dog\\6754.jpg',
'PetImages\\Dog\\3176.jpg',
'PetImages\\Cat\\9813.jpg',
'PetImages\\Cat\\10838.jpg',
'PetImages\\Dog\\4956.jpg'

Hope this helps.

EDIT: It's very weird, but every time you execute this code you get different file names. I'm not sure why.
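
The capture step above relies on the third-party `py` package. The same idea can be sketched with the standard library alone: redirect file descriptor 2 to a temporary file so that messages written by native code (such as the JPEG decoder warnings) are captured too. The name `capture_stderr_fd` is hypothetical, and this is only a sketch of the technique, not a drop-in replacement for `py.io.StdCaptureFD`.

```python
import os
import sys
import tempfile
from contextlib import contextmanager


@contextmanager
def capture_stderr_fd():
    """Redirect fd 2 to a temp file; yield a function that returns and clears the captured text."""
    sys.stderr.flush()
    saved_fd = os.dup(2)  # keep a handle to the real stderr
    with tempfile.TemporaryFile(mode="w+b") as tmp:
        os.dup2(tmp.fileno(), 2)  # fd 2 now writes into the temp file

        def read_and_reset():
            tmp.seek(0)
            data = tmp.read()
            # fd 2 shares the file offset with tmp, so rewind and empty the file
            tmp.seek(0)
            tmp.truncate()
            return data.decode(errors="replace")

        try:
            yield read_and_reset
        finally:
            sys.stderr.flush()
            os.dup2(saved_fd, 2)  # restore the real stderr
            os.close(saved_fd)
```

Inside the `with` block, calling `read_and_reset()` once per dataset element mirrors the loop above. Pairing a captured message with the current element assumes the warning is written while that exact element is being produced; if the input pipeline decodes images ahead of time or in parallel, that assumption may not hold.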

vijayphoenix commented 3 years ago

It is very likely that this is caused by the following: using tf.io.gfile together with Python's zipfile corrupts the data (for some reason, only on Windows). Similar issue: #2539

For more info, see tensorflow/tensorflow#32975
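
One way to narrow this down is to check whether the downloaded archive itself reads back intact through Python's built-in file API. The sketch below uses only the standard `zipfile` module, so it does not exercise tf.io.gfile at all; the helper name `first_bad_member` and the archive path you pass in are assumptions for illustration.

```python
import zipfile


def first_bad_member(archive_path):
    """Return the name of the first member that fails its CRC check, or None if all pass."""
    with zipfile.ZipFile(archive_path) as zf:
        # testzip() reads every member and verifies its CRC-32
        return zf.testzip()
```

If this returns None for the archive while the images extracted during dataset generation still trigger decoder warnings, that would point toward the read path rather than the download, though it would not prove it.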

neoh1 commented 3 years ago

I encountered this today while training a VGG model using:

Cubbli/Ubuntu 16.04.5 LTS (GNU/Linux 4.15.0-126-generic x86_64)
tensorflow 2.3.0 (tensorflow-gpu)
tensorflow_datasets (4.2.0) (installed with pip)
python 3.7 in anaconda3 environment
dataset = tfds.load('cats_vs_dogs', split=tfds.Split.TRAIN, data_dir='data/')

224/727 [========>.....................] - ETA: 41s - loss: 0.6927 - accuracy: 0.5188Corrupt JPEG data: 99 extraneous bytes before marker 0xd9
261/727 [=========>....................] - ETA: 38s - loss: 0.6925 - accuracy: 0.5247Warning: unknown JFIF revision number 0.00
273/727 [==========>...................] - ETA: 37s - loss: 0.6923 - accuracy: 0.5250Corrupt JPEG data: 396 extraneous bytes before marker 0xd9
317/727 [============>.................] - ETA: 33s - loss: 0.6916 - accuracy: 0.5286Corrupt JPEG data: 162 extraneous bytes before marker 0xd9
365/727 [==============>...............] - ETA: 29s - loss: 0.6907 - accuracy: 0.5312Corrupt JPEG data: 252 extraneous bytes before marker 0xd9
366/727 [==============>...............] - ETA: 29s - loss: 0.6907 - accuracy: 0.5312Corrupt JPEG data: 65 extraneous bytes before marker 0xd9
382/727 [==============>...............] - ETA: 28s - loss: 0.6905 - accuracy: 0.5332Corrupt JPEG data: 1403 extraneous bytes before marker 0xd9
541/727 [=====================>........] - ETA: 15s - loss: 0.6876 - accuracy: 0.5492Corrupt JPEG data: 214 extraneous bytes before marker 0xd9
644/727 [=========================>....] - ETA: 6s - loss: 0.6851 - accuracy: 0.5580Corrupt JPEG data: 2226 extraneous bytes before marker 0xd9
661/727 [==========================>...] - ETA: 5s - loss: 0.6846 - accuracy: 0.5594Corrupt JPEG data: 128 extraneous bytes before marker 0xd9
675/727 [==========================>...] - ETA: 4s - loss: 0.6841 - accuracy: 0.5607Corrupt JPEG data: 239 extraneous bytes before marker 0xd9
711/727 [============================>.] - ETA: 1s - loss: 0.6834 - accuracy: 0.5625Corrupt JPEG data: 1153 extraneous bytes before marker 0xd9
719/727 [============================>.] - ETA: 0s - loss: 0.6832 - accuracy: 0.5624Corrupt JPEG data: 228 extraneous bytes before marker 0xd9

owahlen commented 2 years ago

Similar problem:

Ubuntu 20.04.3 LTS 5.13.0-27-generic x86_64
python 3.9 anaconda environment
tensorflow 2.7.0 (installed with pip)
tensorflow_datasets 4.5.2 (installed with pip)

import tensorflow_datasets as tfds
dataset = tfds.load('cats_vs_dogs', split=tfds.Split.TRAIN)
list(dataset.as_numpy_iterator())

gives

2022-02-01 00:39:04.607453: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-02-01 00:39:05.039021: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 8267 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:42:00.0, compute capability: 7.5
Corrupt JPEG data: 99 extraneous bytes before marker 0xd9
Warning: unknown JFIF revision number 0.00
Corrupt JPEG data: 396 extraneous bytes before marker 0xd9
Corrupt JPEG data: 162 extraneous bytes before marker 0xd9
Corrupt JPEG data: 252 extraneous bytes before marker 0xd9
Corrupt JPEG data: 65 extraneous bytes before marker 0xd9
Corrupt JPEG data: 1403 extraneous bytes before marker 0xd9
Corrupt JPEG data: 214 extraneous bytes before marker 0xd9
Corrupt JPEG data: 2226 extraneous bytes before marker 0xd9
Corrupt JPEG data: 128 extraneous bytes before marker 0xd9
Corrupt JPEG data: 239 extraneous bytes before marker 0xd9
Corrupt JPEG data: 1153 extraneous bytes before marker 0xd9
Corrupt JPEG data: 228 extraneous bytes before marker 0xd9

Tropaion commented 2 years ago

So, is there some solution to this? I have the same problem on the latest Debian 11 and Windows 11 with the latest software.

Fatichti commented 2 years ago

Hi

Has no solution been found yet?

jchwenger commented 2 years ago

Same here, on Ubuntu 18.

Tropaion commented 2 years ago

The only thing that worked for me was using this software to filter the images: https://github.com/coderslagoon/BadPeggy

Alvov1 commented 1 year ago

Same on macOS Ventura :(

LeandroLosaria commented 1 year ago

2+ years later, this is still an issue.

I'm running the TensorFlow: Advanced Techniques Specialization Coursera Course 1 Week 4 quiz.

My env is

TorrCod commented 1 year ago

I encountered this today. Any solution?

enigma6174 commented 1 year ago

I came across the same problem too. I had downloaded the dataset from Kaggle and tried running it on my local machine, but when I called model.fit() the training stopped with an error.

My solution was to write code that tries to open each file and removes the file if there is any error. Also, if the number of channels (or dimensions) in the image is not 3 (red, green, blue channels), I remove the file as well. After running this code on the dataset, I was able to get the model to train without any issues.

My code:

from pathlib import Path
from tensorflow.io import read_file
from tensorflow.image import decode_image

# data_dir is of type Path and points to the parent dir
# parent dir contains the directories 'Dog' and 'Cat'
# run the same code for the dir 'Cat' to remove corrupt files 
for image in sorted((data_dir/'Dog').glob('*')):
    try:
        img = read_file(str(image))
        img = decode_image(img)

        if img.ndim != 3:
            # image.name is the bare file name on every OS (split('/') fails on Windows)
            print(f"[FILE_CORRUPT] {image.name} DELETED")
            image.unlink()

    except Exception as e:
        print(f"[ERR] {image.name}: {e} DELETED")
        image.unlink()

eclarson commented 1 year ago

I have seen a similar error in JPEG reading functions of several libraries, not just tensorflow, so I think this is an error in the underlying image decoding library employed. You can get around this issue by re-encoding and writing the JPEG images. It's an expensive operation, but you should only need to do it once.

I manipulated the image removal function provided for the dataset. On my machine, this fixed the Corrupt JPEG error. Note also that my directory name is "data/cats_dogs", which is different than the default directory name.

import os
import tensorflow as tf
from tensorflow.io import read_file, write_file
from tensorflow.image import decode_image

should_rewrite_image = True # set to true if you are getting Corrupt Data error
num_skipped = 0
for folder_name in ("Cat", "Dog"):
    folder_path = os.path.join('data/cats_dogs', folder_name)
    for fname in os.listdir(folder_path):
        fpath = os.path.join(folder_path, fname)
        is_jfif = True
        should_remove = False

        with open(fpath, "rb") as fobj:
            is_jfif = tf.compat.as_bytes("JFIF") in fobj.peek(10)

        try:
            img = read_file(fpath)
            if not tf.io.is_jpeg(img):
                should_remove = True

            img = decode_image(img)

            if img.ndim != 3:
                should_remove = True

        except Exception as e:
            should_remove = True

        if (not is_jfif) or should_remove:
            num_skipped += 1
            # Delete corrupted image
            os.remove(fpath)
        elif should_rewrite_image:
            tmp = tf.io.encode_jpeg(img)
            write_file(fpath, tmp)

print("Deleted %d images" % num_skipped)

Hope this helps others as a workaround.

andchir commented 1 year ago

Python 3.10:

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import py
import tensorflow as tf
from tensorflow.io import read_file, write_file
from tensorflow.image import decode_jpeg

num_deleted = 0
for folder_name in ("Cat", "Dog"):
    folder_path = os.path.join("PetImages", folder_name)
    for index, fname in enumerate(os.listdir(folder_path)):
        capture = py.io.StdCaptureFD()
        fpath = os.path.join(folder_path, fname)
        is_jfif = True
        should_remove = False
        # A with-block closes the file and avoids a NameError if open() itself fails
        with open(fpath, "rb") as fobj:
            is_jfif = tf.compat.as_bytes("JFIF") in fobj.peek(10)

        img_d = None
        try:
            img = read_file(fpath)
            if not tf.io.is_jpeg(img):
                should_remove = True

            img_d = decode_jpeg(img)

            if img_d.ndim != 3:
                should_remove = True

        except Exception as e:
            print('ERROR', fpath, str(e))
            should_remove = True

        _, err = capture.reset()
        if err and 'Corrupt JPEG data' in err:
            should_remove = True
            print('ERROR', fpath, err)

        if not is_jfif or should_remove:
            num_deleted += 1
            # Delete corrupted image
            os.remove(fpath)

monjoybme commented 1 year ago

A simple working code:

from tensorflow.io import read_file
from tensorflow.image import decode_image
import glob
import os
data_dir = '/data/Cat/*.jpg'
for image in sorted(glob.glob(data_dir)):
    img = read_file(str(image))
    img = decode_image(img)
    if img.shape[2] != 3:
        print(image)
        os.remove(image)