pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

torchvision.io.read_image does not always fail gracefully #3613

Closed ghost closed 3 years ago

ghost commented 3 years ago

🐛 Bug

torchvision.io.read_image() will sometimes segfault or abort in other uncatchable ways on malformed images, rather than failing gracefully (e.g. with a RuntimeError).

To Reproduce

Steps to reproduce the behavior:

  1. Download a problematic image file (one that I have found is here)
  2. Try to load the image with torchvision.io.read_image:
    >>> import torchvision
    >>> image = torchvision.io.read_image("283xnnabju4z.png")
    libpng warning: iCCP: known incorrect sRGB profile
    munmap_chunk(): invalid pointer
    Aborted (core dumped)

Expected behavior

I expected that trying to read an unsupported or malformed image would instead raise a RuntimeError or other catchable error so that it could be handled in code, rather than aborting.
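The kind of handling this expected behavior would allow can be sketched roughly as follows. This is only an illustration: load_readable_images and its reader parameter are hypothetical names, and the sketch assumes the decoder raises RuntimeError on bad input. On the affected versions the process aborts before any except clause can run, so this does not work around the bug.

```python
import logging

logger = logging.getLogger(__name__)


def load_readable_images(paths, reader):
    """Decode each path with `reader`, skipping files that raise RuntimeError.

    `reader` is any callable taking a path and returning a decoded image
    (assumed here to raise RuntimeError on malformed input).
    Returns a (loaded, skipped) pair: a dict of path -> image and a list of
    the paths that could not be decoded.
    """
    loaded = {}
    skipped = []
    for path in paths:
        try:
            loaded[path] = reader(path)
        except RuntimeError as exc:
            # Log the filename and move on instead of stopping the whole run.
            logger.warning("Skipping %s: %s", path, exc)
            skipped.append(path)
    return loaded, skipped
```

With torchvision installed, the call would look like load_readable_images(paths, torchvision.io.read_image).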

Environment

PyTorch version: 1.8.1+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: 6.0.0-1ubuntu2 (tags/RELEASE_600/final)
CMake version: version 3.20.0

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: GeForce GTX 1050
Nvidia driver version: 460.67
cuDNN version: /usr/local/cuda-10.2/lib64/libcudnn.so.7.6.4
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.1
[pip3] torch==1.8.1
[pip3] torchvision==0.9.1

Additional context

Something even stranger also happens with this particular image: setting the mode to ImageReadMode.RGB allows it to be read once, but attempting to read it a second time fails as above (i.e. torchvision.io.read_image is not idempotent). I'm not sure whether this behavior is related, but whatever the root cause is, it would be nice to be able to just catch an error, e.g. to log the filename and skip the image during processing.

>>> import torchvision
>>> image = torchvision.io.read_image("283xnnabju4z.png", mode=torchvision.io.image.ImageReadMode.RGB)
libpng warning: iCCP: known incorrect sRGB profile
>>> image.shape
torch.Size([3, 1410, 2048])
>>> image = torchvision.io.read_image("283xnnabju4z.png", mode=torchvision.io.image.ImageReadMode.RGB)
libpng warning: iCCP: known incorrect sRGB profile
munmap_chunk(): invalid pointer
Aborted (core dumped)

Some quick investigation shows that the problematic images that exhibit this behavior are usually PNGs with a depth of 16 bits. OpenCV and PIL do not appear to have problems reading them.
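For anyone who wants to find such files ahead of time, the bit depth can be read straight from the PNG header without decoding the image. A minimal sketch using only the standard library follows; png_bit_depth is a hypothetical helper name, and it assumes the file starts with the standard 8-byte PNG signature followed by an IHDR chunk, as the PNG specification requires.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_bit_depth(path):
    """Return the bit depth stored in a PNG's IHDR chunk, or None if the
    file does not start with a PNG signature and IHDR chunk."""
    with open(path, "rb") as f:
        # 8-byte signature + 4-byte chunk length + 4-byte chunk type
        # + width (4) + height (4) + bit depth (1) + color type (1)
        header = f.read(26)
    if len(header) < 26:
        return None
    if header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        return None
    _width, _height, bit_depth, _color_type = struct.unpack(">IIBB", header[16:26])
    return bit_depth
```

A caller could then route files where png_bit_depth(path) returns 16 to PIL or OpenCV instead of torchvision.io.read_image.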

Additionally, the error message changes sometimes, e.g. to "Segmentation fault" or "double free or corruption (out)".
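Because these failures kill the interpreter rather than raising, one possible stopgap is to attempt the decode in a child process first, so that an abort or segfault only takes down the child. The sketch below is an assumption-laden illustration, not something torchvision provides: decodes_cleanly and DECODE_SNIPPET are hypothetical names, and the default snippet assumes torchvision is importable in the child interpreter.

```python
import subprocess
import sys

# Hypothetical probe: decode the file named in argv[1] in a fresh interpreter.
DECODE_SNIPPET = "import sys, torchvision; torchvision.io.read_image(sys.argv[1])"


def decodes_cleanly(path, snippet=DECODE_SNIPPET, timeout=60):
    """Run `snippet` on `path` in a child process and report whether it
    exited with status 0.

    A crash (abort, segfault), an uncaught exception, or a timeout in the
    child all count as failure; the parent process keeps running either way.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", snippet, path],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    # On POSIX a negative return code means the child was killed by a signal.
    return result.returncode == 0
```

This costs one interpreter start-up per file, so it is better suited to a one-off scan of a dataset than to a training loop.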

fmassa commented 3 years ago

Thanks for the report!

We will be looking into fixing this!

andfoy commented 3 years ago

Thanks for the information @apisutilis, I'll take a detailed look into this one!

andfoy commented 3 years ago

It seems the error happens when the PNG reading function tries to destroy the PNG read structure after catching the error. In other words, torchvision does catch the libpng error, but the subsequent call to png_destroy_read_struct segfaults at

https://github.com/pytorch/vision/blob/978ba613518dd5b8d04c2a717e6e5ccf7fb172c3/torchvision/csrc/io/image/cpu/decode_png.cpp#L35

That call in turn reaches https://github.com/glennrp/libpng/blob/a37d4836519517bdce6cb9d956092321eca3e73b/pngread.c#L948, where png_free is an alias for free, so this error is related to memory management. I checked whether big_row_buf was NULL, but it wasn't.

In my reproduction scenario, torchvision was able to load the image once, but the second call caused the segfault and produced the message "libpng error: IDAT: bad parameters to zlib". According to this issue, https://github.com/ContinuumIO/anaconda-issues/issues/7315, it might be related to the version of zlib used when libpng is invoked. A user there commented that the segfault occurred on the second call to libpng, which is the same scenario we are seeing here.

The solution proposed in that issue involves downgrading zlib (which I haven't verified myself). I'll try to compile zlib as well as libpng to see if we can get more information.

fmassa commented 3 years ago

@andfoy did you have the chance to look at this again?

andfoy commented 3 years ago

@fmassa I haven't tried to compile zlib locally yet; I'll give it a go tomorrow!

NicolasHug commented 3 years ago

Closing, since with #4101 torchvision will now fail gracefully.

@fmassa should we open another issue to track progress on support for PNGs with more than 8 bits?

fmassa commented 3 years ago

@NicolasHug yes, it would be good to have an issue to track supporting PNGs with more than 8 bits.