python-pillow / Pillow

Python Imaging Library (Fork)
https://python-pillow.org
Other
12.29k stars 2.23k forks source link

"close" syscall run twice on same file descriptor when loading binary LZW-compressed TIFF #7042

Closed philip-bl closed 1 year ago

philip-bl commented 1 year ago

If I open an LZW-compressed binary TIFF file using pillow and load its data, pillow calls the "close" syscall on one of the two relevant file descriptors twice. The second call results in a "EBADF (Bad file descriptor)" error. The program doesn't crash and everything seems to work ok. But if multithreading is used and a race condition happens, this might crash the program.

In this example, the expected behaviour would be that there is only one "close" syscall on each of the two relevant file descriptors and that there is no "EBADF" error. Below, I show how to reproduce the problem.

$ cat create_tiff_file.py
#!/usr/bin/env python

import numpy as np
import PIL.Image as Image

Image.fromarray(np.random.rand(500, 600) > 0.9) \
    .save("example.tif", format="TIFF", compression="tiff_lzw")
$ cat read_tiff_file.py
#!/usr/bin/env python

import PIL.Image as Image

print("Right before the with-block", flush=True)
with Image.open("example.tif") as mask:
    print("Right in the beginning of the with-block", flush=True)
    mask.load()
    print("Right after load()", flush=True)
print("Right after the with-block", flush=True)
$ mamba create -n pillow_double_close_env --override-channels --channel anaconda python=3.11  # omitting the output
$ mamba activate pillow_double_close_env  # omitting the output
$ pip install numpy pillow  # omitting the output
$ conda list
# packages in environment at /home/philip/soft/mambaforge/envs/pillow_double_close_env:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main    anaconda
_openmp_mutex             5.1                       1_gnu    anaconda
bzip2                     1.0.8                h7b6447c_0    anaconda
ca-certificates           2023.01.10           h06a4308_0    anaconda
certifi                   2022.9.24       py311h06a4308_0    anaconda
ld_impl_linux-64          2.38                 h1181459_1    anaconda
libffi                    3.4.2                h6a678d5_6    anaconda
libgcc-ng                 11.2.0               h1234567_1    anaconda
libgomp                   11.2.0               h1234567_1    anaconda
libstdcxx-ng              11.2.0               h1234567_1    anaconda
libuuid                   1.41.5               h5eee18b_0    anaconda
ncurses                   6.4                  h6a678d5_0    anaconda
numpy                     1.24.2                   pypi_0    pypi
openssl                   1.1.1s               h7f8727e_0    anaconda
pillow                    9.4.0                    pypi_0    pypi
pip                       22.2.2          py311h06a4308_0    anaconda
python                    3.11.0               h7a1cb2a_2    anaconda
readline                  8.2                  h5eee18b_0    anaconda
setuptools                65.5.0          py311h06a4308_0    anaconda
sqlite                    3.40.1               h5082296_0    anaconda
tk                        8.6.12               h1ccaba5_0    anaconda
tzdata                    2022a                hda174b7_0    anaconda
wheel                     0.37.1             pyhd3eb1b0_0    anaconda
xz                        5.2.10               h5eee18b_1    anaconda
zlib                      1.2.13               h5eee18b_0    anaconda
$ pip freeze
certifi @ file:///croot/certifi_1669749558946/work/certifi
numpy==1.24.2
Pillow==9.4.0
$ python create_tiff_file.py
$ strace -e decode-fds=path -e trace=openat,close ./read_tiff_file.py
# omitting all the output except for the few last lines, which are the most relevant
Right in the beginning of the with-block
close(4</pillow_double_close_bug_report/example.tif>) = 0
close(4)                                = -1 EBADF (Bad file descriptor)
close(3</pillow_double_close_bug_report/example.tif>) = 0
Right after load()
Right after the with-block
+++ exited with 0 +++

Also, I want to note that this caused my training of a neural network to crash once. It happened because Pytorch's DataLoader uses multithreading and a race condition happened.

nurumaik commented 1 year ago

Updated example that actually leads to bugs

create_tiff_file.py

#!/usr/bin/env python

import numpy as np
import PIL.Image as Image

# Create really big file so it will take time to load. But even bigger file will cause libtiff to complain that we are trying to zip-bomb it
Image.fromarray(np.random.rand(5000, 30000) > 0.9) \  
    .save("example.tif", format="TIFF", compression="tiff_lzw")

reproduce.py

#!/usr/bin/env python
from threading import Thread
import os
from collections import deque

finish = False

def load_files():
    descriptors = deque()
    while not finish:
        try:
            fd = os.open("test", os.O_RDONLY)
            descriptors.append(fd)
        except:  # in case of too many descriptors
            os.close(descriptors.pop())

    for d in descriptors:
        os.close(d)

    print("success")

t = Thread(target=load_files)

import PIL.Image as Image

print("Right before the with-block", flush=True)

with Image.open("example.tif") as mask:
    print("Right in the beginning of the with-block", flush=True)
    t.start()
    mask.load()  # replace it with time.sleep(10) and it works just fine without exceptions
    print("Right after load()", flush=True)
print("Right after the with-block", flush=True)
finish = True
t.join()

Steps to reproduce:

touch test
./create_tiff_file.py
./reproduce.py

Note: you need to delete example.tif and test before every iteration of running this. otherwise the OS might cache the files and thus the problem won't reproduce because libtiff will load file too fast

Expected behavior: load_files outputs "success"

Actual behavior:

Right in the beginning of the with-block
Right after load()
Right after the with-block
Exception in thread Thread-1 (load_files):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "reproduce.py", line 17, in load_files
    os.close(d)
OSError: [Errno 9] Bad file descriptor

Cause of bug

  1. tiff loader creates file duplicate of file descriptor for use in external c library code https://github.com/python-pillow/Pillow/blob/5832362215f5fa661fa05ad31e6c97aab87d410a/src/PIL/TiffImagePlugin.py#L1256

  2. As comments say "libtiff closes the file descriptor, so pass in a dup.", libtiff probably closes fp after decoding here https://github.com/python-pillow/Pillow/blob/5832362215f5fa661fa05ad31e6c97aab87d410a/src/PIL/TiffImagePlugin.py#L1298

  3. Pillow closes fp again https://github.com/python-pillow/Pillow/blob/5832362215f5fa661fa05ad31e6c97aab87d410a/src/PIL/TiffImagePlugin.py#L1308

If another thread decides to open new file between steps 2 and 3 (like load_files in example), it may get file descriptor that was just closed by libtiff. Then, in step 3, pillow closes this file, making descriptor invalid

nurumaik commented 1 year ago

This fixes the bug, but I'm not sure that libtiff will always close the descriptor, even if some error occurs

diff --git a/src/PIL/TiffImagePlugin.py b/src/PIL/TiffImagePlugin.py
index 3d4d0910a..d5ea99075 100644
--- a/src/PIL/TiffImagePlugin.py
+++ b/src/PIL/TiffImagePlugin.py
@@ -1296,6 +1296,8 @@ class TiffImageFile(ImageFile.ImageFile):
                 self.fp.seek(0)
             # 4 bytes, otherwise the trace might error out
             n, err = decoder.decode(b"fpfp")
+            # fp should be closed after call to decode
+            fp = None
         else:
             # we have something else.
             logger.debug("don't have fileno or getvalue. just reading")
radarhere commented 1 year ago

Because it's unclear if libtiff will close the file descriptor or not, what about this as an alternative solution? It checks if the file descriptor is open first, by trying to read 0 bytes.

I've tested it against create_tiff_file.py and reproduce.py and it works.

diff --git a/src/PIL/TiffImagePlugin.py b/src/PIL/TiffImagePlugin.py
index 3d4d0910a..97d184f5f 100644
--- a/src/PIL/TiffImagePlugin.py
+++ b/src/PIL/TiffImagePlugin.py
@@ -1305,6 +1305,9 @@ class TiffImageFile(ImageFile.ImageFile):

         if fp:
             try:
+                # Test if the file descriptor is still open
+                os.read(fp, 0)
+
                 os.close(fp)
             except OSError:
                 pass
nurumaik commented 1 year ago

I reproduced the problem with this fix. Unfortunately, reproduction code doesn't work in all cases, and I don't know how to make a better version without hooking something in libtiff. I can only suggest storing file "test" on some very slow drive to ensure read is really slow

How this can still lead to a bug:

  1. We call os.dup on self.fd and get descriptor, let's say 4
  2. decode finishes and calls close(4)
  3. Other thread opens another file, and os gives it the same number 4 as descriptor, since it's already freed
  4. We call os.read(4, 0) on file descriptor and it successfully reads that other file
  5. We close file that doesn't belong to us

I checked libtiff code and it should always close the descriptor after performing decode. It will stay open only if libtiff contains a bug (but this will probably lead to much more serious errors anyway)

https://github.com/python-pillow/Pillow/blob/main/src/libImaging/TiffDecode.c#L723

If libtiff may not close the fd in some cases, the proper fix would be to either:

  1. Somehow report back if it closed fd or not
  2. Disable fd closing in libtiff
radarhere commented 1 year ago

As to why we are closing the file pointer ourselves in the first place, it is because of #5936

wiredfool commented 1 year ago

As to why we're duping the file pointer in the firster place, I think the original logic was:

One option was to use python/our code for all the byte shuffling at the file level. IIRC there was a pretty decent performance and complexity improvement for cases when we did have a file pointer to be able to send that into libtiff and let it manage all of that.

(edit: yeah, that's definitely my comment style)

crazyyao0 commented 1 year ago

I met the same problem. Agree with @nurumaik. A workaround is changing the code from

im = Image.open(file_path)

to

im = Image.open(io.BytesIO(open(file_path, "rb").read()))
radarhere commented 1 year ago

@crazyyao0 do you have a different way to reproduce the bug that you could share in a self-contained example?

radarhere commented 1 year ago

To be clear, applying the suggested fix from https://github.com/python-pillow/Pillow/issues/7042#issuecomment-1489020934 does cause the scenario from #5936 to fail, which would otherwise have passed.

radarhere commented 1 year ago

I've created PR #7199 to stop duplicating the file pointer in the first place. Does that resolve this?

scskrause commented 1 month ago

I don't think this issue is fixed completely. I can still reproduce the "Bad file descriptor" error both on Windows and Linux in heavily multi-threaded applications, e.g. with the following script:

import argparse
import tempfile
import threading
from PIL import Image

parser = argparse.ArgumentParser()
parser.add_argument('filename')
args = parser.parse_args()

def test():
    with Image.open(args.filename) as im, tempfile.NamedTemporaryFile('wb') as f:
        im.save(f, format="TIFF", compression="tiff_lzw")

threads = []

for i in range(32):
    t = threading.Thread(target=test, daemon=True)
    t.start()
    threads.append(t)

for t in threads:
    t.join()

Run it with the attached example tiff (had to gzip it here so that Github would accept it) 3-pages.tiff.gz and you will see many Bad file descriptor error messages.

It seems that the save() method is responsible for the double fd close here.

radarhere commented 1 month ago

I pushed a commit to GitHub Actions, and saw no errors on Ubuntu or Windows. What version of Pillow and libtiff are you using?

scskrause commented 1 month ago

I tested it with the latest stable release 10.4.0 with the binary wheels from PyPI. Maybe increasing the number of threads could help?

scskrause commented 1 month ago

Since it's a race-condition, the issue doesn't happen on every run. On my local machine maybe half the time. This is what repated runs look like on Windows with Python 3.12 and Pillow 10.4.0:

tiff

radarhere commented 1 month ago

I've created #8458 to address this.

Here are some wheels for you to test if you would like to verify the fix - wheels.zip

scskrause commented 1 month ago

Looks good to me, I tested with the provided pillow-11.0.0.dev0-cp312-cp312-win_amd64.whl and couldn't reproduce the issue anymore.