keras-team / keras-preprocessing

Utilities for working with image data, text data, and sequence data.
Other
1.02k stars 444 forks source link

autoclose files opened in load_img #261

Closed duhaime closed 4 years ago

duhaime commented 4 years ago

Summary

The current load_img function opens files but never closes them, which can cause OSError: [Errno 24] Too many open files if one reads in too many images in a process.

Related Issues

https://github.com/keras-team/keras-preprocessing/issues/260

PR Overview

duhaime commented 4 years ago

Thanks for your quick followup @rragundez! I had thought the same, but it seems closing the filehandler before returning the object forbids a number of operations on the returned object. Minimal example:

import numpy as np
from PIL import Image as pil_image

im = pil_image.open('cats.jpg')
im.close()
im.getdata()

That will throw ValueError: Operation on closed image.

For what it's worth, the approach here is straight from the Pillow docs on file handling. That said, I'm happy to make any changes you see fit!

rragundez commented 4 years ago

Thanks for your explanations. After having a closer look I think the root of the problem is the inconsistency load_img can return only a pointer to the file if no data augmentation is done (keeps file open) and an actual image in memory if data augmentation is done (no open file). I think this inconsistency is gone by simply changing img = pil_image.open(path) to img = pil_image.load(path) in https://github.com/keras-team/keras-preprocessing/blob/master/keras_preprocessing/image/utils.py#L112

The drawback is of course that for no augmentation the image object occupies more memory but since this function is used in the Iterator class in a loop where it gets immediately converted to an array then is OK.

I would do the change above and simply add a unit-test that load_img is actually returning an image in memory and not a pointer to the file, with and without augmentation. What do you think?

duhaime commented 4 years ago

@rragundez Your approach (with pil_image.load(path) as img) is much cleaner, nice research. I just replaced the io.BytesIO business with that approach, and added a test to validate that load_img does indeed return an image in memory.

How does the updated PR look?

duhaime commented 4 years ago

@rragundez ah I was testing against a cached version of the module--it appears the load method is missing from PIL.Image--where did you see that method?

rragundez commented 4 years ago

It appears it acts over the Image class objects

https://pillow.readthedocs.io/en/3.1.x/reference/Image.html?highlight=load#PIL.Image.Image.load

then something like pil_image.open().load() I think is appropriate.

duhaime commented 4 years ago

Hmm, these two return different objects:

pil_image.open('a.png')        # returns `PIL.PngImagePlugin.PngImageFile` object
pil_image.open('a.png').load() # returns `PixelAccess` object 

I understand the Pillow docs on the open method to indicate that PIL.Image.open() opens the file (and never closes it) but waits to read the pixel data into RAM. load() evidently is a method that one can call to manually force the PIL.Image object to yield up its pixel data.

That said, combining open and load don't seem to close the file handlers. Returning PixelAccess data rather than a PIL Image object also breaks some methods in keras_preprocessing; e.g. you can't call img_to_array on the PixelAccess data:

>>> from keras.preprocessing.image import img_to_array
Using TensorFlow backend.
>>> from PIL import Image as pil_image
>>> a = pil_image.open('a.png').load()
>>> img_to_array(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/doug/anaconda/envs/3.5/lib/python3.5/site-packages/keras/preprocessing/image.py", line 75, in img_to_array
    return image.img_to_array(img, data_format=data_format, dtype=dtype)
  File "/Users/doug/anaconda/envs/3.5/lib/python3.5/site-packages/keras_preprocessing/image/utils.py", line 299, in img_to_array
    x = np.asarray(img, dtype=dtype)
  File "/Users/doug/anaconda/envs/3.5/lib/python3.5/site-packages/numpy/core/numeric.py", line 538, in asarray
    return array(a, dtype, copy=False, order=order)
TypeError: float() argument must be a string or a number, not 'PixelAccess'

It's entirely possible I misunderstood you or am missing something! From my perspective, using the io stream method outlined in the Pillow docs seems the best way forward...

duhaime commented 4 years ago

@rragundez sounds good! I just submitted some updates with the changes you requested. The flake8 checks are of course going to fail because of the undefined unicode in the text module. I factored the changes that rectify that problem into a separate PR per your request above.

rragundez commented 4 years ago

I think is good now, I'll have a closer look tomorrow :)

duhaime commented 4 years ago

thanks for your help with this!

rragundez commented 4 years ago

No problem, I don't understand why I still see changes in keras_preprocessing/text.py

rragundez commented 4 years ago

thanks for the work @duhaime!