imageio / imageio

Python library for reading and writing image data
https://imageio.readthedocs.io
BSD 2-Clause "Simplified" License

SeekableFileObject needs `readline()` for Pillow plugin #1007

Open Dotrar opened 1 year ago

Dotrar commented 1 year ago

Hey all,

I've found an issue that trips up imageio while it is trying every plugin, so it ends up saying a certain image is not supported.

I'm trying to open a url of a larger JPG from an s3 storage. It's unable to ascertain the file-type, so it goes for the "try every plugin" approach:

  1. in imopen, if we can't find the resource, we will "try everything" here

  2. We use Request which will return a SeekableFileObject as shown here

  3. Using Pillow ImImagePlugin to read a file, we attempt a readline as seen here @ Pillow

But SeekableFileObject does not have readline, causing an error. https://github.com/imageio/imageio/blob/master/imageio/core/request.py#L653
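The failure mode described above can be sketched without imageio or Pillow. `MinimalBinaryFile` and `sniff_header_line` below are hypothetical stand-ins, not the real `SeekableFileObject` or the real Pillow plugin; the only assumption carried over from the report is that the file object offers `read`/`seek`/`tell` but no `readline`, while the decoder calls `readline`.

```python
import io


class MinimalBinaryFile:
    """Hypothetical stand-in: exposes read/seek/tell but no readline."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)

    def read(self, n=-1):
        return self._buf.read(n)

    def seek(self, offset, whence=0):
        return self._buf.seek(offset, whence)

    def tell(self):
        return self._buf.tell()


def sniff_header_line(fp):
    """Hypothetical decoder step that expects a line-based header."""
    return fp.readline()


f = MinimalBinaryFile(b"\xff\xd8\xff\xe0 not a line-based header")
try:
    sniff_header_line(f)
    error = None
except AttributeError as exc:
    # The missing method surfaces as AttributeError, not as a
    # "this format is not supported" signal from the decoder.
    error = exc

print(type(error).__name__, "-", error)
```

If a plugin loop only expects "cannot read this format" style errors from each candidate, an `AttributeError` like this one would escape it, which is consistent with the crash reported here.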


Dotrar commented 1 year ago

forgot to mention, adding in

def readline(self):
    return self.read(100)

fixes the crash and allows imageio to open my image, by the way.

FirefoxMetzger commented 1 year ago

I'm trying to open a url of a larger JPG from an s3 storage. It's unable to ascertain the file-type

How come?

Assuming you pass a URL that ends with a JPEG suffix (e.g. .jpg), ImageIO will try plugins known to read JPEG first and only fall back to trying everything if those "known-to-read-jpeg" plugins fail. If the URL doesn't end with a JPEG suffix, but you know it to be JPEG, you can also pass extension=".jpg" as an additional kwarg to imopen, imread, etc. to set/overwrite the suffix of the resource itself.

it goes for the "try every plugin" approach:

If you already know you want to read using pillow you can also pass plugin="pillow" as an additional kwarg, in which case plugin selection is skipped in favour of using the plugin you specified.
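As a rough sketch of the two suggestions above, assuming the `imageio.v3` API: `read_as_jpeg` and the example URL are made up for illustration, while the `extension` and `plugin` keyword arguments are the ones named in this comment.

```python
def read_as_jpeg(uri, use_pillow=False):
    """Read an image whose URI lacks a helpful suffix (hypothetical helper)."""
    # Imported inside the function so the helper can be defined even
    # where imageio is not installed.
    import imageio.v3 as iio

    if use_pillow:
        # Skip plugin selection entirely and use the pillow plugin.
        return iio.imread(uri, plugin="pillow")
    # Tell imageio to treat the resource as JPEG regardless of its suffix,
    # so the JPEG-capable plugins are tried first.
    return iio.imread(uri, extension=".jpg")


# Usage (placeholder URL, not a real resource):
# img = read_as_jpeg("https://example.com/bucket/object-without-suffix")
# img = read_as_jpeg("https://example.com/bucket/object-without-suffix", use_pillow=True)
```

Either call should avoid the "try every plugin" fallback, provided the resource really is a JPEG that the chosen plugin can decode.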

But SeekableFileObject does not have readline, causing an error.

SeekableFileObject is a binary file and lacks the notion of a "line of text" or an "EOL character (\n)". As such, a function that reads a single line from the file doesn't make much sense in this context, so we didn't implement readline.

Using Pillow ImImagePlugin to read a file, we attempt a readline as seen here @ Pillow

The plugin you are quoting here is the plugin for reading IFUNC Image Memory, not JPEG. I have actually never encountered this format, and Google and the pillow repo are suspiciously silent about it. I suspect that this is a legacy format that is no longer in active use and that should probably not be used further.

def readline(self):
   return self.read(100)

The problem with setting readline to read the first 100 bytes is that this is an arbitrary cutoff. This may or may not mean something for the underlying data of the stream and will most likely leave the head at an awkward and unexpected location.

I am still not convinced we want to add readline, but were we to do so, we would probably want to read the entire file. In my mind "read until you encounter the end of the current line" plus "there is no end of line for binary files" results in "read until the end of the file".
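To make the trade-off concrete, here is a small stdlib-only comparison of where the stream head ends up under each interpretation. The payload and both helper functions are made up for illustration and none of this is imageio code. For reference, Python's own binary streams such as `io.BytesIO` do provide `readline`, which stops after the first `b"\n"`; whether that byte means anything in image data is a separate question.

```python
import io

# Made-up payload: 300 bytes with a single b"\n" at offset 150.
payload = b"\xff" * 150 + b"\n" + b"\x00" * 149


def fixed_chunk_readline(stream, size=100):
    """The workaround from the issue: return an arbitrary fixed-size chunk."""
    return stream.read(size)


def read_to_eof_readline(stream):
    """The 'binary data has no EOL' reading: return everything that is left."""
    return stream.read()


# Option 1: fixed cutoff. The head stops at byte 100 regardless of content.
s = io.BytesIO(payload)
chunk = fixed_chunk_readline(s)
pos_fixed = s.tell()

# Option 2: read to end of file. The head ends at the end of the stream.
s = io.BytesIO(payload)
rest = read_to_eof_readline(s)
pos_eof = s.tell()

# Option 3: the standard library's binary readline, for comparison.
# It returns everything up to and including the first b"\n".
s = io.BytesIO(payload)
line = s.readline()
pos_stdlib = s.tell()

print(pos_fixed, pos_eof, pos_stdlib)
```

All three leave the head somewhere different, so a caller that retries another plugin afterwards would need to seek back to the start in any case.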


Regarding the actual problem of ImageIO being unable to read a large JPEG from S3, are you able to share a code snippet that reproduces the failure?

Walking over all plugins and trying them out is expensive and more of a last resort than something we want to do on a regular basis. If we can avoid this by doing something smart during plugin selection I'd be quite interested in this option.