wagtail / Willow

A wrapper that combines the functionality of multiple Python image libraries into one API
https://willow.wagtail.org/
BSD 3-Clause "New" or "Revised" License
274 stars 53 forks source link

Make image format detection and `SvgImageFile` initialisation more robust #144

Open zerolab opened 8 months ago

zerolab commented 8 months ago

https://pastebin.com/jNWdCZ2N

File "/usr/local/lib/python3.8/site-packages/willow/image.py", line 313, in __init__
cms-app-6b79f8884b-tmjzb cms-app     self.dom = ElementTree.parse(f)
cms-app-6b79f8884b-tmjzb cms-app   File "/usr/local/lib/python3.8/site-packages/defusedxml/ElementTree.py", line 130, in parse
cms-app-6b79f8884b-tmjzb cms-app     return _parse(source, parser)
cms-app-6b79f8884b-tmjzb cms-app   File "/usr/local/lib/python3.8/xml/etree/ElementTree.py", line 1202, in parse
cms-app-6b79f8884b-tmjzb cms-app     tree.parse(source, parser)
cms-app-6b79f8884b-tmjzb cms-app   File "/usr/local/lib/python3.8/xml/etree/ElementTree.py", line 601, in parse
cms-app-6b79f8884b-tmjzb cms-app     parser.feed(data)
cms-app-6b79f8884b-tmjzb cms-app   File "/usr/local/lib/python3.8/xml/etree/ElementTree.py", 

Update: the issue seems to be with https://github.com/wagtail/Willow/blob/830aa3d386fd2ef2aa48c8032ba7d62bf2e04fc1/willow/image.py#L82-L99 (and specifically https://github.com/wagtail/Willow/blob/830aa3d386fd2ef2aa48c8032ba7d62bf2e04fc1/willow/image.py#L86-L87) where the user is uploading a PNG (allegedly)

Stormheg commented 8 months ago

@zerolab It would be helpful to have an image file that causes this bug.

zerolab commented 8 months ago

Yeah.. I have been trying.. https://wagtailcms.slack.com/archives/C81FGJR2S/p1709908265029289

I suspect they are uploading images without an extension, and then there's potentially something with the system that filetype cannot determine the file so Willow thinks it may be svg-ish

zerolab commented 8 months ago

OK, so finally got confirmation. It was images without an extension, so https://github.com/wagtail/Willow/blob/830aa3d386fd2ef2aa48c8032ba7d62bf2e04fc1/willow/image.py#L86-L87 is kicking in

Nigel2392 commented 8 months ago

Honestly - does it matter if we have a file that fails?

I thought he mentioned there was a file extension visible in the database. (his examples also show extension)

We should probably fallback on extensions after checking the mimetype; and then try to guess if it's SVG, no? Seems like a generally good improvement to me.

nqcm commented 8 months ago

We have a custom model inheriting from AbstractImage:

class Asset(AbstractImage):
    file = models.ImageField(
        verbose_name=_('file'),
        upload_to=settings.ASSET_UPLOAD_PREFIX,
        width_field='width',
        height_field='height',
        storage=MediaStorage()
    )

MediaStorage here is using django-gcloud-storage

When I upload an image I get the following error:

xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0

Which is weird as I am uploading a png or jpeg image.I tried different images on different environments so this issue is not arising from a corrupted image file.

After further investigation and @zerolab and @Nigel2392 think the problem is arising when willow is guessing the wrong file type /extension in https://github.com/wagtail/Willow/blob/830aa3d386fd2ef2aa48c8032ba7d62bf2e04fc1/willow/image.py#L86-L87

The code blame shows that this part of the code was added last year. I am upgrading from Wagtail 4.2 to Wagtail 5, which would explain why it never occurred before.

When I run the following:

import filetype
from wagtail.images import get_image_model

Image = get_image_model()

image = Image.objects.get(file="dev/chair-unsplash.jpg")
with image.open_file() as image_file:
    ext = filetype.guess_extension(image_file)
    print(f"Image {image.pk} {image.file} has extension {ext}")

I get Image 149277 dev/chair-unsplash.jpg has extension None

My guess is that it is unable to guess the correct file type for the BLOB when opening a file from gcloud storage.

nqcm commented 8 months ago

I still get the issue even if the extension column in DB is jpeg. The file itself has the extension as well.

nqcm commented 8 months ago

Hi, I am doing some more investigation and here's my findings

In [4]: image = Image.objects.get(file="cms-dev/chair-unsplash_TAJSD0d.jpg")

In [5]: image.file
Out[5]: <ImageFieldFile: cms-dev/chair-unsplash_TAJSD0d.jpg>

When I try to get the path I get unimplemented error, which is understandable as it is being read from Gcloud into a temp fle:

In [11]: image.file.path
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[11], line 1
----> 1 image.file.path

File /usr/local/lib/python3.10/site-packages/django/db/models/fields/files.py:59, in FieldFile.path(self)
     56 @property
     57 def path(self):
     58     self._require_file()
---> 59     return self.storage.path(self.name)

File /usr/local/lib/python3.10/site-packages/django/core/files/storage.py:128, in Storage.path(self, name)
    122 def path(self, name):
    123     """
    124     Return a local filesystem path where the file can be retrieved using
    125     Python's built-in open() function. Storage systems that can't be
    126     accessed using open() should *not* implement this method.
    127     """
--> 128     raise NotImplementedError("This backend doesn't support absolute paths.")

NotImplementedError: This backend doesn't support absolute paths.
  1. When using the open_file method I get the extension is None:
In [21]: with image.open_file() as image_file:
    ...:     ext = filetype.guess_extension(image_file)
    ...:     print(f"Image {image.pk} {image.file} has extension {ext}")
    ...:
[24/Mar/2024 11:28:35] | urllib3.connectionpool._get_conn:292 | DEBUG | Resetting dropped connection: storage.googleapis.com
[24/Mar/2024 11:28:35] | urllib3.connectionpool._make_request:549 | DEBUG | https://storage.googleapis.com:443 "GET /storage/v1/b/staging/o/cms-dev%2Fchair-unsplash_TAJSD0d.jpg?projection=noAcl&prettyPrint=false HTTP/1.1" 200 839
[24/Mar/2024 11:28:36] | urllib3.connectionpool._make_request:549 | DEBUG | https://storage.googleapis.com:443 "GET /download/storage/v1/b/staging/o/cms-dev%2Fchair-unsplash_TAJSD0d.jpg?generation=1710246298001500&alt=media HTTP/1.1" 200 323686
Image 149277 cms-dev/chair-unsplash_TAJSD0d.jpg has extension None
  1. When I open it using default_storage.open method I am getting the extension correctly:
In [22]: with default_storage.open(image.file.name, 'rb') as file:
    ...:     extension = filetype.guess_extension(file.read())
    ...:     print(f"Image {image.pk} {image.file} has extension {extension}")
    ...:
[24/Mar/2024 11:32:10] | urllib3.connectionpool._get_conn:292 | DEBUG | Resetting dropped connection: storage.googleapis.com
[24/Mar/2024 11:32:11] | urllib3.connectionpool._make_request:549 | DEBUG | https://storage.googleapis.com:443 "GET /storage/v1/b/staging/o/cms-dev%2Fchair-unsplash_TAJSD0d.jpg?projection=noAcl&prettyPrint=false HTTP/1.1" 200 839
[24/Mar/2024 11:32:11] | urllib3.connectionpool._make_request:549 | DEBUG | https://storage.googleapis.com:443 "GET /download/storage/v1/b/staging/o/cms-dev%2Fchair-unsplash_TAJSD0d.jpg?generation=1710246298001500&alt=media HTTP/1.1" 200 323686
Image 149277 cms-dev/chair-unsplash_TAJSD0d.jpg has extension jpg
  1. I also get the extension if I do the following:
storage = image.file.storage

with storage.open(image.file.name, 'rb') as file:
    extension = filetype.guess_extension(file.read())
    print(f"Image {image.pk} {image.file} has extension {extension}")

# Image 149277 cms-dev/chair-unsplash_TAJSD0d.jpg has extension jpg

Does that mean that the open_file method is not reading the file correctly?

nqcm commented 8 months ago

Just an update, I was able to solve the issue by making changes to the storage class. The package I was using was downloading the image file from Gcloud and using a SpooledTemporaryFile to store it temporarily. See the code.

I changed it o use a NamedTemporaryFile and willow's open_file method started to read the mime type correctly.

May be someone can add compatibility in Willow for SpooledTemporaryFile.

But thanks for all your time and help. Really appreciate it.