wagtail / Willow

A wrapper that combines the functionality of multiple Python image libraries into one API
https://willow.wagtail.org/
BSD 3-Clause "New" or "Revised" License

Wrong filetype guessed #145

Closed Nigel2392 closed 6 months ago

Nigel2392 commented 6 months ago

Issue Summary

User is getting the following error:

Hello everyone, I am struggling with an issue that appeared after upgrading to Wagtail 5.2. The site works fine, except that when I upload an image and then click on Images in the sidebar, I get the following error:


 Internal Server Error: /admin/images/
 Traceback (most recent call last):
   File "/usr/local/lib/python3.8/site-packages/wagtail/images/models.py", line 424, in get_rendition
     rendition = self.find_existing_rendition(filter)
   File "/usr/local/lib/python3.8/site-packages/wagtail/images/models.py", line 479, in find_existing_rendition
     raise Rendition.DoesNotExist
 blog.models.AssetRendition.DoesNotExist
 During handling of the above exception, another exception occurred:

 Traceback (most recent call last):
   File "/usr/local/lib/python3.8/xml/etree/ElementTree.py", line 1693, in feed
     self.parser.Parse(data, 0)
 xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 0

The complete trace is too long to post here. I have searched Google and the Wagtail issues on GitHub, but apparently no one has had this issue, and I am at a complete loss as to what is happening here. Any help and/or leads in the right direction would be much appreciated. I have Django 3.2 and Python 3.8. I tried downgrading to Wagtail 5.0, but the issue persists.

This error likely originates from here:

https://github.com/wagtail/Willow/blob/830aa3d386fd2ef2aa48c8032ba7d62bf2e04fc1/willow/image.py#L86-L87

When I run the following:
with image.open_file() as image_file:
    ext = filetype.guess_extension(image_file)
    print(f"Image {image.pk} {image.file} has extension {ext}")

I get "Image 149277 cms-dev/chair-unsplash_TAJSD0d.jpg has extension None", although in the DB the extension column has the value jpeg. Can you explain why it is guessing the extension instead of using the one in the file name?

(mimetypes did guess the right type, hurrah)

>>> import mimetypes
>>> print(mimetypes.guess_type('/Users/naveera.muhammad/Downloads/chair-unsplash.jpg'))
('image/jpeg', None)

Steps to Reproduce

As of right now, I cannot reproduce it myself. I'm only here to suggest an improvement to keep this from happening in the future.

You can see that we try to infer the file type from the contents first, and then only check whether the result is None and the file might be XML. We should probably fall back on the file extension first. If someone tampers with that, errors are to be expected.

https://github.com/wagtail/Willow/blob/830aa3d386fd2ef2aa48c8032ba7d62bf2e04fc1/willow/image.py#L82-L99
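The fallback suggested above could look roughly like the sketch below. It is not Willow's code: sniff_extension and guess_image_format are hypothetical names, and the small magic-number table only stands in for a content-based detector such as filetype.guess_extension. The only real library call is the standard library's mimetypes.guess_type, which is also what the quoted user tried.

```python
import io
import mimetypes

# Hypothetical stand-in for a content-based detector; only a few
# well-known magic numbers are listed here for illustration.
_MAGIC_NUMBERS = {
    b"\xff\xd8\xff": "jpg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_extension(f):
    """Guess an extension from the first bytes, restoring the stream position."""
    pos = f.tell()
    f.seek(0)
    header = f.read(16)
    f.seek(pos)
    for magic, ext in _MAGIC_NUMBERS.items():
        if header.startswith(magic):
            return ext
    return None


def guess_image_format(f, filename=None):
    """Prefer the content-based guess; fall back on the file name if it fails."""
    ext = sniff_extension(f)
    if ext is None and filename:
        mime, _encoding = mimetypes.guess_type(filename)
        if mime and mime.startswith("image/"):
            # "image/jpeg" -> "jpeg"; a real implementation would still need
            # to map names such as "svg+xml" onto the formats it supports.
            ext = mime.split("/", 1)[1]
    return ext
```

With this ordering, a file whose contents cannot be identified but whose name ends in .jpg would be treated as a JPEG instead of falling through to the XML check.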

Technical details

Mentioned in the quote above. I don't know how long this will be available for, but 90 days should be enough to resolve this issue. Though it might be the reason for this issue report, we should generally have provided a better fallback, IMO.

Slack Thread

nqcm commented 6 months ago

We have a custom model inheriting from AbstractImage:

class Asset(AbstractImage):
    file = models.ImageField(
        verbose_name=_('file'),
        upload_to=settings.ASSET_UPLOAD_PREFIX,
        width_field='width',
        height_field='height',
        storage=MediaStorage()
    )

MediaStorage here is using django-gcloud-storage

When I upload an image I get the following error:

xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0

Which is weird as I am uploading a png or jpeg image.

After further investigation and help from the Wagtail Slack channel, I think the problem arises when Willow guesses the wrong file type/extension in https://github.com/wagtail/Willow/blob/830aa3d386fd2ef2aa48c8032ba7d62bf2e04fc1/willow/image.py#L86-L87

The code blame shows that this part of the code was added last year. I am upgrading from Wagtail 4.2 to Wagtail 5, which would explain why it never occurred before.
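A small standalone sketch of why a mis-detected JPEG ends in an XML error: if the image bytes are handed to the standard library's XML parser (as happens when a file is assumed to be SVG), ElementTree rejects them with a ParseError. The byte string below is only a JPEG-like header made up for illustration, not a real image.

```python
import io
import xml.etree.ElementTree as ET

# The first bytes of a typical JPEG file (SOI marker followed by a JFIF tag).
jpeg_like = io.BytesIO(b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01")

raised = False
message = ""
try:
    # Treating binary image data as XML cannot succeed.
    ET.parse(jpeg_like)
except ET.ParseError as exc:
    raised = True
    message = str(exc)

print(raised, message)
```

So the ParseError does not mean the upload is corrupt; it only means the file was routed to an XML parser.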

When I run the following:

import filetype
from wagtail.images import get_image_model

Image = get_image_model()

image = Image.objects.get(file="dev/chair-unsplash.jpg")
with image.open_file() as image_file:
    ext = filetype.guess_extension(image_file)
    print(f"Image {image.pk} {image.file} has extension {ext}")

I get Image 149277 dev/chair-unsplash.jpg has extension None

My guess is that it is unable to guess the correct file type for the BLOB when opening a file from gcloud storage.
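One way to test that guess is to look at what the storage backend actually returns before any detection runs. The helper below is a hypothetical diagnostic (describe_header is not part of Wagtail, Willow or filetype), and it assumes the file object supports seek() and tell(), which may not hold for every storage backend.

```python
import io


def describe_header(f, length=16):
    """Report the stream position and first bytes, leaving the position unchanged."""
    pos = f.tell()
    f.seek(0)
    header = f.read(length)
    f.seek(pos)
    return {
        "position": pos,  # a non-zero value means earlier code already read from f
        "header_hex": header.hex(" "),
        "looks_like_jpeg": header.startswith(b"\xff\xd8\xff"),
        "looks_like_png": header.startswith(b"\x89PNG\r\n\x1a\n"),
    }


# Example with an in-memory stream that has already been partially read.
sample = io.BytesIO(b"\xff\xd8\xff\xe0" + b"\x00" * 20)
sample.read(4)
report = describe_header(sample)
print(report)
```

In the shell session above, calling describe_header(image_file) inside the "with image.open_file()" block would show whether the stream starts at position 0 and whether the first bytes carry a JPEG or PNG signature at all.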