qld-gov-au / ckanext-s3filestore

Use Amazon S3 as a filestore for CKAN
GNU Affero General Public License v3.0
7 stars 4 forks source link

XLSX files are not detected correctly #87

Closed jeverling closed 2 years ago

jeverling commented 2 years ago

Hi all, XLSX files are detected as Zip files when ckanext-s3filestore is active. The reason is, that the extension only looks at the first 512 bytes of the uploaded file: https://github.com/qld-gov-au/ckanext-s3filestore/blob/cf0c5bd/ckanext/s3filestore/uploader.py#L545

The part that differentiates a XLSX from a regular Zip comes later (when, depends on the file as well):

In [1]: import magic
In [2]: mime = magic.Magic(mime=True)
In [3]: f = open("empty.xlsx", "rb")
In [4]: mime.from_buffer(f.read(512))
Out[4]: 'application/zip'

In [5]: f.seek(0)
Out[5]: 0

In [6]: mime.from_buffer(f.read(1000))
Out[6]: 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'

Since the amount of bytes you have to read before python-magic determines it's a XLSX changes with XLSX size/complexity, you probably have to pass the whole file to be sure it works reliably.

CKAN by default tries to look at the file extension to determine the mimetype (config ckan.mimetype_guess = file_ext), or reads the whole file (ckan.mimetype_guess = file_contents): https://github.com/ckan/ckan/blob/0a596b8/ckan/lib/uploader.py#L274

What do you think is the best way to fix this? To behave the same way as CKAN does, or stick to magic.from_buffer but read the whole file? Performance wise that probably won't do any harm. Or use magic.from_file/from_descriptor?

ThrawnCA commented 2 years ago

Ah, thanks for identifying this. We've had a few similar reports, mostly involving DOCX files. Do you find that re-uploading the same file resolves it?

Perhaps we could use magic.from_file first, since it's fast and should work best for recognised types, then type sniffing as a backup if the filename isn't recognised.

jeverling commented 2 years ago

Hm no, re-uploading doesn't seem to change anything.

Perhaps we could use magic.from_file first, since it's fast and should work best for recognised types, then type sniffing as a backup if the filename isn't recognised.

Sounds good to me!

ThrawnCA commented 2 years ago

I think that re-uploading works for us because we're also using https://github.com/qld-gov-au/ckanext-resource-type-validation/

This fix will likely be prioritised, as it's affecting our clients.

ThrawnCA commented 2 years ago

https://github.com/qld-gov-au/ckanext-s3filestore/releases/tag/0.7.7-qgov.2 should fix this.

jeverling commented 2 years ago

Thanks a lot! I can confirm it works! In case this is useful for others: the XLS icon was still not displayed for me, and format was still set to application/zip, even though the mimetype was set correctly now. Turns out the Docker image didn't have mime-types configured, so while python-magic recognized the mimetype correctly now, the code that sets the format (validators.py#L799) couldn't detect the mimetype. Installing the mime-support package from the debian repositories in the image solved that issue for me.

jeverling commented 2 years ago

BTW, another issue for us is that XLSX files aren't pushed to the CKAN DataStore (anymore). This seems to be due to xlrd removing support for XLSX files: https://github.com/ckan/datapusher/issues/232 xlrd is used by messytables, which is also not maintained anymore. We switched to https://github.com/ckan/ckanext-xloader from datapusher, but it has the same problem. There is a successor to messytables called frictionless, probably the best way would be to switch to that.

Are you using datapusher or xloader to submit tabular data to the DataStore?

ThrawnCA commented 2 years ago

We use XLoader. I see you've raised this there, at https://github.com/ckan/ckanext-xloader/issues/161

As far as I can see, it's not related to ckanext-s3filestore? But if you want to coordinate with us about it, we do have an XLoader fork, https://github.com/qld-gov-au/ckanext-xloader

jeverling commented 2 years ago

It's not related to ckanext-s3filestore at all. Thanks, I will check out your fork of XLoader.