Closed jeverling closed 2 years ago
Ah, thanks for identifying this. We've had a few similar reports, mostly involving DOCX files. Do you find that re-uploading the same file resolves it?
Perhaps we could use magic.from_file
first, since it's fast and should work best for recognised types, then type sniffing as a backup if the filename isn't recognised.
Hm no, re-uploading doesn't seem to change anything.
Perhaps we could use magic.from_file first, since it's fast and should work best for recognised types, then type sniffing as a backup if the filename isn't recognised.
Sounds good to me!
I think that re-uploading works for us because we're also using https://github.com/qld-gov-au/ckanext-resource-type-validation/
This fix will likely be prioritised, as it's affecting our clients.
Thanks a lot! I can confirm it works! In case this is useful for others: the XLS icon was still not displayed for me, and format was still set to application/zip
, even though the mimetype was set correctly now.
Turns out the Docker image didn't have mime-types configured, so while python-magic recognized the mimetype correctly now, the code that sets the format (validators.py#L799) couldn't detect the mimetype. Installing the mime-support
package from the debian repositories in the image solved that issue for me.
BTW, another issue for us is that XLSX files aren't pushed to the CKAN DataStore (anymore). This seems to be due to xlrd removing support for XLSX files: https://github.com/ckan/datapusher/issues/232 xlrd is used by messytables, which is also not maintained anymore. We switched to https://github.com/ckan/ckanext-xloader from datapusher, but it has the same problem. There is a successor to messytables called frictionless, probably the best way would be to switch to that.
Are you using datapusher or xloader to submit tabular data to the DataStore?
We use XLoader. I see you've raised this there, at https://github.com/ckan/ckanext-xloader/issues/161
As far as I can see, it's not related to ckanext-s3filestore? But if you want to coordinate with us about it, we do have an XLoader fork, https://github.com/qld-gov-au/ckanext-xloader
It's not related to ckanext-s3filestore at all. Thanks, I will check out your fork of XLoader.
Hi all, XLSX files are detected as Zip files when ckanext-s3filestore is active. The reason is, that the extension only looks at the first 512 bytes of the uploaded file: https://github.com/qld-gov-au/ckanext-s3filestore/blob/cf0c5bd/ckanext/s3filestore/uploader.py#L545
The part that differentiates a XLSX from a regular Zip comes later (when, depends on the file as well):
Since the amount of bytes you have to read before python-magic determines it's a XLSX changes with XLSX size/complexity, you probably have to pass the whole file to be sure it works reliably.
CKAN by default tries to look at the file extension to determine the mimetype (config
ckan.mimetype_guess = file_ext
), or reads the whole file (ckan.mimetype_guess = file_contents
): https://github.com/ckan/ckan/blob/0a596b8/ckan/lib/uploader.py#L274What do you think is the best way to fix this? To behave the same way as CKAN does, or stick to
magic.from_buffer
but read the whole file? Performance wise that probably won't do any harm. Or usemagic.from_file/from_descriptor
?