openzim / gutenberg

Scraper for downloading the entire ebooks repository of project Gutenberg
https://download.kiwix.org/zim/gutenberg
GNU General Public License v3.0

Incorrect MIME type for EPUB directory entries in recent Gutenberg ZIM #181

Closed: Jaifroid closed this issue 1 year ago

Jaifroid commented 1 year ago

I was confused by the fact that, recently, downloading EPUBs from a Gutenberg ZIM in Kiwix JS-family apps is producing a ZIP archive instead of an EPUB. This never used to happen. I've now found an older Gutenberg ZIM (from 2018), and have compared the stated MIME types. EPUB dirEntries in the old (version 0) ZIM are application/epub+zip (see BOTTOM screenshot) -- this is correct. However, in the same language (version 1) ZIM scraped last month, the MIME type is application/zip (see TOP screenshot + accompanying EPUB dirEntry).

Because we use the MIME type to determine the download type, the incorrect MIME type causes an issue for us. Possibly this issue hasn't been picked up in other readers because they may rely on the file extension, or give more weight to it?
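To illustrate the difference between the two reader strategies mentioned above, here is a minimal Python sketch. It is not code from any Kiwix reader: the function names, the `MIME_TO_EXT` table, and the example entry path are all hypothetical. It only shows why a reader that trusts the stored MIME type saves a `.zip`, while one that gives weight to the entry's extension still saves an `.epub`.

```python
import posixpath

# Hypothetical lookup table; a real reader would carry a much longer one.
MIME_TO_EXT = {
    "application/epub+zip": ".epub",
    "application/zip": ".zip",
    "application/pdf": ".pdf",
}


def filename_from_mime_only(entry_path: str, mime_type: str) -> str:
    """Pick the download name from the stored MIME type alone."""
    stem, ext = posixpath.splitext(posixpath.basename(entry_path))
    return stem + MIME_TO_EXT.get(mime_type, ext)


def filename_with_extension_hint(entry_path: str, mime_type: str) -> str:
    """Same as above, but keep a .epub extension when the MIME type is
    only the generic ZIP container type."""
    stem, ext = posixpath.splitext(posixpath.basename(entry_path))
    if mime_type == "application/zip" and ext.lower() == ".epub":
        return stem + ".epub"
    return stem + MIME_TO_EXT.get(mime_type, ext)
```

With an entry stored as `application/zip`, the first function yields a `.zip` download and the second keeps `.epub`; with the correct `application/epub+zip` type, both agree.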

The main difference between these two ZIMs is that the old one is a version 0 ZIM, and the new one is a version 1 ZIM. That may or may not be relevant.

INCORRECT MIME TYPE in 2023 ZIM:

image image

CORRECT MIME TYPE in 2018 ZIM:

image

rgaudin commented 1 year ago

Indeed, with this ZIM, it looks like all epubs are stored as application/zip. This is a regression from the switch to pylibzim as the mime is now guessed by libmagic…

MIME type                      Count
application/javascript         1487
application/pdf                6
application/vnd.ms-fontobject  1
application/vnd.ms-opentype    1
application/zip                794
audio/midi                     7
audio/mpeg                     21
font/sfnt                      10
font/woff                      1
image/gif                      50
image/jpeg                     7318
image/png                      3722
image/svg+xml                  1
image/vnd.microsoft.icon       1
text/css                       13
text/html                      1958
text/plain                     1
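For context on why content sniffing can report `application/zip` here: an EPUB is a ZIP container, so its bytes begin with the ZIP signature, and the EPUB-specific type is declared in an archive entry named `mimetype`. The sketch below is an illustration only and is not necessarily the fix applied in the scraper. `refine_mime` and `build_minimal_epub_container` are hypothetical helpers, and the `sniffed` argument stands in for whatever a content sniffer returned.

```python
import io
import zipfile

EPUB_MIME = "application/epub+zip"


def build_minimal_epub_container() -> bytes:
    """Build an in-memory ZIP holding only the EPUB 'mimetype' entry
    (not a complete, valid EPUB; just enough for this demonstration)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", EPUB_MIME, compress_type=zipfile.ZIP_STORED)
    return buf.getvalue()


def refine_mime(data: bytes, sniffed: str) -> str:
    """If the sniffer only saw a generic ZIP, check whether the archive
    declares itself as an EPUB through its 'mimetype' entry."""
    if sniffed != "application/zip":
        return sniffed
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            declared = zf.read("mimetype").decode("ascii").strip()
    except (KeyError, zipfile.BadZipFile, UnicodeDecodeError):
        # No 'mimetype' entry, not a ZIP at all, or unreadable declaration.
        return sniffed
    return declared if declared == EPUB_MIME else sniffed
```

An alternative with the same effect would be to set the MIME type explicitly from the known `.epub` extension when adding the entry, rather than relying on guessing at all.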

Fixing it now

Jaifroid commented 1 year ago

Thanks for the super-quick fix!