openzim / mwoffliner

Mediawiki scraper: all your wiki articles in one highly compressed ZIM file
https://www.npmjs.com/package/mwoffliner
GNU General Public License v3.0

Scraping dies with "Input buffer contains unsupported image format" if logo returns 301 #2028

Open benoit74 opened 3 months ago

benoit74 commented 3 months ago

Zimfarm recipe: https://farm.openzim.org/recipes/encyclopediaofmath.org_en_all

Zim-request details: https://github.com/openzim/zim-requests/issues/964#issuecomment-2124736481

Log:

[error] [2024-05-22T12:56:44.462Z] Failed to run mwoffliner after [18s]: {
    "stack": "Error: Input buffer contains unsupported image format",
    "message": "Input buffer contains unsupported image format"
}
[error] [2024-05-22T12:56:44.462Z] 

**********

Input buffer contains unsupported image format

**********

It is pretty hard to tell which image was fetched and could not be read, which makes the problem harder to pin down and fix.

audiodude commented 3 months ago

From googling, it looks like that's an error message from the sharp library.

Probably happening here?

https://github.com/openzim/mwoffliner/blob/9cc613fb246150e25df68041332a869810018cf9/src/Downloader.ts#L488

Might just be a bad URL that's getting erroneously read as image data.

audiodude commented 3 months ago

The proximate cause of the error is that the URL

https://encyclopediaofmath.org/common/spr_logo.gif

redirects (301) to https://encyclopediaofmath.org/wiki/Main_Page, which is an HTML page rather than an image.

Presumably, this "logo" link is somewhere in the initial metadata that mwoffliner gathers about the wiki. So the bug is that mwoffliner is hardcoded to download this as image data and doesn't consider 301 to be an error status. It then crashes early on in the scraping process.
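To make that failure mode concrete, here is a minimal sketch of a check that could run on the logo response before the bytes are handed to the image library. It assumes the HTTP client follows the redirect and returns the main page's HTML body. `FetchedResource` and `describeLogoProblem` are hypothetical names, not part of mwoffliner, and the `text/html` content type in the usage example is an assumption about what the wiki serves.

```typescript
// Hypothetical shape: only the response fields this check needs.
interface FetchedResource {
  requestedUrl: string;
  finalUrl: string; // URL after any redirects were followed
  status: number;
  contentType: string | null;
}

// Returns a human-readable problem description, or null if the
// response plausibly contains an image.
function describeLogoProblem(res: FetchedResource): string | null {
  if (res.status < 200 || res.status >= 300) {
    return `${res.requestedUrl} returned HTTP ${res.status}`;
  }
  // Drop parameters such as "; charset=UTF-8" before comparing.
  const [rawMime = ""] = (res.contentType ?? "").split(";");
  const mime = rawMime.trim().toLowerCase();
  if (!mime.startsWith("image/")) {
    const redirected =
      res.finalUrl !== res.requestedUrl ? ` (redirected to ${res.finalUrl})` : "";
    return `${res.requestedUrl}${redirected} returned "${mime || "unknown"}" instead of an image`;
  }
  return null;
}

// Usage with the values from this issue (content type assumed).
const problem = describeLogoProblem({
  requestedUrl: "https://encyclopediaofmath.org/common/spr_logo.gif",
  finalUrl: "https://encyclopediaofmath.org/wiki/Main_Page",
  status: 200,
  contentType: "text/html; charset=UTF-8",
});
console.log(problem);
```

A message like this would also name the offending URL, which the current log output does not.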

audiodude commented 3 months ago

If the folks putting in the request would like to fix their problem without waiting for mwoffliner, I would suggest putting an image (even a 1x1 PNG) at that URL.

benoit74 commented 3 months ago

Thank you!

Would specifying a custom favicon with --customZimFavicon make it possible to bypass the code that tries to download the "logo"?

audiodude commented 3 months ago

Yes, I believe that would work

kelson42 commented 2 months ago

IMHO the solution here is to check the image format early, at the same time as the other ZIM metadata is validated.
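As a rough sketch of what such an early check could look like, the snippet below identifies an image by its leading "magic" bytes and fails with a message that names the source URL. `sniffImageFormat` and `assertLogoIsImage` are hypothetical helpers, not existing mwoffliner functions, and the list of accepted formats is only an assumption about what a logo might be.

```typescript
// Hypothetical helpers, not part of mwoffliner.
type ImageFormat = "png" | "gif" | "jpeg" | "webp";

// True if `data` contains `signature` starting at `offset`.
function startsWith(data: Uint8Array, signature: number[], offset = 0): boolean {
  if (data.length < offset + signature.length) return false;
  return signature.every((byte, i) => data[offset + i] === byte);
}

// Identify a few common image formats by their file signatures.
function sniffImageFormat(data: Uint8Array): ImageFormat | null {
  if (startsWith(data, [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return "png";
  if (startsWith(data, [0x47, 0x49, 0x46, 0x38])) return "gif"; // "GIF8"
  if (startsWith(data, [0xff, 0xd8, 0xff])) return "jpeg";
  // WebP: "RIFF", a 4-byte size, then "WEBP"
  if (
    startsWith(data, [0x52, 0x49, 0x46, 0x46]) &&
    startsWith(data, [0x57, 0x45, 0x42, 0x50], 8)
  ) {
    return "webp";
  }
  return null;
}

// Fail early, naming the URL, instead of crashing later inside the image library.
function assertLogoIsImage(data: Uint8Array, sourceUrl: string): ImageFormat {
  const format = sniffImageFormat(data);
  if (format === null) {
    throw new Error(`Logo at ${sourceUrl} is not a recognised image (PNG, GIF, JPEG or WebP)`);
  }
  return format;
}
```

Checking the bytes rather than only the HTTP status means the same validation would also catch a custom favicon file that is not actually an image.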