openzim / zim-requests

Want a new ZIM file? Propose ZIM content improvements or fixes? Here you are!
https://farm.openzim.org

solar.lowtechmagazine.com high res images are missing #1027

Open benoit74 opened 3 weeks ago

benoit74 commented 3 weeks ago

Clicking on the icon for non-dithered images (beneath each image in an article) doesn't work. The higher-res images haven't been scraped and show a missing image placeholder.

We should develop a custom behavior to scrape these images.

benoit74 commented 3 weeks ago

Or we could decide that high-res images are not worth it: they would increase the ZIM size significantly (a rough estimate is that the size would at least double). In that case we would just add custom CSS to hide the button that lets users grab the full-res images.

Jaifroid commented 3 weeks ago

The original lowtechmagazine (when it wasn't on a solar-powered server) had full-res images. The last version I have of that is 343MB, compared with 310MB for this ZIM. Although I don't think that one was multilingual (presumably text is well compressed in the ZIM, so might not make much difference). Dithered images are small by design. I'm not necessarily arguing for having hi-res images, but users might expect to be able to view those images and may feel the ZIM is "broken" if they click those full-res buttons and get a broken image placeholder.

@kelson42 and @Popolechien should make a decision on this, but it shouldn't delay the release of Zimit2.

benoit74 commented 3 weeks ago

If we prefer to keep the ZIM small and hence not include the full-res images, then we obviously have to hide the full-res buttons.

kelson42 commented 3 weeks ago

I have the feeling we are not approaching the problem from the right angle. First of all, the default behaviour should respect the original Web page, so here we should try to reproduce the original behaviour.

From the data perspective, the original data is stored in a standard img attribute, data-original. Therefore we should download/rewrite that URL. There is nothing I understand as "custom" to this Web site.

benoit74 commented 3 weeks ago

Sorry, I've probably been a bit too fast in my explanations.

The high res images are not shown because they are not inside the ZIM.

They are not inside the ZIM because they are not inside the WARC.

They are not inside the WARC because the crawler's browser never loads them: they are not needed until the data-original attribute is "transferred" to the img src when a user clicks the button.
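To make that last step concrete, here is a minimal sketch of the kind of swap described above. It is a hypothetical stand-in, not the site's actual script: the function name `showOriginal` and the file names are made up, and a plain object plays the role of the `<img>` element so the snippet also runs outside a browser.

```javascript
// Hypothetical stand-in for the site's button handler (not its real code).
function showOriginal(img) {
  // dataset.original mirrors the data-original attribute of the element.
  const original = img.dataset.original;
  if (original) {
    // In a real browser, this assignment is what triggers the download.
    // Until it happens, the full-res URL is never requested.
    img.src = original;
  }
  return img.src;
}

// Plain object standing in for an <img> element; file names are invented.
const img = {
  src: "dithered/panel.png",
  dataset: { original: "images/panel.jpg" },
};

console.log(img.src);           // the dithered image the crawler does load
console.log(showOriginal(img)); // the full-res URL, only set after the "click"
```

Since the crawler never performs the click, the second URL is never requested during the crawl and so never reaches the WARC.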

The only (standard) way to get the images into the WARC is to have the browser load them, and the (standard) way to do that is via what the Webrecorder team calls a custom behavior. A custom behavior is a small piece of JS code that is run by the crawler. The crawler already uses behaviors by default to simulate scrolling (autoscroll), play videos (autoplay), and fetch img srcsets and stylesheets (autofetch). There is no standard crawler behavior that explores these data-xxx attributes.

We could either create a custom behavior specific to this website (just simulating the user's click on the given button: easy to develop, but not reusable) or a more generic one that loads all data-xxx custom attributes whose value looks like a URL. The generic one might be a bit tricky, since the data could be anything and one has to ensure it is a real URL (absolute or relative), but it would be reusable across many websites and should probably be developed directly in the crawler codebase.
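As a rough illustration of the generic option, the sketch below shows one possible heuristic for deciding whether a data-xxx value "looks like a URL" and then requesting it from inside the page. Everything here is an assumption made for illustration: the helper names (`resolveCandidateUrl`, `collectDataUrls`, `fetchDataUrlsInPage`), the list of image extensions, and the idea that requests issued by the page end up recorded by the crawler. It is not existing crawler code and does not show how the crawler packages a behavior.

```javascript
// Hypothetical helpers; none of these names exist in the crawler codebase.

// Assumption: bare relative values only count if they end in an image extension.
const RESOURCE_EXT = /\.(png|jpe?g|gif|webp|avif|svg)(\?.*)?$/i;

// Return an absolute http(s) URL if `value` plausibly is a URL, else null.
function resolveCandidateUrl(value, baseUrl) {
  const trimmed = value.trim();
  // Reject empty values, free text containing whitespace, and JSON-like blobs.
  if (!trimmed || /\s/.test(trimmed) || /^[{\[]/.test(trimmed)) return null;
  // "Explicit" URLs: http(s)://, protocol-relative //, or /, ./ and ../ paths.
  const explicit =
    /^(https?:)?\/\//i.test(trimmed) || /^\.{0,2}\//.test(trimmed);
  if (!explicit && !RESOURCE_EXT.test(trimmed)) return null;
  let url;
  try {
    url = new URL(trimmed, baseUrl);
  } catch {
    return null; // not parseable as a URL at all
  }
  return url.protocol === "http:" || url.protocol === "https:" ? url.href : null;
}

// `attributes` is an iterable of {name, value} pairs,
// e.g. Array.from(element.attributes) in a browser.
function collectDataUrls(attributes, baseUrl) {
  const urls = new Set();
  for (const { name, value } of attributes) {
    if (!name.startsWith("data-")) continue;
    const resolved = resolveCandidateUrl(value, baseUrl);
    if (resolved) urls.add(resolved);
  }
  return [...urls];
}

// Browser-only part: request every candidate so the browser actually loads it.
// Not called here, so this file also loads under Node.
async function fetchDataUrlsInPage() {
  const found = new Set();
  for (const el of document.querySelectorAll("*")) {
    const attrs = Array.from(el.attributes);
    for (const url of collectDataUrls(attrs, document.baseURI)) found.add(url);
  }
  // allSettled: one failing request should not abort the others.
  await Promise.allSettled([...found].map((url) => fetch(url)));
  return [...found];
}

// Example with plain objects standing in for element attributes.
console.log(
  collectDataUrls(
    [
      { name: "data-original", value: "/images/full.png" },
      { name: "data-count", value: "12" },
    ],
    "https://example.org/post/"
  )
);
```

The heuristic is deliberately conservative: a value such as `12` or a JSON blob is ignored, while `/images/full.png` resolves against the page URL. A site-specific behavior would skip all of this and simply click the button on each image.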

Once the image is inside the WARC, everything else will work as expected, with no need to modify rewriting. Here rewriting is done with wombat (i.e. dynamic rewriting), hence the URL in the data-original attribute has not been rewritten yet; this is intended, to avoid double rewriting and simplify rewriting.

> First of all, the default behaviour should respect the original Web page so here we should try to reproduce the original behaviour.

I'm sorry, but I don't get what this means. Do you mean that hiding the button should only be a last-resort solution, if nothing else is reasonably possible?