openzim / mwoffliner

MediaWiki scraper: all your wiki articles in one highly compressed ZIM file
https://www.npmjs.com/package/mwoffliner
GNU General Public License v3.0

Icons in German Wikivoyage are not scraped and still point to upload.wikimedia.org #1807

Open Jaifroid opened 1 year ago

Jaifroid commented 1 year ago

This may be a similar issue to #1215. Very common icons on almost every page of German Wikivoyage are produced by background-image URLs in CSS that is embedded in the article page. In library.kiwix.org, e.g. here, the icons appear to show correctly, but they are in fact being drawn from the Web. (The new iframe sandbox will probably stop them from being accessed shortly.)

Here is a dev tools snapshot of the above page, showing the issue (the URLs displayed on the right come from the in-page CSS, and correspond to the small icons visible on the left):

[screenshot: dev tools snapshot showing the icon URLs pointing at upload.wikimedia.org]

On Kiwix JS and KJSWL, the page is blocked from accessing the Web, hence the icons are not shown, as I reported in https://github.com/kiwix/kiwix-js/issues/975. I thought it was a problem with the JS readers, but it turns out that they are working correctly, and the problem is that these icons have not been scraped and included in the ZIM.
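For anyone trying to reproduce the report, the symptom can be checked mechanically: scan a saved article's HTML for `background-image` declarations whose URL still points at the live upload host. A minimal sketch, not mwoffliner code; the function name, regex, and default host are illustrative assumptions:

```javascript
// Collect background-image URLs in an HTML string that still point at an
// external host (i.e. were not rewritten to in-ZIM paths by the scraper).
// Illustrative sketch only; a real check would use a CSS parser.
function findExternalBackgroundImages(html, host = 'upload.wikimedia.org') {
  const urls = [];
  const re = /background-image\s*:\s*url\(\s*["']?([^"')]+)["']?\s*\)/gi;
  let match;
  while ((match = re.exec(html)) !== null) {
    if (match[1].includes(host)) urls.push(match[1]);
  }
  return urls;
}
```

Running this over a page extracted from the German Wikivoyage ZIM should, per this report, return a non-empty list.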

kelson42 commented 1 year ago

Yes, this is a duplicate

kelson42 commented 1 year ago

Hmm... or maybe not. @pavel-karatsiuba Can you check please?

Jaifroid commented 1 year ago

If it's a duplicate, then I think this is a more general case of #1215 with more information about the issue, so I would suggest closing #1215 in favour of this one, as it's not just about PDF icons. Of course, PDFs might be a separate issue; I didn't investigate that specific case.

pavel-karatsiuba commented 1 year ago

This is a duplicate, but a more general case: not only the PDF icons were skipped, but other icons too. The problem is where the icons are referenced: their paths appear in CSS background-image rules, which is why they are not processed.

kelson42 commented 1 year ago

@pavel-karatsiuba Oh, then this is probably easy to fix, can you confirm?

pavel-karatsiuba commented 1 year ago

@kelson42 Handling such URLs is not hard to implement, but we would need an additional library to parse CSS rules. We could do it as follows:

  1. with domino lib select all
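The enumerated plan is cut off above, but the rewriting step it leads toward can be sketched independently. This is only an illustration using a regex instead of a real CSS parser (the proposal mentions the domino lib plus an additional CSS library); `rewriteBackgroundImages` and `toLocalPath` are hypothetical names, not mwoffliner API:

```javascript
// Sketch: rewrite background-image URLs in a CSS/style string to local paths.
// Assumes the referenced images have already been downloaded into the ZIM;
// toLocalPath is a caller-supplied mapping from remote URL to in-ZIM path.
function rewriteBackgroundImages(css, toLocalPath) {
  return css.replace(
    /(background-image\s*:\s*url\(\s*["']?)([^"')]+)(["']?\s*\))/gi,
    (whole, prefix, url, suffix) => prefix + toLocalPath(url) + suffix
  );
}
```

In a real implementation the extraction and rewriting would operate on parsed CSS declarations (so that edge cases like multiple backgrounds or escaped characters are handled), with domino used to select the elements and style blocks to inspect.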