openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0

Retrieving upstream favicon to set illustration 48x48 is not smart enough #352

Closed kelson42 closed 1 month ago

kelson42 commented 1 month ago

Scraping https://womenshistory.si.edu/ with an extensive set of good favicon/illustrations:

<link rel="icon" sizes="16x16" href="https://womenshistory.si.edu//sites/default/themes/si_sawhm/favicons/favicon-16x16.png" />
<link rel="icon" sizes="32x32" href="https://womenshistory.si.edu//sites/default/themes/si_sawhm/favicons/favicon-32x32.png" />
<link rel="icon" sizes="96x96" href="https://womenshistory.si.edu//sites/default/themes/si_sawhm/favicons/favicon-96x96.png" />
<link rel="icon" sizes="192x192" href="https://womenshistory.si.edu//sites/default/themes/si_sawhm/favicons/android-chrome-192x192.png" />
<link rel="apple-touch-icon" href="https://womenshistory.si.edu//sites/default/themes/si_sawhm/favicons/apple-icon-60x60.png" />
<link rel="apple-touch-icon" sizes="72x72" href="https://womenshistory.si.edu//sites/default/themes/si_sawhm/favicons/apple-icon-72x72.png" />
<link rel="apple-touch-icon" sizes="76x76" href="https://womenshistory.si.edu//sites/default/themes/si_sawhm/favicons/apple-icon-76x76.png" />
<link rel="apple-touch-icon" sizes="114x114" href="https://womenshistory.si.edu//sites/default/themes/si_sawhm/favicons/apple-icon-114x114.png" />
<link rel="apple-touch-icon" sizes="120x120" href="https://womenshistory.si.edu//sites/default/themes/si_sawhm/favicons/apple-icon-120x120.png" />
<link rel="apple-touch-icon" sizes="144x144" href="https://womenshistory.si.edu//sites/default/themes/si_sawhm/favicons/apple-icon-144x144.png" />
<link rel="apple-touch-icon" sizes="152x152" href="https://womenshistory.si.edu//sites/default/themes/si_sawhm/favicons/apple-icon-152x152.png" />
<link rel="apple-touch-icon" sizes="180x180" href="https://womenshistory.si.edu//sites/default/themes/si_sawhm/favicons/apple-icon-180x180.png" />
<link rel="apple-touch-icon-precomposed" sizes="180x180" href="https://womenshistory.si.edu//sites/default/themes/si_sawhm/favicons/apple-icon-precomposed.png" />

... but warc2zim does not seem able to find a good one: [image]

I suspect it does not look at these newer favicon entries.

Might be related to https://github.com/openzim/warc2zim/issues/120

kelson42 commented 1 month ago

... but maybe we face a bigger scraping problem here: [image]

benoit74 commented 1 month ago

This looks like a crawling issue, due to something which detected the crawler and prevented it from operating. Nothing we can fix at code level.

benoit74 commented 1 month ago

Looks like using "Pixel 5" as mobile device is not triggering the "protection".

kelson42 commented 1 month ago

I believe the favicon taken as illustration is not of a high enough resolution.

benoit74 commented 1 month ago

I confirm that the scraper takes the first "icon" entry. It should instead take the declared "sizes" into account and select either the 48x48 icon or the biggest one, so that we resize from the highest resolution available. This avoids the side effects of downsizing from something only slightly larger than 48x48, where fractions of pixels have to be resampled.
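The selection rule described above could be sketched roughly as follows. This is only an illustration using the Python standard library, not warc2zim's actual code: `IconLinkParser`, `parse_sizes` and `pick_icon` are hypothetical names, and the sketch assumes that only `<link>` tags whose `rel` contains the `icon` token are candidates (so `apple-touch-icon` entries are ignored).

```python
from html.parser import HTMLParser


def parse_sizes(value):
    """Return the largest (width, height) declared in a sizes attribute, or None."""
    best = None
    for token in (value or "").lower().split():
        width, sep, height = token.partition("x")
        if sep and width.isdigit() and height.isdigit():
            size = (int(width), int(height))
            if best is None or size[0] * size[1] > best[0] * best[1]:
                best = size
    return best


class IconLinkParser(HTMLParser):
    """Collect (href, size) for every <link> whose rel contains the "icon" token."""

    def __init__(self):
        super().__init__()
        self.icons = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        href = attr.get("href")
        if "icon" in rels and href:
            self.icons.append((href, parse_sizes(attr.get("sizes"))))


def pick_icon(html, target=48):
    """Prefer an exact target x target icon, else the biggest declared one,
    else fall back to the first icon found (hypothetical helper)."""
    parser = IconLinkParser()
    parser.feed(html)
    if not parser.icons:
        return None
    sized = [(href, size) for href, size in parser.icons if size]
    for href, size in sized:
        if size == (target, target):
            return href
    if sized:
        # Biggest area first, so the later downscale starts from the most pixels.
        return max(sized, key=lambda item: item[1][0] * item[1][1])[0]
    return parser.icons[0][0]
```

Applied to the markup in the first comment, `pick_icon` would skip the 16x16 entry and return the 192x192 `android-chrome` icon, since no 48x48 icon is declared. Falling back to the first icon when no `sizes` is declared keeps the behaviour described above for simpler sites.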