openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0

Use favicon from crawler? #120

Open rgaudin opened 9 months ago

rgaudin commented 9 months ago

Since crawler 0.11.0 (https://github.com/webrecorder/browsertrix-crawler/pull/362), the captured favicon is available in pages.jsonl. We could use that when a custom favicon is not provided, instead of parsing the seed URL ourselves.

benoit74 commented 1 week ago

We must still keep a fallback to parsing the seed URL ourselves, since we cannot expect pages.jsonl to always be available: warc2zim must work from only a WARC file, and pages.jsonl is only available when warc2zim is used in conjunction with browsertrix crawler (e.g. in the zimit scraper).
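A rough sketch of the lookup order discussed above: custom favicon first, then pages.jsonl, then parsing the seed URL. This is not warc2zim code. The function names, the `parse_seed` callback, and the `favIconUrl` key are assumptions made for illustration; the actual field name should be checked against the crawler's output.

```python
import json
from typing import Callable, Iterable, Optional


def favicon_from_pages(
    lines: Iterable[str],
    seed_url: str,
    key: str = "favIconUrl",  # assumed field name, verify against the crawler
) -> Optional[str]:
    """Return the favicon URL recorded for seed_url, or None if absent."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore malformed lines rather than failing the run
        if isinstance(entry, dict) and entry.get("url") == seed_url and entry.get(key):
            return entry[key]
    return None


def choose_favicon(
    custom: Optional[str],
    pages_lines: Optional[Iterable[str]],
    seed_url: str,
    parse_seed: Callable[[str], Optional[str]],  # hypothetical existing fallback
) -> Optional[str]:
    """Pick a favicon: custom value, then pages.jsonl, then seed URL parsing."""
    if custom:
        return custom
    if pages_lines is not None:  # pages.jsonl may not exist (WARC-only input)
        found = favicon_from_pages(pages_lines, seed_url)
        if found:
            return found
    return parse_seed(seed_url)
```

When only a WARC file is given, `pages_lines` is `None` and the call falls straight through to `parse_seed`, which keeps the current behaviour intact.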