openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0

Warc2zim hung forever #246

Closed: benoit74 closed this issue 1 month ago

benoit74 commented 1 month ago

Task: https://farm.openzim.org/pipeline/b7a1a162-671c-4cb5-bda9-8b4b917efbc2/debug

This is most probably a recent regression in warc2zim 2.

At 2024-05-11 08:07:42, warc2zim started

At 2024-05-11 08:08:36, it had collected the metadata

After that, nothing more appeared in the log for almost 48h, so I cancelled the task at 2024-05-13 06:27.

benoit74 commented 1 month ago

The problem is in the handling of redirections. The scraper stores only redirections from ZimPath to ZimPath, which means that in some cases we store a redirection to itself (e.g. a redirection from http://www.kiwix.org to https://www.kiwix.org has the same source and target once both are expressed as ZimPaths). These redirections should simply be ignored, since both ends are already considered equal in terms of ZimPath.

benoit74 commented 1 month ago

(and this was leading to an infinite loop)
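
To illustrate the idea, here is a minimal sketch of ignoring self-redirections, not the actual warc2zim code. `to_zim_path`, `collect_redirects` and `resolve` are hypothetical names, and the normalization (dropping the scheme and lowercasing the host) is only an assumption standing in for the real URL-to-ZimPath conversion.

```python
from urllib.parse import urlsplit


def to_zim_path(url: str) -> str:
    """Hypothetical normalization: drop the scheme, keep host + path.

    This is NOT warc2zim's real conversion, only a stand-in that makes
    the http:// and https:// variants of the same URL compare equal.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    return f"{parts.netloc.lower()}{path}"


def collect_redirects(pairs):
    """Build a ZimPath -> ZimPath map, ignoring redirections to self."""
    redirects = {}
    for source_url, target_url in pairs:
        source = to_zim_path(source_url)
        target = to_zim_path(target_url)
        if source == target:
            # Same ZimPath on both sides: nothing to redirect, skip it.
            continue
        redirects[source] = target
    return redirects


def resolve(path, redirects):
    """Follow redirections; stop if a path is seen twice (cycle guard)."""
    seen = set()
    while path in redirects and path not in seen:
        seen.add(path)
        path = redirects[path]
    return path


redirects = collect_redirects([
    # Identical once normalized, so this one is dropped.
    ("http://www.kiwix.org", "https://www.kiwix.org"),
    # A real redirection between two distinct paths, so it is kept.
    ("http://kiwix.org/old", "https://www.kiwix.org/new"),
])
```

Filtering at collection time addresses the cause described above. The `seen` set in `resolve` is an extra safeguard, so that a cycle which slips through cannot make resolution spin forever.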