openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0

Raise warnings when there is a conflict of http/https and/or ports and/or ... #275

Open benoit74 opened 1 month ago

benoit74 commented 1 month ago

Do we want to raise a warning in the logs (or fail the scraper?) when two WARC records lead to the same ZIM path, most probably due to a conflict between http and https URLs?

It would be great to ensure the warning is displayed only when the resources really differ, but HTTP redirections make this hard.

Not sure it is really worth it (at least we already have lots of debug messages "Skipping duplicate {url}, already added to ZIM"), so this has to be analyzed in detail.
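
To make the idea concrete, here is a rough sketch of what such a check could look like. It is not warc2zim's actual code: `to_zim_path`, `ConflictTracker` and the `status` parameter are hypothetical names, the normalization (drop the scheme and port, keep host, path and query) is only an assumption about how two URLs end up on the same ZIM path, and skipping 3xx records is just one possible way to sidestep the redirection problem.

```python
import hashlib
import logging
from urllib.parse import urlsplit

logger = logging.getLogger("conflict-sketch")


def to_zim_path(url: str) -> str:
    """Hypothetical normalization: drop scheme and port, keep host, path, query."""
    parts = urlsplit(url)
    path = f"{parts.hostname or ''}{parts.path or '/'}"
    if parts.query:
        path += f"?{parts.query}"
    return path


class ConflictTracker:
    """Remembers the first payload digest seen for each ZIM path."""

    def __init__(self) -> None:
        # ZIM path -> (first URL seen, SHA-256 of its payload)
        self._seen = {}

    def check(self, url: str, payload: bytes, status: int = 200) -> str:
        """Return "redirect", "new", "duplicate" or "conflict" for one record."""
        # Assumption: redirect records carry no meaningful body, so they are
        # never compared and never registered as the owner of a path.
        if 300 <= status < 400:
            return "redirect"
        path = to_zim_path(url)
        digest = hashlib.sha256(payload).hexdigest()
        previous = self._seen.get(path)
        if previous is None:
            self._seen[path] = (url, digest)
            return "new"
        first_url, first_digest = previous
        if first_digest == digest:
            # Same content under two URLs: harmless, keep it at debug level.
            logger.debug("Skipping duplicate %s, already added to ZIM", url)
            return "duplicate"
        # Same ZIM path but different content: this is the case worth a warning.
        logger.warning(
            "Conflict on ZIM path %s: %s and %s have different content",
            path, first_url, url,
        )
        return "conflict"
```

With this, `http://example.com/a` and `https://example.com/a` with identical bodies would only produce the existing debug message, while a different body on `https://example.com:8443/a` would produce a warning. Comparing digests only catches byte-identical duplicates, so resources that differ trivially (timestamps, tracking tokens) would still be reported as conflicts.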