openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0

ValueError: Incorrect HttpUrl scheme in value #332

Open rgaudin opened 6 days ago

rgaudin commented 6 days ago

This zimit run failed while rewriting a URL with an exotic scheme (intent).

ValueError: Incorrect HttpUrl scheme in value: intent://show/0vT7LJMeVDxyQ2ZamHKu08?si=_uDRD6bRR_6M5YZyISG

Non-http/https schemes are valid and should be ignored instead of being rewritten. I'm not sure whether this expectation differs from the current implementation or whether it's simply a bug…

benoit74 commented 6 days ago

If I'm not mistaken, non-http(s) schemes are ignored only when rewriting URLs.

Here the error happens during the initial gather_information_from_warc pass, where we loop over all entries, so I strongly suspect we have a WARC record associated with this URL. I didn't know Browsertrix Crawler was capable of retrieving non-http(s) resources. Glad I implemented this check to avoid pushing stupid things to the ZIM. However, such entries should probably just be ignored instead of raising an error and stopping the process.
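
To illustrate the "ignore instead of raise" idea, here is a minimal sketch. It is not warc2zim's actual code: the names SUPPORTED_SCHEMES, has_supported_scheme and split_record_urls are hypothetical, and the input is assumed to be a plain list of record target URLs rather than real WARC records. It relies only on urllib.parse.urlsplit from the standard library.

```python
from urllib.parse import urlsplit

# Assumption: only these schemes are worth adding to the ZIM.
SUPPORTED_SCHEMES = {"http", "https"}


def has_supported_scheme(url: str) -> bool:
    """Return True if the URL scheme is http or https."""
    # urlsplit already lowercases the scheme it returns.
    return urlsplit(url).scheme in SUPPORTED_SCHEMES


def split_record_urls(record_urls):
    """Separate URLs to process from URLs to skip, without raising.

    Hypothetical stand-in for the loop over WARC entries: unsupported
    schemes are collected for logging instead of aborting the run.
    """
    kept, skipped = [], []
    for url in record_urls:
        if has_supported_scheme(url):
            kept.append(url)
        else:
            skipped.append(url)
    return kept, skipped


kept, skipped = split_record_urls(
    [
        "https://example.com/page",
        "intent://show/0vT7LJMeVDxyQ2ZamHKu08?si=_uDRD6bRR_6M5YZyISG",
    ]
)
print("kept:", kept)
print("skipped:", skipped)
```

In a real fix, the skipped URLs would presumably be logged at debug or warning level so that unexpected records remain visible without stopping the conversion.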