Open rgaudin opened 6 days ago
If I'm not mistaken, non-http(s) schemes are ignored only when rewriting URLs.
Here the error happens during the initial `gather_information_from_warc`, where we loop over all entries, so I strongly suspect we have a WARC record associated with this URL. I didn't know Browsertrix crawler was capable of retrieving non-http(s) resources. Glad I implemented this check to avoid pushing stupid things to the ZIM. However, such entries should probably just be ignored instead of raising an error and stopping the process.
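To illustrate the "ignore instead of raise" behaviour suggested here, a minimal sketch follows. It is not the actual warc2zim/zimit code: `has_supported_scheme`, `gather_urls`, and `SUPPORTED_SCHEMES` are hypothetical names, and the input is assumed to be a plain list of record target URLs rather than real WARC records.

```python
from urllib.parse import urlsplit

# Assumption: only these schemes are meaningful inside the ZIM.
SUPPORTED_SCHEMES = {"http", "https"}


def has_supported_scheme(url: str) -> bool:
    """Return True if the URL uses http or https (hypothetical helper)."""
    return urlsplit(url).scheme.lower() in SUPPORTED_SCHEMES


def gather_urls(record_urls):
    """Split URLs into kept and skipped lists instead of raising.

    Hypothetical stand-in for the gathering loop: entries with an
    unsupported scheme are set aside rather than aborting the run.
    """
    kept, skipped = [], []
    for url in record_urls:
        if has_supported_scheme(url):
            kept.append(url)
        else:
            skipped.append(url)
    return kept, skipped


if __name__ == "__main__":
    kept, skipped = gather_urls(
        [
            "https://example.com/",
            "intent://example.com/#Intent;end",
            "http://example.com/a",
        ]
    )
    print("kept:", kept)
    print("skipped:", skipped)
```

Returning the skipped URLs (rather than dropping them silently) would let the caller log them, which keeps such entries visible without stopping the process.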
This zimit run failed rewriting a URL with an exotic scheme (`intent`). Non-`http`/`https` schemes are valid and should be ignored instead of being rewritten. Not sure whether this statement is out of line with the current implementation or if it's just a bug…