Currently, when the scraper fails to process a WARC record, it immediately stops.
Since this might happen after a few minutes of processing (we have some huge WARC files to process), it is quite painful for the developer, especially since only logs are available afterwards.
We could consider adding an option to:
- wrap the processing of every WARC record in a try/catch
- in DEBUG mode, log more details when a WARC record fails (record headers, HTTP headers) and save the record content, base64-encoded, in the output folder
- add a CLI flag `--continue-on-error` and process accordingly (to avoid fixing issues one by one)
This option should not be exposed in the Zimfarm.
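The behavior described above could be sketched roughly as follows. This is only a minimal illustration, not the scraper's actual code: `process_records`, `process_record`, and the `(record_id, record_headers, http_headers, content)` tuple shape are all hypothetical stand-ins for the real WARC iteration and per-record logic.

```python
import base64
import logging
from pathlib import Path

logger = logging.getLogger("scraper")


def process_records(records, process_record, output_dir, continue_on_error=False):
    """Process WARC-like records one by one, optionally surviving failures.

    `records` yields (record_id, record_headers, http_headers, content)
    tuples, a simplified stand-in for real WARC records; `process_record`
    stands in for the scraper's per-record logic. Returns the ids of the
    records that failed.
    """
    failed = []
    for rec_id, rec_headers, http_headers, content in records:
        try:
            process_record(rec_id, content)
        except Exception as exc:
            if not continue_on_error:
                raise  # current behavior: stop on the first failure
            failed.append(rec_id)
            logger.warning("failed to process record %s: %s", rec_id, exc)
            if logger.isEnabledFor(logging.DEBUG):
                # extra context when running in DEBUG mode
                logger.debug("record headers: %s", rec_headers)
                logger.debug("HTTP headers: %s", http_headers)
                # save the raw content, base64-encoded, for later inspection
                safe_name = rec_id.replace(":", "_").replace("/", "_")
                dump_path = Path(output_dir) / f"{safe_name}.b64"
                dump_path.write_bytes(base64.b64encode(content))
    return failed
```

The `continue_on_error` argument would then be wired to the CLI flag, e.g. with argparse's `parser.add_argument("--continue-on-error", action="store_true")`, and simply not forwarded by the Zimfarm recipe definitions.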