openzim / mindtouch

libretexts.org to ZIM scraper
GNU General Public License v3.0

Handle case where assets are "bad" #76

Closed benoit74 closed 4 days ago

benoit74 commented 1 week ago

We've had the following exception in https://farm.openzim.org/pipeline/dc8c87a5-dfda-4b84-bc35-4fa2032b1d43

requests.exceptions.ConnectionError: HTTPConnectionPool(host='mygeologypage.ucdavis.edu', port=80): Max retries exceeded with url: /sumner/gel109/labs/USGSExplanation.jpg (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7f5750c031a0>: Failed to resolve 'mygeologypage.ucdavis.edu' ([Errno -2] Name or service not known)"))

Checking online, http://mygeologypage.ucdavis.edu/sumner/gel109/labs/USGSExplanation.jpg, the host really is not reachable.

It might even be permanently down.

How should we handle cases where an asset (here, an image on one page) is not reachable? We want to avoid creating ZIMs with broken images, but here the image is broken at the source.

I propose to:

benoit74 commented 1 week ago

I propose an enhancement: automatically replace bad assets that are not excluded with the "bad image here" placeholder, count them, and fail the scraper only when the proportion of missing, non-excluded assets exceeds a given threshold (meaning too many assets are not OK).

On the first run, we can set the threshold to a very high value so a ZIM is created no matter what. Then we can use the logs to get the list of failing assets, configure the exclusion regex, and set the threshold to a very low value (e.g. 0.1%, just so a single newly failing asset that is not yet in the exclude list does not stop every run).

rgaudin commented 1 week ago

Sounds good. Clearly marked as dev