Closed — benoit74 closed this issue 4 days ago
I propose an enhancement: automatically replace bad assets that are not excluded with a "bad image here" placeholder, count them, and fail the scraper only if the proportion of assets that are missing and not excluded exceeds a given threshold (meaning too many assets are broken).
On the first run, we can set the threshold to a very high value so a ZIM is created no matter what. Then we can use the logs to get the list of failing assets, configure the exclusion regex, and set the threshold to a very low value (e.g. 0.1%, just so that a run is not stopped every time one more asset starts failing and is not yet in the exclude list).
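A minimal sketch of the idea in Python, to make the counting and threshold logic concrete. Everything here is hypothetical (the `BadAssetTracker` class, the placeholder bytes, the exclusion-regex and threshold parameters); it is not an existing scraper API, just an illustration of the behaviour proposed above.

```python
import re
from typing import Callable, Optional

# Hypothetical placeholder bytes standing in for the "bad image here" image.
PLACEHOLDER_IMAGE = b"...bad image here..."


class BadAssetTracker:
    """Count failing assets and decide whether the run should fail.

    Excluded failures (URLs matching `exclude_regex`) are replaced silently;
    the run fails only when the share of non-excluded failures among all
    processed assets exceeds `threshold`.
    """

    def __init__(self, exclude_regex: Optional[str], threshold: float):
        self.exclude = re.compile(exclude_regex) if exclude_regex else None
        self.threshold = threshold  # e.g. 0.001 for 0.1%
        self.total = 0
        self.bad_not_excluded = 0

    def fetch_or_placeholder(self, url: str, fetch: Callable[[str], bytes]) -> bytes:
        """Return the asset content, or the placeholder if fetching fails."""
        self.total += 1
        try:
            return fetch(url)
        except Exception as exc:
            if self.exclude and self.exclude.search(url):
                # Known-bad asset: replace with placeholder, do not count it.
                print(f"excluded bad asset: {url} ({exc})")
            else:
                self.bad_not_excluded += 1
                print(f"unexpected bad asset: {url} ({exc})")
            return PLACEHOLDER_IMAGE

    def check(self) -> None:
        """Fail the scraper if too many non-excluded assets are missing."""
        if self.total and self.bad_not_excluded / self.total > self.threshold:
            raise RuntimeError(
                f"{self.bad_not_excluded}/{self.total} assets failed and are "
                "not excluded; aborting"
            )
```

With this shape, the first run would use something like `threshold=1.0` to always produce a ZIM, and once the exclusion regex is tuned from the logs, the threshold could be dropped to `0.001`.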
Sounds good. Clearly marked as dev.
We've had the following exception in https://farm.openzim.org/pipeline/dc8c87a5-dfda-4b84-bc35-4fa2032b1d43
Checking online, the host of http://mygeologypage.ucdavis.edu/sumner/gel109/labs/USGSExplanation.jpg really is not reachable.
It might even be permanently down.
How should we handle such cases, where an asset (an image on one page) is not reachable? We want to avoid creating ZIMs with broken images, but here the image is broken at the source.
I propose to: