Closed by Popolechien 3 years ago
I believe I can confirm the problem: in the latest wiki.openzim.org scrape, I had a few images missing myself.
Please share the ZIM and indicate where to find such images so we can look at what's special about them. Please also share the task ID or link so we can check the logs.
https://farm.youzim.it/pipeline/d1c2f201514f3da67f887df5 for the task - images are missing on every second page or so.
OK, I've looked into this. It is also related to the images' srcset, but it's not fixed by #63 (which added support for them).
What we're seeing here is that the crawler is not making requests for all of the images in the srcset (or those requests fail to complete), so there are missing images. Depending on which one your browser picks (kind of hard to predict, but you can inspect what it's trying to display), you may get one that was crawled or one that was not.
I've also found that it's sort of random in selecting which image gets crawled…
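To check which srcset candidates a given page is missing, something like the following rough sketch could help. It is not part of the scraper or of browsertrix-crawler: the function and class names are made up for illustration, and it assumes you already have the page's HTML and a set of the URLs that actually ended up in the crawl. The srcset parsing is simplified (it splits on commas, so URLs containing commas are not handled).

```python
from html.parser import HTMLParser


def parse_srcset(srcset):
    """Return the candidate URLs listed in a srcset attribute value.

    Simplification: candidates are split on commas, so URLs that
    themselves contain commas are not handled.
    """
    urls = []
    for candidate in srcset.split(","):
        parts = candidate.split()
        if parts:
            urls.append(parts[0])  # drop the width/density descriptor
    return urls


class ImgCollector(HTMLParser):
    """Collect the src and srcset candidate URLs of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.images = []  # one list of candidate URLs per <img>

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attrs = dict(attrs)
        candidates = []
        if attrs.get("src"):
            candidates.append(attrs["src"])
        if attrs.get("srcset"):
            candidates.extend(parse_srcset(attrs["srcset"]))
        self.images.append(candidates)


def missing_candidates(html, crawled_urls):
    """For each <img>, list the candidate URLs absent from crawled_urls."""
    collector = ImgCollector()
    collector.feed(html)
    return [
        [url for url in candidates if url not in crawled_urls]
        for candidates in collector.images
    ]


# Hypothetical page and crawl result, for illustration only.
page = (
    '<img src="https://example.org/a.jpg" '
    'srcset="https://example.org/a.jpg?w=300 300w, '
    'https://example.org/a.jpg?w=600 600w">'
)
crawled = {"https://example.org/a.jpg", "https://example.org/a.jpg?w=300"}
print(missing_candidates(page, crawled))
```

An image would only be guaranteed to display when its list comes back empty, since the browser may pick any of the candidates.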
I've opened a ticket upstream: https://github.com/webrecorder/browsertrix-crawler/issues/3
@kelson42 should we keep that open until it gets solved upstream?
@rgaudin Yes, please keep this ticket open.
I just zimmed up a WordPress blog with 186 articles (cutoff at 1,000) and about 500 images (https://mesquartierschinois.wordpress.com). Standard, free WordPress, i.e. no funky extension added.
I would say 10-20% of images are still missing.