openzim / zimit

Make a ZIM file from any Web site and surf offline!
GNU General Public License v3.0

Images missing even when under cutoff value #70

Closed: Popolechien closed this issue 3 years ago

Popolechien commented 3 years ago

I just zimmed up a WordPress blog with 186 articles (cutoff at 1,000) and about 500 images (https://mesquartierschinois.wordpress.com). It's a standard, free WordPress site, i.e. no funky extension added.

I would say 10-20% of images are still missing.

kelson42 commented 3 years ago

I believe I can confirm the problem: with the latest wiki.openzim.org scrape, I had a few images missing myself.

rgaudin commented 3 years ago

Please share the ZIM and indications of where to find such images so we can look at what's special about them. Task ID or link as well so we can check the logs.

Popolechien commented 3 years ago

https://farm.youzim.it/pipeline/d1c2f201514f3da67f887df5 for the task - images are missing on every second page or so.

rgaudin commented 3 years ago

OK, I've looked into this. It is also related to the images' srcset attribute, but it's not fixed by #63 (which added support for srcset).

What we're seeing here is that the crawler is not making requests for all of the images in the srcset (or those requests fail to complete), so some images are missing. Depending on which one your browser picks (kind of hard to predict, but you can inspect what it's trying to display), you may get one that was crawled or one that was not.

I've also found that it's sort of random in selecting which image gets crawled…
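To check which srcset candidates actually made it into a crawl, something like the following Python sketch could help. It assumes you already have the set of URLs that were captured (for instance extracted from the crawl logs). `parse_srcset` and `missing_candidates` are hypothetical helper names, not part of zimit or browsertrix-crawler, and the parser is a simplification of the HTML spec's srcset algorithm (it does not handle parentheses in descriptors).

```python
from urllib.parse import urljoin


def parse_srcset(srcset):
    """Split a srcset attribute value into (url, descriptor) pairs.

    Simplified: a URL is a run of non-whitespace characters, and the
    descriptor (e.g. "300w" or "2x") is whatever follows up to the next comma.
    """
    candidates = []
    pos, n = 0, len(srcset)
    while pos < n:
        # Skip separators (whitespace and commas) between candidates.
        while pos < n and (srcset[pos].isspace() or srcset[pos] == ","):
            pos += 1
        if pos >= n:
            break
        # Collect the URL: everything up to the next whitespace.
        start = pos
        while pos < n and not srcset[pos].isspace():
            pos += 1
        url = srcset[start:pos]
        descriptor = ""
        if url.endswith(","):
            # A trailing comma means this candidate has no descriptor.
            url = url.rstrip(",")
        else:
            # Collect the descriptor: everything up to the next comma.
            start = pos
            while pos < n and srcset[pos] != ",":
                pos += 1
            descriptor = srcset[start:pos].strip()
        candidates.append((url, descriptor))
    return candidates


def missing_candidates(srcset, crawled_urls, base_url):
    """Return the absolute srcset URLs that are not in crawled_urls."""
    missing = []
    for url, _descriptor in parse_srcset(srcset):
        absolute = urljoin(base_url, url)
        if absolute not in crawled_urls:
            missing.append(absolute)
    return missing


if __name__ == "__main__":
    # Made-up example values, only to illustrate the check.
    srcset = "/img/a.jpg?w=300 300w, /img/a.jpg?w=600 600w"
    crawled = {"https://example.com/img/a.jpg?w=300"}
    print(missing_candidates(srcset, crawled, "https://example.com/post/"))
```

Any URL this reports is a candidate the browser might pick at display time but that was never captured, which would match the "missing on some pages, present on others" symptom.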

I've opened a ticket upstream: https://github.com/webrecorder/browsertrix-crawler/issues/3

@kelson42 should we keep that open until it gets solved upstream?

kelson42 commented 3 years ago

@rgaudin Yes, please keep this ticket open.