Open Popolechien opened 1 month ago
All maps that are not on the first page of https://www.cia.gov/the-world-factbook/maps/world-regional/ are missing.
This is because the "Next page" button is not a real hyperlink but a button which loads the next page via JavaScript.
Browsertrix crawler only explores hyperlinks found in pages.
Country pages have most probably all been fetched thanks to hyperlinks found on the comparison pages: https://www.cia.gov/the-world-factbook/field/population/country-comparison/, not from https://www.cia.gov/the-world-factbook/countries/ (which has the same "Next" button).
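To illustrate why a link-following crawler never reaches the later pages, here is a minimal Python sketch. The HTML snippet and the `/maps/example-map.pdf` path are made up for illustration (this is not the actual markup of cia.gov), and `LinkExtractor` is a simplified stand-in for link discovery, not Browsertrix's real implementation.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags, i.e. what a link-following crawler would queue."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only real hyperlinks are recorded; buttons carry no URL to follow.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Hypothetical markup: one real hyperlink and one JavaScript-driven pagination button.
SAMPLE_PAGE = """
<a href="/maps/example-map.pdf">Example map</a>
<button type="button" class="next-page">Next page</button>
"""


def discover_links(html):
    """Return every hyperlink target found in the given HTML."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


print(discover_links(SAMPLE_PAGE))
```

The button contributes nothing to the list of discovered links, so whatever it would load on click stays invisible to the crawl unless something actually clicks it.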
The solution is to create a custom behavior that clicks on these "Next" buttons. This is work to be done by a developer; expect about 2 hours of work. It would be a great first use case (we do not use custom behaviors much), especially since I find this ZIM quite valuable and it's probably an interesting flagship product.
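The control flow such a behavior would need can be sketched as follows. This is only a model in Python: a real Browsertrix custom behavior is written in JavaScript and runs inside the page, and the `has_next` / `click_next` / `visible_links` methods and the `FakePage` class are hypothetical names introduced here purely to exercise the loop.

```python
def collect_all_links(page, max_pages=1000):
    """Click "Next" until it is no longer available, gathering links from every page.

    `page` is any object exposing has_next(), click_next() and visible_links()
    (a hypothetical interface, not a Browsertrix API).
    """
    seen = []
    for _ in range(max_pages):  # hard cap guards against a button that never disappears
        for link in page.visible_links():
            if link not in seen:
                seen.append(link)
        if not page.has_next():
            break
        page.click_next()
    return seen


class FakePage:
    """In-memory stand-in for a paginated listing, used only to run the loop."""

    def __init__(self, pages):
        self._pages = pages
        self._index = 0

    def has_next(self):
        return self._index < len(self._pages) - 1

    def click_next(self):
        self._index += 1

    def visible_links(self):
        return self._pages[self._index]


fake = FakePage([["/map-a.pdf", "/map-b.pdf"], ["/map-c.pdf"]])
print(collect_all_links(fake))
```

In the real behavior, the links exposed after each click would be what the crawler picks up; the page cap is a design choice to keep a misbehaving "Next" button from looping forever.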
Edit: All maps that are not on the first page of https://www.cia.gov/the-world-factbook/maps/world-regional/ are missing.
The same problem affects photos which are not on the main page, e.g. https://dev.library.kiwix.org/content/theworldfactbook_en_all_2023-05/A/www.cia.gov/the-world-factbook/countries/afghanistan/images
The same solution can be applied (most probably the same code will tackle all of these problems).
I've deleted the file from production since it is too incomplete.
What changed between Zimit 1 (where this ZIM worked fine) and Zimit 2 (where we need a custom behaviour)? Surely on the browsing side the main change is the browser version (?). This ZIM was highly dynamic, e.g. it would load thumbnails for all pages and then load the full-res images after a delay, all done with JS from what I could tell.
In sum, are we sure it's a crawler issue, and not some warc2zim interaction with the highly dynamic nature of the contents?
The incomplete ZIM is a Zimit 1 file.
ZIM(s) location
https://library.kiwix.org/viewer#theworldfactbook_en_all_2023-12/A/www.cia.gov/the-world-factbook/
Recipe(s) URL
https://farm.openzim.org/recipes/CIAworldfactbook_en_all/edit
Readers tested
Both ZIM versions impacted?
Yes, both versions are impacted
Details
When checking out Maps it appears that some of the PDF files are missing (try looking for the Political Europe map on https://download.kiwix.org/zim/zimit/theworldfactbook_en_all_2023-12.zim for instance).
I have restarted the recipe, but does anyone have any idea why the scraping would be incomplete?