openzim / zim-requests

Want a new ZIM file? Propose ZIM content improvements or fixes? Here you are!
https://farm.openzim.org

CIA World factbook is incomplete #988

Open Popolechien opened 1 month ago

Popolechien commented 1 month ago

ZIM(s) location

https://library.kiwix.org/viewer#theworldfactbook_en_all_2023-12/A/www.cia.gov/the-world-factbook/

Recipe(s) URL

https://farm.openzim.org/recipes/CIAworldfactbook_en_all/edit

Readers tested

Both ZIM versions impacted?

Yes, both versions are impacted

Details

When checking out Maps it appears that some of the PDF files are missing (try looking for the Political Europe map on https://download.kiwix.org/zim/zimit/theworldfactbook_en_all_2023-12.zim for instance).

I have restarted the recipe, but does anyone have any idea why the scraping would be incomplete?

benoit74 commented 1 month ago

All maps that are not on the first page of https://www.cia.gov/the-world-factbook/maps/world-regional/ are missing.

This is because the "Next page" button is not a real hyperlink but a button that loads the next page via JavaScript code.

Browsertrix crawler only explores hyperlinks found in pages.
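
To illustrate the point above, here is a toy sketch of hyperlink-only discovery. It is not Browsertrix code: the function name `extractHyperlinks`, the sample markup, and the button's class name are all hypothetical, and a real crawler parses the DOM rather than using a regular expression.

```typescript
// Toy model of hyperlink-only discovery (assumption: NOT how Browsertrix
// is implemented). Only targets of <a href="..."> elements get collected.
function extractHyperlinks(html: string): string[] {
  const links: string[] = [];
  const anchorHref = /<a\s[^>]*href="([^"]*)"/gi;
  let match: RegExpExecArray | null;
  while ((match = anchorHref.exec(html)) !== null) {
    links.push(match[1]);
  }
  return links;
}

// Hypothetical markup: one real hyperlink and one JavaScript-driven button.
const samplePage = `
  <a href="/the-world-factbook/maps/world-regional/">Maps</a>
  <button class="hypothetical-next-button">Next page</button>
`;

// The button contributes no URL, so whatever it would load is never queued.
console.log(extractHyperlinks(samplePage));
```

In this model the second page of maps is reachable only by clicking the button, so a crawl that follows only hyperlinks never sees it.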

All country pages were most probably fetched thanks to hyperlinks found on comparison pages such as https://www.cia.gov/the-world-factbook/field/population/country-comparison/, not from https://www.cia.gov/the-world-factbook/countries/ (which has the same "Next" button).

The solution is to create a custom behavior that clicks these "Next" buttons. This needs a developer; expect about 2 hours of work. It would be a great first use case (we do not use custom behaviors much), especially since I find this ZIM quite valuable and it is probably an interesting flagship product.
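
The core loop of such a behavior could look roughly like the sketch below. This is only an outline under assumptions: the `PaginatedPage` interface, `clickThroughPages`, `makeFakePage`, and the `maxClicks` safety cap are hypothetical names introduced for illustration, and the wiring into Browsertrix's actual custom-behavior format is not shown. In a real behavior, `hasNext` and `clickNext` would be DOM operations on the site's "Next" button.

```typescript
// Hypothetical abstraction over a paginated page (not a Browsertrix API).
interface PaginatedPage {
  // True while a usable "Next" button is present.
  hasNext(): boolean;
  // Clicks "Next" and resolves once the new content has loaded.
  clickNext(): Promise<void>;
}

// Keep clicking "Next" until no button is left or the safety cap is hit.
// Returns the number of clicks performed.
async function clickThroughPages(
  page: PaginatedPage,
  maxClicks: number = 500
): Promise<number> {
  let clicks = 0;
  while (clicks < maxClicks && page.hasNext()) {
    await page.clickNext();
    clicks++;
  }
  return clicks;
}

// In-memory stand-in for a listing with a fixed number of pages,
// used here only to exercise the loop without a browser.
function makeFakePage(totalPages: number): PaginatedPage {
  let current = 1;
  return {
    hasNext: () => current < totalPages,
    clickNext: async () => {
      current++;
    },
  };
}

clickThroughPages(makeFakePage(4)).then((n) => {
  console.log("clicked Next " + n + " times");
});
```

The cap on clicks is a defensive choice, so that a button which never disappears cannot keep the crawler on one page forever.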

benoit74 commented 1 month ago

Edit: All maps that are not on the first page of https://www.cia.gov/the-world-factbook/maps/world-regional/ are missing.

benoit74 commented 1 month ago

The same problem affects photos that are not on the main page, e.g. https://dev.library.kiwix.org/content/theworldfactbook_en_all_2023-05/A/www.cia.gov/the-world-factbook/countries/afghanistan/images

The same solution can be applied (most probably the same code will tackle all of these problems, in fact).

benoit74 commented 3 weeks ago

I've deleted the file from production since it is too incomplete.

Jaifroid commented 3 weeks ago

What changed between Zimit 1 (where this ZIM worked fine) and Zimit 2 (where we need a custom behaviour)? Surely on the browsing side the main change is the browser version (?). This ZIM was highly dynamic, e.g. it would load thumbnails for all pages and then load the full-res images after a delay, all done with JS from what I could tell.

In sum, are we sure it's a crawler issue, and not some warc2zim interaction with the highly dynamic nature of the contents?

benoit74 commented 3 weeks ago

The ZIM which is incomplete is a Zimit1 file.