openzim / zim-requests

Want a new ZIM file? Propose ZIM content improvements or fixes? Here you are!
https://farm.openzim.org

New request: CIA World Factbook - Lite version #774

Open gd4c opened 6 months ago

gd4c commented 6 months ago

The full CIA factbook zim is > 6GB. I don't really need all the included PDF documents and other space consumers. Just the country/region texts, flags and maps. Shouldn't be more than a couple hundred MB.

Thanks for your work.

RavanJAltaie commented 6 months ago

@benoit74 is this request doable?

benoit74 commented 6 months ago

@gd4c could you please specify the exact list of web pages that would cover the country/region texts, flags and maps? I'm not sure I understand what you are referring to, and I would like to avoid too much back-and-forth on this request.

@RavanJAltaie yes, this is probably doable with custom --include settings, to be confirmed once we have the exact list of pages.

Popolechien commented 6 months ago

I'd like to discuss this with @RavanJAltaie first as it would be nice to have a policy regarding custom content so as to avoid discussing ad hoc requests. Seeing as there are 200+ countries and territories in the world, this one particular request looks tedious.

benoit74 commented 6 months ago

@Popolechien OK; processing 200+ countries does not necessarily mean that we have to process them one by one. Maybe there are patterns in what has to be included and what has to be excluded.

Honestly, I have not spent even 5 minutes looking into the ZIM to understand why it consumes 6GB and how we could (or could not) create a lite version.

Anyway, your point about having a policy is still valid ^^

gd4c commented 6 months ago

@benoit74 Certainly

All 262 countries and regions linked here: https://www.cia.gov/the-world-factbook/countries/

The World & Its Regions: https://www.cia.gov/the-world-factbook/ (scroll down)

Oceans: https://www.cia.gov/the-world-factbook/ (scroll down)

Country Comparisons: https://www.cia.gov/the-world-factbook/references/guide-to-country-comparisons/

For each country/region/ocean, keep the text, maps (and possibly in-text images) but exclude linked PDF, audio and photos.

Take Albania as an example:

In short, if you programmatically exclude .PDF, .mp3, and country landscape photos, you are 95% of the way there.
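
A minimal sketch of that kind of filter in Python, assuming a resource can be judged by its URL alone. The extension list, the `should_keep` helper and the `example.org` URLs are illustrative and not part of any existing scraper. Landscape photos are left out because nothing in this thread says how their URLs differ from those of maps and flags.

```python
from urllib.parse import urlparse
import posixpath

# Illustrative list: the extensions named above, compared case-insensitively.
EXCLUDED_EXTENSIONS = {".pdf", ".mp3"}


def should_keep(url: str) -> bool:
    """Return False for URLs whose path ends in an excluded extension."""
    path = urlparse(url).path  # drops any query string or fragment
    extension = posixpath.splitext(path)[1].lower()
    return extension not in EXCLUDED_EXTENSIONS


urls = [
    "https://www.cia.gov/the-world-factbook/countries/albania/",
    "https://example.org/files/summary.PDF",      # made-up URL
    "https://example.org/audio/anthem.mp3?x=1",   # made-up URL
]
kept = [u for u in urls if should_keep(u)]
```

With these inputs only the Albania page URL remains in `kept`.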

gd4c commented 6 months ago

There is also this. Let me know if I should make a separate issue for it.

Rexadev commented 5 months ago

Related https://github.com/openzim/zim-requests/issues/593

RavanJAltaie commented 5 months ago

@Popolechien Let's discuss this today & decide about it.

benoit74 commented 5 months ago

"In short, if you programmatically exclude .PDF, .mp3, and country landscape photos, you are 95% of the way there."

In short, this is not feasible as of today: the scraper can only be configured to exclude certain pages, not resources within pages. I've opened an upstream ticket for now: https://github.com/openzim/zimit/issues/278
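
To illustrate the limitation, here is a toy model in Python. It is not zimit's code: the `page_level_crawl` helper, the page data and the `example.org` URLs are all made up, and it only assumes what is stated above, namely that exclusion rules are matched against page URLs and never against the resources a page embeds.

```python
import re

# Toy data: each page URL maps to the resources it embeds (all made up).
PAGES = {
    "https://example.org/countries/albania/": [
        "https://example.org/maps/albania.jpg",
        "https://example.org/docs/albania-summary.pdf",
    ],
    "https://example.org/about/archives/": [
        "https://example.org/docs/archive-2020.pdf",
    ],
}


def page_level_crawl(pages, exclude_page):
    """Apply an exclusion rule to page URLs only; resources follow their page."""
    captured = []
    for page_url, resources in pages.items():
        if exclude_page.search(page_url):
            continue  # the whole page, and everything on it, is skipped
        captured.append(page_url)
        captured.extend(resources)  # no per-resource check happens here
    return captured


captured = page_level_crawl(PAGES, re.compile(r"/about/"))
```

In this model the archives page is dropped entirely, but the PDF embedded in the Albania page is still captured, because no rule ever looks at resource URLs.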

"There is also this. Let me know if I should make a separate issue for it."

This is probably way simpler to do. You should open a dedicated issue, yes. We will keep this issue focused on the CIA Factbook lite version (which could still make sense).