Open gd4c opened 6 months ago
@benoit74 is this request doable?
@gd4c could you please specify the exact list of web pages that would cover the country/region texts, flags and maps? I'm not sure I understand what you are referring to, and I would like to avoid too much back-and-forth on this request.
@RavanJAltaie yes, this is probably doable with custom --include settings, tbc with the exact list of pages.
I'd like to discuss this with @RavanJAltaie first as it would be nice to have a policy regarding custom content so as to avoid discussing ad hoc requests. Seeing as there are 200+ countries and territories in the world, this one particular request looks tedious.
@Popolechien OK; processing 200+ countries does not necessarily mean that we have to process all of them one by one. Maybe there are some patterns in what has to be included and what has to be excluded.
Honestly, I did not spend even 5 minutes looking into the ZIM to understand why it consumes 6GB and how we could (or could not) create a lite version.
Anyway, your point about having a policy is still valid ^^
@benoit74 Certainly
All 262 countries and regions linked here: https://www.cia.gov/the-world-factbook/countries/
The World & Its Regions: https://www.cia.gov/the-world-factbook/ (scroll down)
Oceans: https://www.cia.gov/the-world-factbook/ (scroll down)
Country Comparisons: https://www.cia.gov/the-world-factbook/references/guide-to-country-comparisons/
For each country/region/ocean, keep the text, maps (and possibly in-text images) but exclude linked PDF, audio and photos.
Take Albania as an example:
In short, if you programmatically exclude .PDF, .mp3, and country landscape photos, you are 95% of the way there.
There is also this. Let me know if I should make a separate issue for it.
@Popolechien Let's discuss this today & decide about it.
> In short, if you programmatically exclude .PDF, .mp3, and country landscape photos, you are 95% of the way there.
In short, this is not feasible as of today; the scraper can only be configured to exclude certain pages, not resources within pages. I've opened an upstream ticket for now: https://github.com/openzim/zimit/issues/278
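To make the request concrete, here is a rough sketch of what a resource-level filter could look like. This is only an illustration of the idea, not how the scraper works today and not its API: the function name `keep_resource`, the extension list and especially the photo path pattern are assumptions, since the actual URL layout of the Factbook photos has not been checked.

```python
import re
from urllib.parse import urlparse

# Extensions named in the request (PDF and audio files).
EXCLUDED_EXTENSIONS = (".pdf", ".mp3")

# Hypothetical pattern for landscape photos: the real Factbook
# URL layout is NOT confirmed, adjust after inspecting the site.
PHOTO_PATH = re.compile(r"/photos?/", re.IGNORECASE)


def keep_resource(url: str) -> bool:
    """Return True if the resource at `url` should stay in the lite version."""
    # Only look at the path, so query strings do not hide the extension.
    path = urlparse(url).path.lower()
    if path.endswith(EXCLUDED_EXTENSIONS):
        return False
    if PHOTO_PATH.search(path):
        return False
    return True


if __name__ == "__main__":
    # Example URLs; everything on example.org is made up for illustration.
    for candidate in (
        "https://www.cia.gov/the-world-factbook/countries/",
        "https://example.org/docs/summary.PDF",
        "https://example.org/audio/anthem.mp3?download=1",
        "https://example.org/photos/beach.jpg",
    ):
        print(candidate, "->", "keep" if keep_resource(candidate) else "drop")
```

A filter like this would have to run on every resource a page loads, not just on the pages themselves, which is exactly the part that is missing today.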
> There is also this. Let me know if I should make a separate issue for it.
This is probably way simpler to do. You should open a dedicated issue, yes. And we will keep this issue focused on CIA Factbook lite version (which could still make sense).
The full CIA factbook zim is > 6GB. I don't really need all the included PDF documents and other space consumers. Just the country/region texts, flags and maps. Shouldn't be more than a couple hundred MB.
Thanks for your work.