openzim / mwoffliner

Mediawiki scraper: all your wiki articles in one highly compressed ZIM file
https://www.npmjs.com/package/mwoffliner
GNU General Public License v3.0

How to download only pages of a category? #2066

Open LAfricain opened 1 month ago

LAfricain commented 1 month ago

I would like to create a file containing all the pages in the Ancien_Testament category. I ran this:

mwoffliner --mwUrl=https://fr.wikipedia.org/ --getCategories=Ancien_Testament --outputDirectory=./Bible --adminEmail=xxxx@xxxx.e --verbose

But it seems to download much more than that! How can I get only the pages of this category, and how can I add more than one category? For instance, I would like to have the Ancien_Testament and Nouveau_Testament categories together...

audiodude commented 1 month ago

Where are you getting the param --getCategories from? I don't see it in https://github.com/openzim/mwoffliner/blob/main/src/parameterList.ts

In general, mwoffliner does not have the concept of wiki "categories"; it only operates on "article lists".

However you could use WP1 to do this.

  1. Login to WP1: https://wp1.openzim.org/
  2. Go to https://wp1.openzim.org/#/selections/petscan to create a "Petscan collection".
  3. Select fr.wikipedia.org and use this Petscan URL in the URL field: https://petscan.wmcloud.org/?psid=28962290
  4. Wait for your selection and ZIM file to be created.
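
If you would rather feed an article list to mwoffliner directly, the idea can be sketched in Python as below. This is only an illustration under assumptions: it assumes the --articleList option (check parameterList.ts for your version) accepts a path to a UTF-8 file with one title per line, and the helper names write_article_list and build_command are hypothetical, not part of mwoffliner.

```python
import shlex
from pathlib import Path


def write_article_list(titles, path):
    """Write unique, non-empty titles to `path`, one per line (UTF-8).

    Returns the number of titles written.
    """
    unique = sorted({t.strip() for t in titles if t.strip()})
    Path(path).write_text("\n".join(unique) + "\n", encoding="utf-8")
    return len(unique)


def build_command(mw_url, admin_email, article_list, output_dir):
    """Build a shell-safe mwoffliner command line.

    Assumption: --articleList takes a path to a file of titles.
    """
    args = [
        "mwoffliner",
        f"--mwUrl={mw_url}",
        f"--adminEmail={admin_email}",
        f"--articleList={article_list}",
        f"--outputDirectory={output_dir}",
    ]
    return " ".join(shlex.quote(a) for a in args)
```

Calling build_command("https://fr.wikipedia.org/", "xxxx@xxxx.e", "./titles.txt", "./Bible") only returns the command string; it does not run anything, so you can inspect it before launching a scrape.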

LAfricain commented 1 month ago

@audiodude thank you for the link to wp1.openzim.org; someone sent me there yesterday. It can help me. Thank you for the Petscan link, it's exactly what I wanted. But how do I add categories to the Petscan query? I would like to have both the Old and New Testament.

As for --getCategories, I got it from the output of:

mwoffliner --help
...
  --getCategories             [WIP] Download category pages

audiodude commented 1 month ago

Petscan takes a list of categories, each entered exactly as the category is named on the wiki. For instance, the article

https://en.wikipedia.org/wiki/Thekla_(daughter_of_Theophilos)

has the categories:

9th-century births | 9th-century deaths | (and others....)

You can enter either or both of these in the "Categories" box at https://petscan.wmcloud.org/. If you want every page that is in any of the categories, use the "Union" button under "Combination". Also be sure to set "Depth" to an appropriate value so that subcategories are included.
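
To make "Union" and "Depth" concrete, here is a rough Python sketch of the same selection logic. It is not Petscan's implementation: it is an approximation that uses the MediaWiki Action API (list=categorymembers), and the function names fetch_members and union_of_categories are made up for this example. Depth 0 means only the listed categories, depth 1 adds their direct subcategories, and so on.

```python
import json
import urllib.parse
import urllib.request


def fetch_members(api_url, category):
    """Return (articles, subcategories) found directly in one category.

    Queries the MediaWiki Action API and follows 'continue' tokens
    until the listing is exhausted. Performs network requests.
    """
    articles, subcats = set(), set()
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": "Category:" + category,
        "cmnamespace": "0|14",  # 0 = articles, 14 = categories
        "cmlimit": "500",
        "format": "json",
    }
    while True:
        url = api_url + "?" + urllib.parse.urlencode(params)
        request = urllib.request.Request(
            url, headers={"User-Agent": "category-union-sketch/0.1"}
        )
        with urllib.request.urlopen(request) as response:
            data = json.load(response)
        for member in data["query"]["categorymembers"]:
            if member["ns"] == 14:
                # Drop the localized namespace prefix, e.g. "Catégorie:".
                subcats.add(member["title"].split(":", 1)[1])
            else:
                articles.add(member["title"])
        if "continue" not in data:
            return articles, subcats
        params.update(data["continue"])


def union_of_categories(categories, depth, get_members):
    """Union of the articles in `categories`, descending `depth` levels.

    `get_members(category)` must return (articles, subcategories).
    Categories already visited are skipped, so cycles terminate.
    """
    articles, seen = set(), set()
    frontier = set(categories)
    for _ in range(depth + 1):
        next_frontier = set()
        for category in frontier - seen:
            seen.add(category)
            pages, subcats = get_members(category)
            articles |= pages
            next_frontier |= subcats
        frontier = next_frontier
    return articles
```

For the two categories named in the question, a call might look like union_of_categories(["Ancien_Testament", "Nouveau_Testament"], 1, lambda c: fetch_members("https://fr.wikipedia.org/w/api.php", c)). Passing the lookup as a parameter keeps the union and depth logic separate from the network code, so it can be checked against a small in-memory category tree first.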

As for your remark that you found --getCategories in the help output:

mwoffliner --help
...
  --getCategories             [WIP] Download category pages

This is an experimental feature in an older version of mwoffliner that was never fully developed. I also believe, from looking at the code, that the intent was to fetch the "Category pages" of articles, not to download articles based on a given category.

Hope this helps.