openzim / wp1_selection_tools

Create selections with the best articles of a WM project
https://download.kiwix.org/wp1/
GNU General Public License v3.0

Where do article lists come from? #37

Closed wolfgang42 closed 2 years ago

wolfgang42 commented 2 years ago

(Filing this here as it's the closest I've found to an upstream but please redirect me if I've ended up in the wrong place.)

I've downloaded wikipedia_en_computer_nopic.zim from the Kiwix wiki and am playing around with it, but I'm pretty confused by the article selection. For example, the article The Art Life (“a blog about the art scene in Sydney, Australia”) is included, but ISO 8601 (an extremely common date format) is not.

I stumbled upon the zim-requests repo, which led me to the zimfarm config for this bundle, which seems to derive from a Computing.tsv. From what I gather via this comment I found, this list is managed by something in this (wp1_selection_tools) repo, but for me the trail ran cold here. I poked around in the scripts briefly and didn't see anything obvious, though I didn't delve too deeply.

Can you point me in the right direction to find where these lists are derived from? I assume there are categories or WikiProjects somewhere but I can't seem to find any documentation on this.

kelson42 commented 2 years ago

@wolfgang42 Your detailed description is right and you have understood things properly. I suspect this recent ticket about the very same ZIM file might help you: https://github.com/openzim/mwoffliner/issues/1531. Let me know.

wolfgang42 commented 2 years ago

Yes, that does look like a similar question. I see you wrote:

This zim file is based on http://download.openzim.org/wp1/enwiki/projects/Computing.tsv which is the list of articles of the computing project https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Computing.

which is about what I figured (I wasn't sure if there were other sources as well), but it's not clear to me how to get from one to the other. Is there a certain category hierarchy that gets crawled by something to produce the full article list?

At a guess, the criterion seems to be that the article's talk page is in Category:All Computing articles, probably placed there by Template:WikiProject Computing or a derivative. Is that correct, and if so, where would I find the code that does that?

kelson42 commented 2 years ago

The WikiProject articles are gathered (into the wp1 database) by the wp1 bot (openzim/wp1) and written to a TSV list by the code in this repo.

If you want some insight into our wp1 database, you can go to https://wp1.openzim.org.

... and your last sentence is correct.
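
For anyone who wants to check that criterion by hand, here is a minimal sketch that lists the talk pages in Category:All Computing articles through the public MediaWiki API and maps each one back to its article title. This is not the wp1 bot's code (that lives in openzim/wp1) and it does not reproduce how Computing.tsv is generated; the function names and the User-Agent string are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"
TALK_PREFIX = "Talk:"


def build_params(category, cmcontinue=None):
    """Build MediaWiki API parameters listing talk pages in a category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmnamespace": "1",  # namespace 1 = article talk pages
        "cmlimit": "500",
        "format": "json",
    }
    if cmcontinue is not None:
        params["cmcontinue"] = cmcontinue
    return params


def talk_to_article(title):
    """Map 'Talk:Some page' to 'Some page'; return None for other titles."""
    if title.startswith(TALK_PREFIX):
        return title[len(TALK_PREFIX):]
    return None


def articles_from_response(data):
    """Extract article titles from one decoded API response."""
    members = data.get("query", {}).get("categorymembers", [])
    titles = (talk_to_article(m["title"]) for m in members)
    return [t for t in titles if t is not None]


def fetch_articles(category="Category:All Computing articles"):
    """Yield article titles, following API continuation (needs network)."""
    cmcontinue = None
    while True:
        query = urllib.parse.urlencode(build_params(category, cmcontinue))
        request = urllib.request.Request(
            API_URL + "?" + query,
            # Hypothetical User-Agent; replace with your own contact details.
            headers={"User-Agent": "wp1-list-sketch/0.1"},
        )
        with urllib.request.urlopen(request) as response:
            data = json.load(response)
        yield from articles_from_response(data)
        cmcontinue = data.get("continue", {}).get("cmcontinue")
        if cmcontinue is None:
            break
```

Iterating over `fetch_articles()` and comparing the titles against the lines of Computing.tsv should show whether a given article is missing because its talk page is not tagged for the project, or for some other reason further down the pipeline.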