openzim / wp1_selection_tools

Create selections with the best articles of a WM project
https://download.kiwix.org/wp1/
GNU General Public License v3.0

The WP1 Selection tools gather and compile multiple indicators to produce selections of Wikipedia article subsets. They were created for the Wikipedia 1.0 project and complement the WP1 engine.

The results are made available at https://download.openzim.org/wp1.


Requirements

To run it, you need:

Context

Many Wikipedias, in different languages, have more than 500,000 articles. Even though we can provide offline versions of a reasonable size, this is still too much for many devices. That is why we need to build offline versions containing only a selection of the best articles.

Principle

This tool builds lists of key values (pageviews, links, ...) about Wikipedia articles and puts them in a directory. These key values are the only input available for building smart selection algorithms. To get more details about the lists, read the README in the language-based directory.
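As an illustration of how such a list could feed a selection, here is a minimal sketch that keeps the most-viewed articles. The function name `top_articles` and the input layout (one `title<TAB>pageviews` pair per line) are assumptions made for this example, not the format produced by the tools; check the README in the language directory for the real layout.

```shell
# Print the N most-viewed article titles from a tab-separated file.
# ASSUMPTION: each line is "<title><TAB><pageviews>"; this layout is
# hypothetical and only serves the example.
top_articles() {
    count=$1
    file=$2
    # Sort numerically on the second column, highest first,
    # keep the first N lines and print only the titles
    LC_ALL=C sort -t "$(printf '\t')" -k2,2nr "$file" | head -n "$count" | cut -f1
}
```

For instance, `top_articles 1000 pageviews.tsv` would print the 1000 most-viewed titles of a (hypothetical) `pageviews.tsv` file. A real selection would likely combine several indicators instead of relying on pageviews alone.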

Tools

Download

You can download the output of these scripts directly from download.kiwix.org/wp1/ using FTP, HTTP(S) or rsync.

You might be interested in downloading only the latest version. Here is a small rsync-based command that retrieves the right directory names.

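LAST=""
for ENTRY in $(rsync --recursive --list-only download.kiwix.org::download.kiwix.org/wp1/ | tr -s ' ' | cut -d ' ' -f5 | grep wiki | grep -v '/' | sort -r)
do
    # Strip the _YYYY-MM date suffix to get the project name
    RADICAL=$(echo "$ENTRY" | sed 's/_20[0-9][0-9]-[0-9][0-9]//g')
    # Entries are sorted newest first, so print only the first one per project
    if [[ "$LAST" != "$RADICAL" ]]
    then
        echo "$ENTRY"
        LAST="$RADICAL"
    fi
done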
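The filtering step can also be isolated in a small function so that it can be tried without network access. This is only a sketch: the function name `latest_entries` and the sample directory names are hypothetical, and it assumes that every name ends with a `_YYYY-MM` suffix.

```shell
# Print the newest entry of each project.
# Reads directory names on stdin, one per line (e.g. "enwiki_2020-05").
latest_entries() {
    # Reverse sort puts the newest date first within each project
    LC_ALL=C sort -r | {
        last=""
        while read -r entry
        do
            # Drop the _YYYY-MM suffix to obtain the project name
            radical=${entry%_20[0-9][0-9]-[0-9][0-9]}
            if [ "$last" != "$radical" ]
            then
                printf '%s\n' "$entry"
                last=$radical
            fi
        done
    }
}
```

For example, `printf '%s\n' enwiki_2020-04 enwiki_2020-05 frwiki_2020-05 | latest_entries` prints `frwiki_2020-05` and `enwiki_2020-05`. The rsync listing, once reduced to bare directory names, can be piped into the function in the same way.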

VPS

To run it on a VPS via Docker:

docker run -d --name wp1_selection_tools \
  -v /srv/wp1_selection_tools/data:/data \
  -v /srv/wp1_selection_tools/.ssh/:/root/.ssh \
  -v /srv/wp1_selection_tools/replica.my.cnf:/root/replica.my.cnf \
  ghcr.io/openzim/wp1_selection_tools

License

GPLv3 or later, see LICENSE for more details.