openzim / wp1_selection_tools

Create selections with the best articles of a WM project
https://download.kiwix.org/wp1/
GNU General Public License v3.0

When 1M articles isn't enough (and 6.25M articles is too many, e.g. with EN WP!) #33

Closed: holta closed this issue 3 years ago

holta commented 3 years ago

The file 1000000.tsv at http://download.openzim.org/wp1/enwiki_2020-12/tops/ (indicating the 1 Million "top" articles) is not large enough for communities that want larger extracts of Wikipedia that contain ~2 Million or more articles.

These are impoverished communities that cannot afford the disk space for all 6.25 Million articles (almost 100GB in the case of English Wikipedia; microSD cards that large also invite theft).

Can longer-than-1M (ordered) lists of Wikipedia articles please be made available in the future? Thank you all for considering!

Ref: openzim/mwoffliner#1399

kelson42 commented 3 years ago

@holta why not take the list with all titles and cut it at 2M yourself?

holta commented 3 years ago

> @holta why not take the list with all titles and cut it at 2M yourself?

Is there an ordered list (of all 6.25M articles) published monthly somewhere?

(Do you know where I should look for it if so?)

kelson42 commented 3 years ago

scores.tsv, see http://download.openzim.org/wp1/enwiki_2021-02/README

holta commented 3 years ago

Great! Sorry I didn't realize.
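
For anyone landing here later, a minimal Python sketch of the "cut it yourself" approach suggested above. It assumes scores.tsv is tab-separated with the article title in the first column and a numeric score in the second, has no header row, and is not necessarily sorted; check the README linked above for the actual layout before relying on this. The function names `top_titles` and `write_selection` and the output filename are made up for illustration and are not part of wp1_selection_tools.

```python
import csv
import heapq
from typing import Iterable, List


def top_titles(rows: Iterable[List[str]], limit: int) -> List[str]:
    """Return the `limit` highest-scoring titles, best first.

    ASSUMPTION: each row is [title, score, ...]; rows with fewer than
    two columns are skipped.
    """
    scored = ((float(row[1]), row[0]) for row in rows if len(row) >= 2)
    # Rank on the score only, so ties keep their original file order.
    best = heapq.nlargest(limit, scored, key=lambda pair: pair[0])
    return [title for _, title in best]


def write_selection(scores_path: str, out_path: str, limit: int = 2_000_000) -> None:
    """Read a scores file and write the top `limit` titles, one per line."""
    with open(scores_path, newline="", encoding="utf-8") as src:
        # QUOTE_NONE: treat quote characters in titles as ordinary text.
        reader = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
        titles = top_titles(reader, limit)
    with open(out_path, "w", encoding="utf-8") as dst:
        for title in titles:
            dst.write(title + "\n")
```

Something like `write_selection("scores.tsv", "2000000.tsv")` would then produce a 2M list after downloading scores.tsv for the month you want. If the file turns out to be already sorted by score, simply taking the first 2,000,000 lines is enough and the ranking step is unnecessary.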