kanasimi / wikiapi

JavaScript MediaWiki API for node.js
https://kanasimi.github.io/wikiapi/
BSD 3-Clause "New" or "Revised" License

`.download()` scalability? #51

Closed. hugolpz closed this issue 2 years ago.

hugolpz commented 2 years ago

Hi there, I'm using WikiapiJS to code a wikiapi-egg (script) which will download all Commons files from target categories. My 3 largest target categories currently have about 50k audio files each, each file being about 1.5 KB. Do you know:

Scale up

The aim is to provide the public with direct and convenient dumps of LinguaLibre's audio assets on a per-language basis. We want to create periodic (weekly?) dumps on our Lili server.

We want to keep a local dump synchronized with Wikimedia Commons. We are talking about 700,000 files so far. Based on the test durations above, the initial synchronization would take about 21 days, which is acceptable. But a later "update" a week on would still take about 15 days, even though only 1~2% of the files (7,000-15,000) would be new and actually need downloading.

Do you see any possible optimizations?

WikiapiJS download worked on tiny categories (12 files); see the code in #48. I'm currently reluctant to test further for fear of being banned.
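
For context, a minimal sketch of the kind of call involved (not the actual #48 script): the category name and target directory below are placeholders, and the `new Wikiapi('commons')` shorthand and the `directory` option follow the wikiapi README.

```js
// Mirror one Commons category into a local folder with WikiapiJS.
const Wikiapi = require('wikiapi');

(async () => {
  const wiki = new Wikiapi('commons');
  // .download() accepts a category title and fetches its member files.
  await wiki.download('Category:Lingua Libre pronunciation-fra', {
    directory: './dumps/fra/',
  });
})();
```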


.download() benchmark (1)

Ok, I decided to test anyway on a category with n=369.

kanasimi commented 2 years ago
  1. I did not anticipate this use case, but I left room for the possibility.
  2. I have the apihighlimits permission on Wikimedia Commons, so I often rely on it... but it should also work for users without apihighlimits. There should be no hard limit on downloading large categories (see the paging sketch after this list).
  3. I have processed categories with 100K+ files.
  4. Yes, the library will skip files that already exist.
  5. No, I have not coded this yet.
  6. I have never heard of Wikimedia Commons blocking people from downloading files, so...
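
To make the "no hard limit" point concrete: the underlying MediaWiki `list=categorymembers` query is paged with `continue` tokens, returning up to 500 titles per request for ordinary accounts (5000 with `apihighlimits`), so even a 250k-file category can be enumerated in a few hundred requests. A minimal counting sketch against the live Commons API, independent of wikiapi itself (Node 18+ `fetch`; the category name is a placeholder):

```js
// Count the files in a (possibly very large) Commons category by paging
// through list=categorymembers with "continue" tokens. Works with or
// without apihighlimits; only the page size (500 vs 5000) differs.
const api = 'https://commons.wikimedia.org/w/api.php';

async function countCategoryFiles(category) {
  let total = 0;
  let cont = {};
  while (true) {
    const params = new URLSearchParams({
      action: 'query', format: 'json', list: 'categorymembers',
      cmtitle: category, cmtype: 'file', cmlimit: 'max',
      ...cont,
    });
    const data = await (await fetch(api + '?' + params)).json();
    total += data.query.categorymembers.length;
    if (!data.continue) break;
    cont = data.continue; // e.g. { cmcontinue: '...', continue: '-||' }
  }
  return total;
}

// countCategoryFiles('Category:Lingua Libre pronunciation-fra').then(console.log);
```
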
kanasimi commented 2 years ago

What categories do you want to synchronize?

hugolpz commented 2 years ago

The aim is to provide convenient dumps for each category in Category:Lingua_Libre_pronunciation. The largest ones are ben (60k files) and fra (250k files). The 130 categories together contain 700,000 files.

The point for WikiapiJS `.download()` is scalability: the ability to handle such large categories with resilience and speed, both for the initial download and for later periodic updates, ideally weekly.
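
One shape a weekly update could take (a sketch of an idea, not an existing wikiapi feature): restrict the category listing itself to files added since the previous run, using `cmsort=timestamp` with `cmstart`/`cmdir=newer`, and hand only those titles to `.download()`. The category name, timestamp, and directory are placeholders; the `new Wikiapi('commons')` constructor and `directory` option follow the wikiapi README.

```js
// Sketch: download only files added to a category since the last sync, so the
// listing covers only the ~1-2% of new files instead of all 700,000.
const Wikiapi = require('wikiapi');
const api = 'https://commons.wikimedia.org/w/api.php';

async function downloadNewFiles(category, since, directory) {
  const wiki = new Wikiapi('commons');
  let cont = {};
  while (true) {
    const params = new URLSearchParams({
      action: 'query', format: 'json', list: 'categorymembers',
      cmtitle: category, cmtype: 'file', cmlimit: 'max',
      cmsort: 'timestamp', cmdir: 'newer', cmstart: since,
      ...cont,
    });
    const data = await (await fetch(api + '?' + params)).json();
    for (const member of data.query.categorymembers) {
      // Each new title is fetched individually; wikiapi reportedly also
      // skips files that already exist locally (answer 4 above).
      await wiki.download(member.title, { directory });
    }
    if (!data.continue) break;
    cont = data.continue;
  }
}

// Weekly cron run, passing the previous run's timestamp:
// downloadNewFiles('Category:Lingua Libre pronunciation-fra',
//                  '2022-05-01T00:00:00Z', './dumps/fra/');
```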

kanasimi commented 2 years ago

Well, it seems I need to do some work...

hugolpz commented 2 years ago

Nice!

hugolpz commented 2 years ago

This scale-up question is handled in two related issues: