`.download()` scalability ?

hugolpz commented 2 years ago

Hi there, I'm using WikiapiJS to code a wikiapi-egg (script) which will download all Commons files from target categories. My 3 largest target categories currently have about 50k audios files each, files being of 1.5KB each. Do you know:

Q1: Does WikiapiJS has such scale up in mind ?
Q2: What are the API limitations for such mass download ?
- Listing items: Mediawiki API has a Categorymembers limit. cmlimit=500 for regular users, cmlimit=5000 if apihighlimits userright.
- Downloads: I don't see limits on download themselves.
Q3: Do you handle categories with more than 500 files successfully ? (API limit)
Q4: Do you skip already downloaded files efficiently ? (quickly)
Q5: Do you compare local and remote files creation dates so to re-download from Commons when a new version is available ?
Q6: What should i avoid to not be blocked ?

Scale up

It's to provides the public direct and convenient dumps of LinguaLibre's audio assets on a per language basis. We want to create periodic (weekly?) dumps on our Lili server.

We want to keep a local dump synchronized based on Wikimedia Commons. We are talking about 700,000 files so far. According to tests duration above, the initial synchronization would take 21 days, that is ok. But the later "updates" a week later would require about 15 days while only 1~2% of new files (7,000-15,000) will require a download.

Do you have possible optimization at sight ?

WikiapiJS download worked on tiny categories (files =12). See #48 code. I'm currently reluctant to test further by fear of being banned.

`.download()` bentchmark (1)

Ok, I decided to test anyway on a category with n=369.

Initial attempt :
- categorymembers=369
- downloads=369
- runing time: 16min or 960sec --> 2.7s./file
Removed 14 files from local directory
Update attempt:
- categorymembers=369
- downloads=14
- runing time: 9min or 540sec --> 38.6s./file

kanasimi commented 2 years ago

I did not imagine this, but I reserved the possibility for this.
You know I have the apihighlimits permission in wiki commons, so I often use this... but it should work for users without apihighlimits permission too. There should no limit for downloading large categories.
I have processed for categories with 100K+ files.
Yes, the library will skip files existed.
No. I have not coding this yet.
I have never heard of wiki commons blocking peoples downloading files, so...

kanasimi commented 2 years ago

What categories do you want to synchronize?

hugolpz commented 2 years ago

Aim is to provide convenient dumps for each category in Category:Lingua_Libre_pronunciation. The largest ones are 60k (ben) and 250k (fra) files strong. The whole 130 categories contain 700,000 files.

The point for WikiapiJS .download() is scalability, ability to handle such large categories with resilience and speed, their initial download and their later periodic update. Ideally weekly.

kanasimi commented 2 years ago

Well, it seems I need to do some works...

hugolpz commented 2 years ago

Nice !

hugolpz commented 2 years ago

This scale up question is handled in two related issues:

[x] #55
[x] #53
[x] #57

kanasimi / wikiapi