`.download()` : reduce calls to api.php, directly hit on https://upload.wikimedia.org

kanasimi / wikiapi

JavaScript MediaWiki API for node.js

https://kanasimi.github.io/wikiapi/

BSD 3-Clause "New" or "Revised" License

47 stars 5 forks source link

`.download()` : reduce calls to api.php, directly hit on https://upload.wikimedia.org #53

Closed hugolpz closed 2 years ago

hugolpz commented 2 years ago

/* ************************************************************* */
/* *********                                      ************** */
/* *********       REWRITING ONGOING.             ************** */
/* *********         DO NOT READ YET.             ************** */
/* *********                                      ************** */
/* ************************************************************* */

I don't know how cejs and WikiapiJS' .download() currently handles its queries, but I suspect it list the target files in an array via .categorymembers(), .search() or others which returns something such:

Categorymembers files:

// var files = await targetwiki.categorymembers('Category:Lingua Libre pronunciation-cmn', { namespace: 'File' });
// returns : 
[
  {"pageid":98560779,"ns":6,"title":"File:LL-Q9192 (cmn)-Assassas77-不.wav"},
  {"pageid":98560774,"ns":6,"title":"File:LL-Q9192 (cmn)-Assassas77-了.wav"},
  ....
  {"pageid":98560798,"ns":6,"title":"File:LL-Q9192 (cmn)-Assassas77-什么.wav"},
]

then use pageid to run a new API query for each file, get the url, and download the file.

Limits

Source	Comment
https://commons.wikimedia.org/w/api.php?	API queries have `ratelimit`
https://upload.wikimedia.org	Direct downloads don't (I'm not 100% sure for that 😅 ).

Wikimedia API and group queries

There are Special:ApiSandbox queries which using a single API call can fetch by category name few hundred category member files, with exact file url and timestamp.

{
  "batchcomplete": "",
  "continue": {
    "gcmcontinue": "file|313235300a4c4c2d513135302028465241292d504f534c4f56495443482d313235302e574156|88497069",
    "continue": "gcmcontinue||"
  },
  "query": {
    "pages": {
      "82101585": { ... },
      "104331639": {
        "pageid": 104331639,
        "ns": 6,
        "title": "File:LL-Q150 (fra)-Kitel WP-%.wav",
        "imagerepository": "local",
        "imageinfo": [{
          "timestamp": "2021-04-25T15:49:00Z",
          "url": "https://upload.wikimedia.org/wikipedia/commons/a/a3/LL-Q150_%28fra%29-Kitel_WP-%25.wav",
          "descriptionurl": "https://commons.wikimedia.org/wiki/File:LL-Q150_(fra)-Kitel_WP-%25.wav",
          "descriptionshorturl": "https://commons.wikimedia.org/w/index.php?curid=104331639"
        }]
      },
      "104381091": { ... }
    }
  }
}

title, timestamp and url are the most relevant properties I believe.

Url

For files, the url property gives a direct download url allowing download from upload.wikimedia.org without additional API query on https://commons.wikimedia.org/w/api.php?. With one API query we can have 500 direct url to download from at higher speed.

Sustained "burst" management

The Wikimedia Discord api-bot channel made several input to this projects:

meta:User-Agent_policy
Discord discussion
- concensus: limit to 5~8 concurrent downloads.
- @Anticomposite: make requests from only one IP
- @Ciencia-Al-Poder: file requests are not stored on normal filesystem. Due to the large set of files uploaded to commons, they're stored on swift (IIRC), and it may incure in some overhead specially when requesting files that nobody requested in a long time
- @Marreromarco using Wikiget reported an average of 10 download per seconde (730,000 files in 20hours).
- Done by running Wikiget in 20 terminals in parallel.
- An average of 100 downloads per second is not unheard of but likeky requires careful request header and whitelisting.
- Resilience when internet connexion is unstable may help.

hugolpz commented 2 years ago

Hello Kanasimi,

Current speed

The .download() bentchmark section above shows .download() to have a conservatively slow performance. Per code analysis, I understand .download() takes as input objects such as:

{ 
  "pageid":104331639,
  "ns":6,
  "title":"File:LL-Q150 (fra)-Kitel WP-%.wav"
}

Such input means .download() initially miss the url and timestamp, therefore must run one additional API call per file, which is rate limited to get the url and timestamp. This results in the observed speed limitation of about 1 file per 2.7 seconds. This ratelimited API call also occurs when the file is eventually skipped because already present locally.

Speed optimization idea

However, await targetwiki.categorymembers('Category:Lingua Libre pronunciation-cmn', { namespace: 'File' });, when operating with namespace: 'File' (ns:6), could leverage smarter API queries requiring imageinfo (see section "Wikimedia API and group queries" above). Each smart API call run within .categorymember() can return 500 richer objects such as :

  { 
    "pageid":104331639,
    "ns":6,
    "title":"File:LL-Q150 (fra)-Kitel WP-%.wav",
    "timestamp": "2021-04-25T15:49:00Z",
    "url": "https://upload.wikimedia.org/wikipedia/commons/a/a3/LL-Q150_%28fra%29-Kitel_WP-%25.wav"
  },

If input with such format is passed down, .download() can :

skip the ratelimited API call on https://commons.wikimedia.org/w/api.php
url value used to directly download from https://upload.wikimedia.org. According to discussion on Wikimedia's Discord, this api point accept 20~100 times more requests than the other one (if header request properly set). -timestamp value used to directly compare with local files creation times. Files can be compared in milliseconds instead of 0.29 seconds. (factor x10-100)

Code to do (proposal)

If .categorymembers() looks for Files, then use smart api generator, return array of object with pageid, ns, title AND url, timestamp.
If .download()'s input options contains url && timestamp, then :
- skip call on https://commons.wikimedia.org/w/api.php / do not call API
- compare: if local file with same name exists, directly compare timestamp value with local file timestamp,
- download: if no local file with same name OR update needed, directly download file via query on https://upload.wikimedia.org.

hugolpz commented 2 years ago

Questions

Q1: @kanasimi, the performances of this last benchmark and some calculation suggest that either the code is speed limited or each .download() run still make

one API call on https://commons.wikimedia.org/w/api.php to fetch url and timestamp parameters
one call on https://upload.wikimedia.org for the file itself.

Is that correct ?

Q2: How do you limit the speed ? You simply make those calls in synchronous js, right ?

Comment

I suspect some optimization ideas here could still be implemented if wanted.

hugolpz commented 2 years ago

Hello Kanasimi, Benchmark 3 done ! : ) I'm a bit blind on your recent changes but they appear to have speeded up the initial downloads.

We need slower coding and more discussion please. 🐨 🦦 I think the changes you implemented are different from those proposed here. Since you are good in JS you are used to go straight into coding, you move faster this way and it's usually the best choice, you indeed came up with nice improvements. But by doing so I'm sidelined, I don't know what you are coding, I can't confirm we think about the same idea. We both push for performance improvements, but it seem we pushed for different solutions. Can we discuss the proposed change above so we confirm we talk about the same thing, and we properly look at it, accept or deny it ? A brief explanation why this proposal is not practical for cejs, which is also possible, would be ok. See also the questions here where I try to understand the current code's approach.

.download() bentchmark (3)

I tested today with n=805.

Initial downloads :
- categorymembers=805
- downloads=805
- runing time: 9'53" (593secs)
- average : 0.74 sec/download.
Removed 20 files from local directory
Update downloads:
- categorymembers=805
- downloads=20
- runing time: 5'20" (320secs)
- average : 0.40 sec/file
- average : 16 sec/download

Performance increase

Initial downloads: 3.8 times (2.7s->0.74s) improvement
Update downloads: 2.4 times (38.6s -> 16s) improvement

compated to .download() bentchmark 7 days prior.** 🎇

Inpecting `categorymembers()`

I see

var files = await targetwiki.categorymembers(targetCategoryName, { namespace: 'File' });
console.log(files[0], files[1], files[2])

...still currently returns (cejs from today):

{
  pageid: 113936425,
  ns: 6,
  title: 'File:LL-Q9192 (cmn)-Luilui6666-102 块钱.wav'
} {
  pageid: 113936375,
  ns: 6,
  title: 'File:LL-Q9192 (cmn)-Luilui6666-102块钱.wav'
} {
  pageid: 113936358,
  ns: 6,
  title: 'File:LL-Q9192 (cmn)-Luilui6666-17-45.wav'
}

this is then passed to .download(). It miss the url and timestamp parameters which would allow faster execution with 500 times fewer API calls on ratelimited https://commons.wikimedia.org/w/api.php. See proposal here.

kanasimi commented 2 years ago

Hi.

When executing session.download('Category:name', ...), wiki_API_download() will:

Get category tree without files, using session.category_tree(). session.category_tree() will use categoryinfo and categorymembers (category only) to increase speed.
Back to wiki_API_download(). For each category, get imageinfo (with URL, latest date) with generator:categorymembers to get files in category.
For each file, check timestamp and download new file.

kanasimi commented 2 years ago

You can try this now: await wiki.download('Category:Lingua Libre pronunciation', { directory: './downloads', max_threads: 4 });

hugolpz commented 2 years ago

Hi, thanks for this explanation above. 👍🏼

For each file, check timestamp (#55) and download new file (#53).

IMPORTANT/CORE OF THIS ISSUE #53: what is your actual process in download new file, do you mean :

A)
- a) use the pageid
- b) call to commons.wikimedia.org/w/api.php to fetch the true url value, (rate limited, slow!)
- c) then hit on upload.wikimedia.org via url value OR!
B)
- a) use generator:categorymembers's previously stored 500 url values (500 times faster since one API call for 500 files! As now used for file comparison)
- b) then directly hit on upload.wikimedia.org via url value (same)

???

kanasimi commented 2 years ago

All imageinfo are get with generator:categorymembers. Should be B.

It seems the reason you did not use session.download('Category:name', ...) is that there is not a filter. Now you can try:

await wiki.download('Category:name', {
    directory: './',
    max_threads: 4,
    page_filter(page_data) {
        return page_data.title.includes('word');
    }
});

Multiple threads will much faster.

hugolpz commented 2 years ago

I'am double checking my packages and running a benchmark.

[EDIT1:] looks pretty good. about 200 downloads / minute.

[EDIT2:] A tiny surprise (see #56 ) but was easy to fix, use:

await targetwiki.download(
  targetwiki.to_namespace('Lingua_Libre_pronunciation-cmn', 'Category'), // <===========
  {....}
);

`.download()` benchmarks

One week ago (22.01.01)

Initial downloads :
- categorymembers=369
- downloads=369
- runing time: 960sec --> 2.7s./file

Before today (22.01.13)

Initial downloads :
- categorymembers=829
- downloads=829
- runing time: 760s --> 0.9s/ ⬇️

After today's fix (22.01.14)

Initial downloads :
- categorymembers=829
- downloads=829
- runing time: 250s --> 0.30s/ ⬇️

Note

It's 3 times faster ! This could be attributed to multi-threads.

According to discord's chat 2 weeks ago :

An average of 100 downloads calls per second on https://upload.wikimeda.org is not unheard of but likeky requires careful request header and whitelisting.

So if the code uses solution B) from above, it can technically go much faster. I do not encourage so, a maximum should be about 5~8 download per second.

Questions

See #57.

Closure ?

@Kanasimi, if your code uses the solution B) from above (one single api.php call for 500 files and respective https://upload.wikimedia.org 500 calls), then this current issue 53 can be closed. This is pretty good already 🚀 👍🏼

kanasimi commented 2 years ago

kanasimi / wikiapi

`.download()` : reduce calls to api.php, directly hit on https://upload.wikimedia.org #53

Limits

Wikimedia API and group queries

Url

Sustained "burst" management

Current speed

Speed optimization idea

Code to do (proposal)

Questions

Comment

.download() bentchmark (3)

Inpecting categorymembers()

.download() benchmarks

One week ago (22.01.01)

Before today (22.01.13)

After today's fix (22.01.14)

Note

Questions

Closure ?

Inpecting `categorymembers()`

`.download()` benchmarks