kanasimi / wikiapi

JavaScript MediaWiki API for node.js
https://kanasimi.github.io/wikiapi/
BSD 3-Clause "New" or "Revised" License

`.download()` : reduce calls to api.php, directly hit on https://upload.wikimedia.org #53

Closed hugolpz closed 2 years ago

hugolpz commented 2 years ago
/* ************************************************************* */
/* *********                                      ************** */
/* *********       REWRITING ONGOING.             ************** */
/* *********         DO NOT READ YET.             ************** */
/* *********                                      ************** */
/* ************************************************************* */

I don't know how cejs and WikiapiJS' .download() currently handles its queries, but I suspect it lists the target files in an array via .categorymembers(), .search() or similar, which returns something like:

Categorymembers files:

// var files = await targetwiki.categorymembers('Category:Lingua Libre pronunciation-cmn', { namespace: 'File' });
// returns : 
[
  {"pageid":98560779,"ns":6,"title":"File:LL-Q9192 (cmn)-Assassas77-不.wav"},
  {"pageid":98560774,"ns":6,"title":"File:LL-Q9192 (cmn)-Assassas77-了.wav"},
  ....
  {"pageid":98560798,"ns":6,"title":"File:LL-Q9192 (cmn)-Assassas77-什么.wav"},
]

It then uses the pageid to run a new API query for each file, gets the url, and downloads the file.
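
If that guess is right, the per-file step would look roughly like the sketch below (plain node.js 18+ fetch against the public API; the function name is mine, this is not actual cejs code):

// Hypothetical illustration of the suspected per-file flow (not cejs code):
// one rate-limited api.php call per file, only to learn its direct URL.
async function getFileUrl(pageid) {
  const api = 'https://commons.wikimedia.org/w/api.php';
  const params = new URLSearchParams({
    action: 'query',
    pageids: String(pageid),
    prop: 'imageinfo',
    iiprop: 'url|timestamp',
    format: 'json'
  });
  const data = await (await fetch(`${api}?${params}`)).json();
  return data.query.pages[pageid].imageinfo[0].url; // direct upload.wikimedia.org url
}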

Limits

| Source | Comment |
| --- | --- |
| https://commons.wikimedia.org/w/api.php? | API queries have a ratelimit. |
| https://upload.wikimedia.org | Direct downloads don't (I'm not 100% sure about that 😅). |

Wikimedia API and group queries

There are Special:ApiSandbox queries which, in a single API call, can fetch a few hundred category member files by category name, with each file's exact url and timestamp.
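
For example, a query along these lines (the exact category name and limits are illustrative) can return the response below in a single call, with pagination handled through gcmcontinue:

https://commons.wikimedia.org/w/api.php?action=query&generator=categorymembers&gcmtitle=Category:Lingua%20Libre%20pronunciation-fra&gcmtype=file&gcmlimit=500&prop=imageinfo&iiprop=url%7Ctimestamp&format=json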

{
  "batchcomplete": "",
  "continue": {
    "gcmcontinue": "file|313235300a4c4c2d513135302028465241292d504f534c4f56495443482d313235302e574156|88497069",
    "continue": "gcmcontinue||"
  },
  "query": {
    "pages": {
      "82101585": { ... },
      "104331639": {
        "pageid": 104331639,
        "ns": 6,
        "title": "File:LL-Q150 (fra)-Kitel WP-%.wav",
        "imagerepository": "local",
        "imageinfo": [{
          "timestamp": "2021-04-25T15:49:00Z",
          "url": "https://upload.wikimedia.org/wikipedia/commons/a/a3/LL-Q150_%28fra%29-Kitel_WP-%25.wav",
          "descriptionurl": "https://commons.wikimedia.org/wiki/File:LL-Q150_(fra)-Kitel_WP-%25.wav",
          "descriptionshorturl": "https://commons.wikimedia.org/w/index.php?curid=104331639"
        }]
      },
      "104381091": { ... }
    }
  }
}

title, timestamp and url are the most relevant properties I believe.

See also: API:Categorymembers, API:Allimages, API:Imageinfo.

Url

For files, the url property gives a direct download url, allowing downloads from upload.wikimedia.org without any additional API query on https://commons.wikimedia.org/w/api.php?. With one API query we can get 500 direct urls to download from at higher speed.
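
As an illustration, downloading from such a direct url needs no MediaWiki client at all; a minimal node.js 18+ sketch (file naming simplified, User-Agent value is a placeholder):

// Hypothetical sketch: stream one file straight from upload.wikimedia.org,
// reusing the url already returned by the grouped api.php query.
const fs = require('fs');
const { pipeline } = require('stream/promises');
const { Readable } = require('stream');

async function downloadDirect(url, localPath) {
  const response = await fetch(url, {
    // Wikimedia asks automated clients to send a descriptive User-Agent.
    headers: { 'User-Agent': 'example-download-script/0.1 (contact@example.org)' }
  });
  if (!response.ok) throw new Error(`HTTP ${response.status} for ${url}`);
  await pipeline(Readable.fromWeb(response.body), fs.createWriteStream(localPath));
}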

Sustained "burst" management

The Wikimedia Discord api-bot channel provided several inputs to this project:

hugolpz commented 2 years ago

Hello Kanasimi,

Current speed

The .download() benchmark section above shows .download() to have conservatively slow performance. From code analysis, I understand .download() takes as input objects such as:

{ 
  "pageid":104331639,
  "ns":6,
  "title":"File:LL-Q150 (fra)-Kitel WP-%.wav"
}

Such input means .download() initially misses the url and timestamp, and therefore must run one additional, rate-limited API call per file to get them. This results in the observed speed limitation of about 1 file per 2.7 seconds. This rate-limited API call also occurs when the file is eventually skipped because it is already present locally.

Speed optimization idea

However, await targetwiki.categorymembers('Category:Lingua Libre pronunciation-cmn', { namespace: 'File' });, when operating with namespace: 'File' (ns:6), could leverage smarter API queries requesting imageinfo (see the section "Wikimedia API and group queries" above). Each such API call run within .categorymembers() can return 500 richer objects such as:

  { 
    "pageid":104331639,
    "ns":6,
    "title":"File:LL-Q150 (fra)-Kitel WP-%.wav",
    "timestamp": "2021-04-25T15:49:00Z",
    "url": "https://upload.wikimedia.org/wikipedia/commons/a/a3/LL-Q150_%28fra%29-Kitel_WP-%25.wav"
  },

If input in this format is passed down, .download() can:

Code to do (proposal)

  1. If .categorymembers() looks for Files, then use the smart API generator and return an array of objects with pageid, ns, title AND url, timestamp.
  2. If .download()'s input options contain url && timestamp, then (see the sketch below):
    • skip the call on https://commons.wikimedia.org/w/api.php / do not call the API,
    • compare: if a local file with the same name exists, directly compare the timestamp value with the local file's timestamp,
    • download: if there is no local file with the same name OR an update is needed, directly download the file via a query on https://upload.wikimedia.org.
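
A rough sketch of that branch in plain node.js, only to pin down the idea (the names are mine; real code would live in cejs):

// Hypothetical sketch of the proposed branch in .download() (not cejs code).
const fs = require('fs');

async function handleFile(page, directory) {
  if (!page.url || !page.timestamp) {
    return 'fallback'; // current behaviour: one extra api.php call for imageinfo
  }
  const localPath = `${directory}/${page.title.replace(/^File:/, '')}`;
  if (fs.existsSync(localPath)) {
    const localMtime = fs.statSync(localPath).mtime;
    if (localMtime >= new Date(page.timestamp)) return 'skipped'; // local copy is fresh
  }
  // No api.php call at all: fetch directly from https://upload.wikimedia.org.
  await downloadDirect(page.url, localPath); // see the sketch in the "Url" section
  return 'downloaded';
}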
hugolpz commented 2 years ago

Questions

Q1: @kanasimi, the performance of this last benchmark and some calculations suggest that either the code is speed limited or each .download() run still makes one rate-limited API call per file.

Is that correct?

Q2: How do you limit the speed? You simply make those calls in synchronous JS, right?

Comment

I suspect some optimization ideas here could still be implemented if wanted.

hugolpz commented 2 years ago

Hello Kanasimi, Benchmark 3 done! :) I'm a bit blind to your recent changes, but they appear to have sped up the initial downloads.

We need slower coding and more discussion please. 🐨 🦦 I think the changes you implemented are different from those proposed here. Since you are good in JS you are used to going straight into coding; you move faster this way and it's usually the best choice, and you did come up with nice improvements. But by doing so I'm sidelined: I don't know what you are coding, and I can't confirm we are thinking about the same idea. We both push for performance improvements, but it seems we pushed for different solutions. Can we discuss the proposed change above, so we confirm we are talking about the same thing, properly look at it, and accept or reject it? A brief explanation of why this proposal is not practical for cejs, which is also possible, would be ok. See also the questions here where I try to understand the current code's approach.

.download() benchmark (3)

I tested today with n=805.

Performance increase

Compared to the .download() benchmark 7 days prior. 🎇

Inspecting categorymembers()

I see

var files = await targetwiki.categorymembers(targetCategoryName, { namespace: 'File' });
console.log(files[0], files[1], files[2])

...still currently returns (cejs from today):

{
  pageid: 113936425,
  ns: 6,
  title: 'File:LL-Q9192 (cmn)-Luilui6666-102 块钱.wav'
} {
  pageid: 113936375,
  ns: 6,
  title: 'File:LL-Q9192 (cmn)-Luilui6666-102块钱.wav'
} {
  pageid: 113936358,
  ns: 6,
  title: 'File:LL-Q9192 (cmn)-Luilui6666-17-45.wav'
}

This is then passed to .download(). It misses the url and timestamp properties, which would allow faster execution with 500 times fewer API calls on the rate-limited https://commons.wikimedia.org/w/api.php. See the proposal here.

kanasimi commented 2 years ago

Hi.

When executing session.download('Category:name', ...), wiki_API_download() will:

  1. Get the category tree without files, using session.category_tree(). session.category_tree() will use categoryinfo and categorymembers (categories only) to increase speed (a raw-API illustration of this step is sketched below).
  2. Back in wiki_API_download(): for each category, get imageinfo (with URL and latest date) via generator:categorymembers to get the files in the category.
  3. For each file, check the timestamp and download the new file.
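
For reference, step 1 can be expressed with the raw API roughly as below (an illustration only, not how cejs implements session.category_tree(); continuation handling omitted):

// Hypothetical illustration of step 1: walk the subcategory tree with
// list=categorymembers (cmtype=subcat), without fetching any files yet.
async function listSubcategories(category, seen = new Set()) {
  const api = 'https://commons.wikimedia.org/w/api.php';
  const params = new URLSearchParams({
    action: 'query',
    list: 'categorymembers',
    cmtitle: category,
    cmtype: 'subcat',
    cmlimit: '500',
    format: 'json'
  });
  const data = await (await fetch(`${api}?${params}`)).json();
  for (const member of data.query.categorymembers) {
    if (!seen.has(member.title)) {
      seen.add(member.title);
      await listSubcategories(member.title, seen); // recurse into subcategories
    }
  }
  return seen;
}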
kanasimi commented 2 years ago

You can try this now: await wiki.download('Category:Lingua Libre pronunciation', { directory: './downloads', max_threads: 4 });
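
Spelled out as a complete script, that call looks roughly like this (the constructor argument and the download directory are illustrative):

// Minimal stand-alone sketch around the call above; details may differ.
const Wikiapi = require('wikiapi');

(async () => {
  const wiki = new Wikiapi('commons'); // target Wikimedia Commons
  await wiki.download('Category:Lingua Libre pronunciation', {
    directory: './downloads',
    max_threads: 4
  });
})();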

hugolpz commented 2 years ago

Hi, thanks for this explanation above. 👍🏼

  1. For each file, check timestamp (#55) and download new file (#53).

IMPORTANT/CORE OF THIS ISSUE #53: what is your actual process to download a new file? Do you mean:

A) one api.php call per file to get its url, then one download call per file, or
B) one api.php call per batch of files (generator + imageinfo), then one direct download call per file on https://upload.wikimedia.org?

kanasimi commented 2 years ago

All imageinfo is fetched with generator:categorymembers. It should be B.

It seems the reason you did not use session.download('Category:name', ...) is that there was no filter. Now you can try:

await wiki.download('Category:name', {
    directory: './',
    max_threads: 4,
    page_filter(page_data) {
        return page_data.title.includes('word');
    }
});

Multiple threads will be much faster.

hugolpz commented 2 years ago

I'm double-checking my packages and running a benchmark.

[EDIT 1:] Looks pretty good: about 200 downloads / minute.

[EDIT 2:] A tiny surprise (see #56), but it was easy to fix; use:

await targetwiki.download(
  targetwiki.to_namespace('Lingua_Libre_pronunciation-cmn', 'Category'), // <===========
  {....}
);

.download() benchmarks

One week ago (22.01.01)

Before today (22.01.13)

After today's fix (22.01.14)

Note

It's 3 times faster! This could be attributed to multi-threading.

According to Discord's chat 2 weeks ago:

An average of 100 download calls per second on https://upload.wikimedia.org is not unheard of, but likely requires careful request headers and whitelisting.

So if the code uses solution B) from above, it can technically go much faster. I do not encourage doing so; a reasonable maximum would be about 5~8 downloads per second.
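
For what it's worth, capping the rate outside of wikiapi could be as simple as the sketch below (illustrative only; max_threads already covers this inside wikiapi):

// Hypothetical throttle: run at most `limit` direct downloads in parallel.
async function downloadAll(files, limit = 6) {
  const queue = files.slice();
  async function worker() {
    while (queue.length > 0) {
      const file = queue.shift();
      const name = file.title.replace(/^File:/, '');
      await downloadDirect(file.url, `./downloads/${name}`); // see the "Url" section sketch
    }
  }
  // Start `limit` workers that drain the shared queue.
  await Promise.all(Array.from({ length: limit }, worker));
}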

Questions

See #57.

Closure?

@Kanasimi, if your code uses solution B) from above (a single api.php call for 500 files and the respective 500 https://upload.wikimedia.org calls), then this current issue #53 can be closed. This is pretty good already 🚀 👍🏼

kanasimi commented 2 years ago

Ok