Closed by hugolpz 2 years ago.
Hello Kanasimi,
The .download() benchmark section above shows .download() to have a conservatively slow performance. Per code analysis, I understand .download() takes as input objects such as:
{
"pageid":104331639,
"ns":6,
"title":"File:LL-Q150 (fra)-Kitel WP-%.wav"
}
Such input means .download() initially misses the url and timestamp, and therefore must run one additional rate-limited API call per file to get them. This results in the observed speed limitation of about 1 file per 2.7 seconds. This rate-limited API call also occurs when the file is eventually skipped because it is already present locally.
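To make the cost concrete, here is a rough sketch of the per-file lookup such input forces (my own illustration, not cejs' actual code; fetchImageinfoForOnePage is a hypothetical helper):

// Hypothetical sketch: one rate-limited api.php call per file, only to
// learn the direct url and timestamp before the actual download can start.
async function fetchImageinfoForOnePage(pageid) {
  const api = 'https://commons.wikimedia.org/w/api.php';
  const params = new URLSearchParams({
    action: 'query', format: 'json',
    pageids: String(pageid),        // a single page per request
    prop: 'imageinfo',
    iiprop: 'url|timestamp'
  });
  const data = await (await fetch(api + '?' + params)).json();
  return data.query.pages[pageid].imageinfo[0];   // { timestamp, url, ... }
}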
However, await targetwiki.categorymembers('Category:Lingua Libre pronunciation-cmn', { namespace: 'File' });, when operating with namespace: 'File' (ns:6), could leverage smarter API queries requesting imageinfo (see the section "Wikimedia API and group queries" above). Each smart API call run within .categorymembers() can return 500 richer objects such as:
{
"pageid":104331639,
"ns":6,
"title":"File:LL-Q150 (fra)-Kitel WP-%.wav",
"timestamp": "2021-04-25T15:49:00Z",
"url": "https://upload.wikimedia.org/wikipedia/commons/a/a3/LL-Q150_%28fra%29-Kitel_WP-%25.wav"
},
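For reference, a hedged sketch of such a group query (the MediaWiki parameters generator=categorymembers, prop=imageinfo and iiprop=url|timestamp are standard; the listFilesWithUrls wrapper itself is only illustrative):

// Hypothetical wrapper: one api.php call lists the category members (ns:6)
// together with their imageinfo, so url and timestamp come for free.
async function listFilesWithUrls(categoryName) {
  const api = 'https://commons.wikimedia.org/w/api.php';
  const params = new URLSearchParams({
    action: 'query', format: 'json',
    generator: 'categorymembers',
    gcmtitle: categoryName,     // e.g. 'Category:Lingua Libre pronunciation-cmn'
    gcmtype: 'file',            // only ns:6 members
    gcmlimit: '500',            // up to 500 results per call
    prop: 'imageinfo',
    iiprop: 'url|timestamp'
  });
  const data = await (await fetch(api + '?' + params)).json();
  return Object.values(data.query.pages).map(page => ({
    pageid: page.pageid,
    ns: page.ns,
    title: page.title,
    timestamp: page.imageinfo[0].timestamp,
    url: page.imageinfo[0].url
  }));
}

With continuation (gcmcontinue), the same pattern covers categories larger than 500 files.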
If input with such a format is passed down, .download() can:
- skip the rate-limited https://commons.wikimedia.org/w/api.php call;
- use the url value to download directly from https://upload.wikimedia.org. According to discussion on Wikimedia's Discord, this endpoint accepts 20~100 times more requests than the other one (if the request headers are properly set);
- use the timestamp value to compare directly against local files' creation times, so files can be compared in milliseconds instead of 0.29 seconds (factor x10-100).

In short: .categorymembers() looks for Files, then uses the smart API generator and returns an array of objects with pageid, ns, title AND url, timestamp. When .download()'s input options contain url && timestamp, then:
- https://commons.wikimedia.org/w/api.php : do not call this API;
- https://upload.wikimedia.org : download the file directly (a minimal sketch follows below).
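Here is a minimal sketch of that fast path, assuming the enriched objects above; downloadIfNewer and the directory handling are hypothetical, not cejs' implementation:

import { stat, writeFile } from 'node:fs/promises';

// Hypothetical fast path: no api.php call at all when url and timestamp
// are already present in the input object.
async function downloadIfNewer(file, directory = './downloads') {
  const localPath = directory + '/' + file.title.replace(/^File:/, '');
  try {
    const local = await stat(localPath);
    // timestamp comparison against the local mtime: milliseconds, no API call
    if (local.mtime >= new Date(file.timestamp)) return 'skipped';
  } catch {
    // file not present locally: fall through and download it
  }
  const response = await fetch(file.url);   // direct hit on upload.wikimedia.org
  await writeFile(localPath, Buffer.from(await response.arrayBuffer()));
  return 'downloaded';
}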
Q1: @kanasimi, the performance of this last benchmark and some calculations suggest that either the code is speed-limited, or each .download() run still makes one call to https://commons.wikimedia.org/w/api.php to fetch the url and timestamp parameters, plus one call to https://upload.wikimedia.org for the file itself. Is that correct?
Q2: How do you limit the speed? You simply make those calls in synchronous JS, right?
I suspect some optimization ideas here could still be implemented if wanted.
Hello Kanasimi, Benchmark 3 done! :) I'm a bit blind on your recent changes, but they appear to have sped up the initial downloads.
We need slower coding and more discussion please. 🐨 🦦 I think the changes you implemented are different from those proposed here. Since you are good in JS you are used to going straight into coding; you move faster this way and it's usually the best choice, and you indeed came up with nice improvements. But by doing so I'm sidelined: I don't know what you are coding, and I can't confirm we are thinking about the same idea. We both push for performance improvements, but it seems we pushed for different solutions. Can we discuss the proposed change above so we confirm we are talking about the same thing, and properly look at it, accept it or reject it? A brief explanation of why this proposal is not practical for cejs, which is also possible, would be OK. See also the questions here where I try to understand the current code's approach.
I tested today with n=805.
Performance increased compared to the .download() benchmark 7 days prior. 🎇
categorymembers()
I see
var files = await targetwiki.categorymembers(targetCategoryName, { namespace: 'File' });
console.log(files[0], files[1], files[2])
...still currently returns (cejs from today):
{
pageid: 113936425,
ns: 6,
title: 'File:LL-Q9192 (cmn)-Luilui6666-102 块钱.wav'
} {
pageid: 113936375,
ns: 6,
title: 'File:LL-Q9192 (cmn)-Luilui6666-102块钱.wav'
} {
pageid: 113936358,
ns: 6,
title: 'File:LL-Q9192 (cmn)-Luilui6666-17-45.wav'
}
This is then passed to .download(). It misses the url and timestamp parameters which would allow faster execution, with 500 times fewer API calls on the rate-limited https://commons.wikimedia.org/w/api.php. See the proposal here.
Hi.
When executing session.download('Category:name', ...), wiki_API_download() will:
You can try this now:
await wiki.download('Category:Lingua Libre pronunciation', { directory: './downloads', max_threads: 4 });
Hi, thanks for this explanation above. 👍🏼
IMPORTANT/CORE OF THIS ISSUE #53: what is your actual process to download a new file? Do you mean:
A. for each file, use its pageid to call commons.wikimedia.org/w/api.php and fetch the true url value (rate limited, slow!), then download from upload.wikimedia.org via that url value,
OR!
B. get the url values in group queries (500 times faster, since one API call covers 500 files! As now used for file comparison), then download from upload.wikimedia.org via the url value (same)
???
All imageinfo are fetched with generator:categorymembers. It should be B.
It seems the reason you did not use session.download('Category:name', ...) is that there was no filter.
Now you can try:
await wiki.download('Category:name', {
  directory: './',
  max_threads: 4,
  page_filter(page_data) {
    return page_data.title.includes('word');
  }
});
Multiple threads will be much faster.
I'm double-checking my packages and running a benchmark.
[EDIT1:] Looks pretty good: about 200 downloads / minute.
[EDIT2:] A tiny surprise (see #56), but it was easy to fix; use:
await targetwiki.download(
  targetwiki.to_namespace('Lingua_Libre_pronunciation-cmn', 'Category'), // <===========
  {....}
);
.download() benchmarks
It's 3 times faster! This could be attributed to multi-threading.
According to Discord's chat 2 weeks ago:
An average of 100 download calls per second on https://upload.wikimedia.org is not unheard of, but it likely requires careful request headers and whitelisting.
So if the code uses solution B) from above, it can technically go much faster. I do not encourage doing so; a maximum should be about 5~8 downloads per second.
See #57.
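To illustrate the 5~8 downloads per second cap, a simple throttle over the enriched { title, url } objects discussed above could look like this (illustrative only, not part of cejs):

import { writeFile } from 'node:fs/promises';

// Illustrative throttle: space out direct upload.wikimedia.org requests so
// the sustained rate stays around 5~8 files per second.
async function downloadAllThrottled(files, directory = './downloads', perSecond = 6) {
  const delayMs = 1000 / perSecond;
  for (const file of files) {
    const response = await fetch(file.url);   // no api.php involved
    const localPath = directory + '/' + file.title.replace(/^File:/, '');
    await writeFile(localPath, Buffer.from(await response.arrayBuffer()));
    await new Promise(resolve => setTimeout(resolve, delayMs));
  }
}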
@Kanasimi, if your code uses solution B) from above (one single api.php call for 500 files, plus the respective 500 https://upload.wikimedia.org calls), then this current issue #53 can be closed. This is pretty good already 🚀 👍🏼
Ok
I don't know how cejs and WikiapiJS' .download() currently handle their queries, but I suspect it lists the target files in an array via .categorymembers(), .search() or others, which return something such as:
Categorymembers files:
then uses the pageid to run a new API query for each file, gets the url, and downloads the file.
Limits
ratelimit
Wikimedia API and group queries
There are Special:ApiSandbox queries which, using a single API call, can fetch by category name a few hundred category member files, with the exact file url and timestamp. title, timestamp and url are the most relevant properties, I believe. See also: API:Categorymembers, API:Allimages, API:Imageinfo.
Url
For files, the url property gives a direct download url, allowing download from upload.wikimedia.org without an additional API query on https://commons.wikimedia.org/w/api.php?. With one API query we can have 500 direct urls to download from at higher speed.
Sustained "burst" management
The Wikimedia Discord api-bot channel made several inputs to this project: