audiodude opened 2 years ago
Seems really important indeed.
I know I've used pagination in the past, but that was by calling the wbsearchentities endpoint of the Wikibase API.
Looking at the user manual for the SPARQL endpoint, it looks like there are hard limits on query execution (60 seconds) and no way of paginating through arbitrarily large result sets. The advice is to pare down your query or run your own WDQS (!).
Additionally, a particular UA + IP combo is only allowed 60 seconds of query time per 60 seconds of wall time. So looking ahead, if we had many users using the WP1 service, we would not be able to materialize all of the queries in parallel.
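To make that concrete, here is a minimal sketch of what respecting that budget might look like on our side (my own illustration, not anything WP1 does today; the endpoint URL is the public one, the 429 + Retry-After behavior is what the WDQS docs describe, and the User-Agent string is just a placeholder):

```python
import time

import requests

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"
# WDQS rate-limits on the UA + IP combination, so always send a descriptive
# User-Agent. This string is only a placeholder.
HEADERS = {"User-Agent": "WP1-materializer/0.1 (contact: ops@example.org)"}


def run_query(sparql):
    """Run one SPARQL query, backing off if WDQS says we are over budget."""
    for _ in range(3):
        resp = requests.get(
            WDQS_ENDPOINT,
            params={"query": sparql, "format": "json"},
            headers=HEADERS,
            timeout=70,  # client timeout a bit above the 60s server-side limit
        )
        if resp.status_code == 429:
            # Over the 60s-of-query-time-per-60s budget; wait as instructed.
            time.sleep(int(resp.headers.get("Retry-After", "60")))
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError("still throttled by WDQS after 3 attempts")
```

Backing off only helps with the throttling, of course; it does nothing for the per-query 60 second execution limit.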
I'm wondering if there's someone we can contact to get our UA whitelisted for a special exemption somehow, or if there is anything we can do to avoid setting up a copy of WDQS with the limits removed, which will take a month according to the docs:
Warning: As of 2020, in Wikimedia servers it will take about 12 days to get all data in the dump imported, and another 12 days to make the query service catching up the lag.
@audiodude Can you please open a ticket first in phabricator? Then, if necessary, we would ping a few people at the WMF.
I think I need to better understand the use case first before I can open a phabricator ticket. I imagine the first thing they're going to ask me is "what kinds of queries do you expect to run?" and I don't really know the answer. So far, I've gotten the timeout only when running a malformed query that selected every English Wikipedia article. Clearly that's not realistically what we want to do.
Would something like "Every living person" or "Every geographic location" be a more reasonable example? I can craft SPARQL queries for those pretty easily and see if they time out. Would our needs generally be more narrow or even more broad than those examples?
For the former, I ran this query for "Count the number of humans who have a birthdate but not a deathdate":
SELECT (COUNT(*) AS ?count)
WHERE {
  ?item wdt:P31 wd:Q5 ;               # instance of: human
        wdt:P569 ?birth .             # has a date of birth
  OPTIONAL { ?item wdt:P570 ?death }  # date of death, if any
  FILTER(!BOUND(?death))              # keep only people with no recorded death date
}
And it timed out; it couldn't even produce the count, never mind return the article URLs.
Any thoughts, @kelson42 ?
@audiodude Correct me if I get it wrong:
Do you confirm I got it right? To me the second problem is far more worrying than the first, because how is the Wikidata backend going to deal with a large result set?
The current timeout is short and we don't actually know yet what a proper value for us would be
It's 60 seconds. It seems like that's not enough to generate lists with hundreds of thousands of articles, but I don't really know. That's the other part of your statement: we don't know what would be a good value for us. So this is true.
There is no way to deal with pagination of the resultset
That's right. From what I can tell, WDQS streams results until it hits the timeout, at which point a Java stacktrace is appended directly in the middle of the response JSON. So the visible effect of hitting the timeout is an "Unparseable JSON" error, but really it's just a timeout. Their strategy seems to be: if your results fit within the 60 second limit, you get them all back as one big response.
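If it's useful, here's roughly how our client code could turn that failure mode into a clearer error (just a sketch; the substring check against the stacktrace is an assumption we'd want to verify against an actual failing response):

```python
import json


def parse_wdqs_response(body_text):
    """Parse a WDQS response body, surfacing the timed-out-mid-stream case.

    WDQS streams results and, when it hits the 60 second limit, the Java
    stacktrace lands inside the partial JSON, so json.loads() fails.
    """
    try:
        return json.loads(body_text)
    except json.JSONDecodeError as err:
        # Assumption: the embedded stacktrace mentions TimeoutException.
        if "TimeoutException" in body_text:
            raise RuntimeError("WDQS query timed out mid-response") from err
        raise
```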
@audiodude Thx for confirming. I recommend opening two separate tickets. Not sure either what would be a proper timeout, but the combination of the two limitations is deadly.
Okay, added https://phabricator.wikimedia.org/T319150 and https://phabricator.wikimedia.org/T319151 and added you as subscriber to both.
It seems the only thing we can obtain is a longer timeout and this is tracked here https://phabricator.wikimedia.org/T179879
@audiodude I propose to remove this task from the project milestone.
Done, thanks.
#517 added support for materializing SPARQL queries. However, it only fetches a single page of results. For queries that are expected to have a much larger result set, we should use pagination, which the Wikidata SPARQL endpoint supports.
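For reference, the pagination I have in mind is plain LIMIT/OFFSET with a stable ORDER BY, something like the sketch below (every page still has to finish within the 60 second limit, so this doesn't get around the timeout; run_query is the helper sketched earlier in the thread, and the SELECT shape is only illustrative):

```python
def fetch_all_items(where_clause, page_size=10000):
    """Yield ?item bindings for a WHERE clause, one LIMIT/OFFSET page at a time."""
    offset = 0
    while True:
        page = run_query(
            "SELECT ?item WHERE { %s } ORDER BY ?item LIMIT %d OFFSET %d"
            % (where_clause, page_size, offset)
        )
        bindings = page["results"]["bindings"]
        if not bindings:
            return  # an empty page means we've read everything
        yield from bindings
        offset += page_size
```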