openzim / wp1

Wikipedia 1.0 engine & selection tools
https://wp1.openzim.org
GNU General Public License v2.0

Make SPARQL materialization work even for queries with large result sets, using pagination #519

Open audiodude opened 2 years ago

audiodude commented 2 years ago

#517 added support for materializing SPARQL queries. However, it works by only looking at a single page of results. For queries that are expected to have a much larger result set, we should use pagination, which the Wikidata SPARQL endpoint supports.
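Roughly, the paginated version could append LIMIT/OFFSET to the user's query and keep fetching until a page comes back empty. A minimal sketch (the endpoint URL is the public WDQS one; the page size, User-Agent, and helper name are just placeholders, not anything that exists in wp1 today):

import requests

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"
PAGE_SIZE = 10000  # placeholder; we'd need to pick a real value

def paginated_select(query, user_agent):
    """Yield result bindings one page at a time by appending LIMIT/OFFSET."""
    offset = 0
    while True:
        paged = f"{query}\nLIMIT {PAGE_SIZE}\nOFFSET {offset}"
        resp = requests.get(
            WDQS_ENDPOINT,
            params={"query": paged, "format": "json"},
            headers={"User-Agent": user_agent},
            timeout=90,
        )
        resp.raise_for_status()
        bindings = resp.json()["results"]["bindings"]
        if not bindings:
            return
        yield from bindings
        offset += PAGE_SIZE

Note this assumes the user's query doesn't already end in LIMIT/OFFSET, and without an ORDER BY the pages aren't guaranteed to be stable. As discussed below, each page also still counts against the endpoint's execution limits.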

kelson42 commented 2 years ago

Seems really important indeed.

audiodude commented 1 year ago

I knew I used pagination in the past, but it was by calling the wbsearchentities endpoint of the Wikibase API.
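For what it's worth, that API paginates with a plain continue offset, something like the following (parameter and field names are from memory, so treat this as a sketch rather than wp1 code):

import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def search_entities(term, language="en"):
    """Page through wbsearchentities results using its 'continue' offset."""
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": language,
        "format": "json",
        "limit": 50,
        "continue": 0,
    }
    while True:
        data = requests.get(WIKIDATA_API, params=params).json()
        yield from data.get("search", [])
        if "search-continue" not in data:
            return
        params["continue"] = data["search-continue"]

But that only searches labels and aliases; it isn't a substitute for arbitrary SPARQL.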

Looking at the user manual for the SPARQL endpoint, it looks like there are hard limits on query execution (60 seconds) and no way of paginating through arbitrarily large result sets. The advice is to pare down your query or run your own WDQS (!).

Additionally, a particular UA + IP combo is only allowed 60 seconds of query time per 60 seconds of wall time. So looking ahead, if we had many users using the WP1 service, we would not be able to materialize all of the queries in parallel.
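If we stay on the public endpoint, we'd have to enforce that budget ourselves before running materializations in parallel. A rough sketch of what client-side throttling could look like (only the 60-seconds-per-60-seconds figure comes from the docs; everything else is made up for illustration):

import time

class QueryTimeBudget:
    """Roughly track WDQS query time spent in the last 60s of wall time and
    wait before starting a query that would exceed the documented allowance."""

    def __init__(self, budget_seconds=60.0, window_seconds=60.0):
        self.budget = budget_seconds
        self.window = window_seconds
        self.samples = []  # (finished_at, duration) for recent queries

    def _spent(self, now):
        self.samples = [(t, d) for t, d in self.samples if now - t < self.window]
        return sum(d for _, d in self.samples)

    def run(self, query_fn, *args, **kwargs):
        while self._spent(time.monotonic()) >= self.budget:
            time.sleep(1)
        start = time.monotonic()
        try:
            return query_fn(*args, **kwargs)
        finally:
            self.samples.append((time.monotonic(), time.monotonic() - start))

Even with something like this, a single long query eats the whole budget, so truly parallel materialization still wouldn't work on the shared endpoint.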

I'm wondering if there's someone we can contact to get our UA whitelisted for a special exemption somehow, or if there is anything we can do to avoid setting up a copy of WDQS with the limits removed, which will take a month according to the docs:

Warning: As of 2020, on Wikimedia servers it will take about 12 days to get all the data in the dump imported, and another 12 days for the query service to catch up on the lag.

kelson42 commented 1 year ago

@audiodude Can you please open a ticket first in phabricator? Then, if necessary, we would ping a few people at the WMF.

audiodude commented 1 year ago

I think I need to better understand the use case first before I can open a phabricator ticket. I imagine the first thing they're going to ask me is "what kinds of queries do you expect to run?" and I don't really know the answer. So far, I've gotten the timeout only when running a malformed query that selected every English Wikipedia article. Clearly that's not realistically what we want to do.

Would something like "Every living person" or "Every geographic location" be a more reasonable example? I can craft SPARQL queries for those pretty easily and see if they time out. Would our needs generally be more narrow or even more broad than those examples?

audiodude commented 1 year ago

For the former, I ran this query for "Count the number of humans who have a birthdate but not a deathdate":

# Humans (Q5) with a birth date (P569) but no death date (P570)
SELECT (COUNT(*) AS ?count)
WHERE {
  ?item wdt:P31 wd:Q5 ;
        wdt:P569 ?birth .
  OPTIONAL { ?item wdt:P570 ?death . }
  FILTER(!BOUND(?death))
}

And it timed out; it couldn't even count that many items, never mind return their article URLs.

audiodude commented 1 year ago

Any thoughts, @kelson42 ?

kelson42 commented 1 year ago

@audiodude Correct me if I get it wrong:

- Current timeout is short and we don't actually really know so far what would be a proper value for us
- There is no way to deal with pagination of the resultset

Do you confirm I get it right? To me the second problem is far more worrying than the first, because how is the Wikidata backend going to deal with a large result set?

audiodude commented 1 year ago

> Current timeout is short and we don't actually really know so far what would be a proper value for us

It's 60 seconds. It seems like that's not enough to generate lists with hundreds of thousands of articles, but I don't really know. That's the other part of your statement: we don't know what would be a good value for us. So this is true.

> There is no way to deal with pagination of the resultset

That's right. From what I can tell, WDQS streams results until it hits the timeout, at which point it appends a Java stacktrace directly in the middle of the response JSON. So the visible effect of hitting the timeout is that you get "Unparseable JSON" errors, but really it's just timing out. Their strategy seems to be: if your results fit within the 60 second limit, you get them all back as one big response.
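Practically, that means the client has to treat a JSON parse failure as a probable timeout. Something along these lines (the marker string is just what I've seen in the truncated bodies, so treat it as an assumption):

import json

class WdqsTimeoutError(Exception):
    """The WDQS response was cut off by the server-side query timeout."""

def parse_wdqs_json(text):
    """Parse a WDQS response body, surfacing timeouts instead of decode errors."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Assumption: the truncated body contains the Java timeout stacktrace.
        if "TimeoutException" in text:
            raise WdqsTimeoutError("WDQS query exceeded its execution time limit")
        raise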

kelson42 commented 1 year ago

@audiodude Thx for confirming. I recommend opening two separate tickets. Not sure either what would be a proper timeout, but the combination of the two limitations is deadly.

audiodude commented 1 year ago

Okay, added https://phabricator.wikimedia.org/T319150 and https://phabricator.wikimedia.org/T319151 and added you as subscriber to both.

kelson42 commented 1 year ago

It seems the only thing we can obtain is a longer timeout, and this is tracked here: https://phabricator.wikimedia.org/T179879

kelson42 commented 1 year ago

@audiodude I propose to remove this task from the project milestone.

audiodude commented 1 year ago

Done, thanks.