tom-james-watson / wikitrivia-generator

Investigate using SPARQL to source cards instead of scraping dumps #6

Open tom-james-watson opened 2 years ago

tom-james-watson commented 2 years ago

I had trouble when I tried using SPARQL, which was actually my first approach to tackling this. Any queries broad enough to return sufficient data would simply time out. Maybe people in the community with more SPARQL experience can help, though!

tom-james-watson commented 2 years ago

By using SPARQL, we can more easily create data sets for specific subsets of things like "Books", "Battles" or "TV Shows". We would still want an "All" collection though, which would work similarly to the current version.

To be more specific about what would be needed here, I should first give a rough outline of how the current processing works.

In order to use SPARQL, we would need not only to fetch enough items to populate the game, but also to filter them on some heuristics to ensure that we only generate interesting cards that can reasonably be answered. A great signal for that is Wikipedia page views, but there may also be better ways of doing this.

The current game has a list of around 10,000 cards, to give an idea of how many would be needed.

It's also worth noting that the kinds of cards that get generated could be improved a lot. For example, you may have a card for Woodrow Wilson, who is well known enough to be included, but the card would ask when he was born, which is much harder to place. A better card would ask when he became president. Being able to detect what is interesting about an entry and programmatically generate a card based on that would be great, though I imagine it is difficult.

What I suspect may make more sense is to have many different SPARQL queries that each compose data based on things like the above example, in this case a list of when famous world leaders came to power, and then to join all of those datasets together.
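
For instance, a minimal sketch of the "when famous world leaders came to power" idea, using US presidents. It assumes position held (P39) with its start time qualifier (P580) and the position President of the United States (Q11696); other positions would need their own item IDs:

```sparql
# Sketch: famous leaders and when they came to power (US presidents here).
SELECT ?item ?itemLabel ?start ?sitelinks WHERE {
  ?item p:P39 ?statement;
        wikibase:sitelinks ?sitelinks.
  ?statement ps:P39 wd:Q11696;   # position held: President of the United States
             pq:P580 ?start.     # qualifier: start time, i.e. when they took office
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY DESC(?sitelinks)
LIMIT 100
```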

It would also be possible to pass the results of a SPARQL query through a further processing step that checks Wikipedia page views, should we feel that Wikidata sitelinks alone don't provide a good enough signal for how well known an item is.

Some initial queries I was playing around with:

Famous Monarchs: https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20%3Fidcount%20%3Fsitelinks%20WHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP106%20wd%3AQ116%3B%0A%20%20%20%20wdt%3AP31%20wd%3AQ5%3B%0A%20%20%20%20wikibase%3Aidentifiers%20%3Fidcount%3B%0A%20%20%20%20wikibase%3Asitelinks%20%3Fsitelinks.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22.%20%7D%0A%7D%0AORDER%20BY%20DESC%20%28%3Fsitelinks%29%0ALIMIT%20100
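
Decoded from the URL, that query is (P106 is occupation, Q116 is monarch, P31 is instance of, Q5 is human):

```sparql
SELECT ?item ?itemLabel ?idcount ?sitelinks WHERE {
  ?item wdt:P106 wd:Q116;        # occupation: monarch
    wdt:P31 wd:Q5;               # instance of: human
    wikibase:identifiers ?idcount;
    wikibase:sitelinks ?sitelinks.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY DESC (?sitelinks)
LIMIT 100
```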

Famous Video Games: https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20%3FpublicationDate%20%3Fidcount%20%3Fsitelinks%20WHERE%20%7B%0A%20%20%3Fitem%0A%20%20%20%20wdt%3AP31%20wd%3AQ7889%3B%0A%20%20%20%20wdt%3AP577%20%3FpublicationDate%3B%0A%20%20%20%20wikibase%3Aidentifiers%20%3Fidcount%3B%0A%20%20%20%20wikibase%3Asitelinks%20%3Fsitelinks.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22.%20%7D%0A%7D%0AORDER%20BY%20DESC%20%28%3Fsitelinks%29%0ALIMIT%201000
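
Decoded (Q7889 is video game, P577 is publication date):

```sparql
SELECT ?item ?itemLabel ?publicationDate ?idcount ?sitelinks WHERE {
  ?item
    wdt:P31 wd:Q7889;            # instance of: video game
    wdt:P577 ?publicationDate;   # publication date
    wikibase:identifiers ?idcount;
    wikibase:sitelinks ?sitelinks.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY DESC (?sitelinks)
LIMIT 1000
```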

namedgraph commented 2 years ago

Your query examples work fine :) So what was it that actually did not work?

tom-james-watson commented 2 years ago

My problem is basically collecting enough queries that return enough interesting results to actually fill the game with at least ten thousand cards. Maybe what I should ask is for people to contribute SPARQL queries that return interesting results?

Here is what I would need in the query results (judging by the example queries above):

- the Wikidata item and its English label
- the date the card will ask about (date of birth, publication date, point in time, etc.)
- the sitelinks count, as a proxy for how well known the item is

Maybe somebody can come up with a reusable snippet that provides those things, and then people can concentrate on writing interesting queries?
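
As a minimal sketch of what such a snippet might look like (the instance-of class, date property, and sitelinks threshold here are placeholders to be swapped per category, not settled choices):

```sparql
# Reusable skeleton (sketch): swap in the class and date property per category.
SELECT ?item ?itemLabel ?date ?sitelinks WHERE {
  ?item wdt:P31 wd:Q7889;        # placeholder class, e.g. video game
        wdt:P577 ?date;          # placeholder date property, e.g. publication date
        wikibase:sitelinks ?sitelinks.
  FILTER(?sitelinks >= 20)       # placeholder threshold to drop obscure items
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY DESC(?sitelinks)
LIMIT 1000
```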

I've made a discussion here where we can start collating queries: https://github.com/tom-james-watson/wikitrivia-scraper/discussions/8.

kidehen commented 2 years ago

You do have the option of querying Wikidata as deployed by other SPARQL query service providers, e.g. the instance we host using our Virtuoso Platform.

tom-james-watson commented 2 years ago

Battles: https://query.wikidata.org/#SELECT%20%3Fitem%20%3Fdate%20%3Fidcount%20%3Fsitelinks%20WHERE%20%7B%0A%20%20%3Fitem%0A%20%20%20%20wdt%3AP31%20wd%3AQ178561%3B%0A%20%20%20%20wdt%3AP585%20%3Fdate%3B%0A%20%20%20%20wikibase%3Aidentifiers%20%3Fidcount%3B%0A%20%20%20%20wikibase%3Asitelinks%20%3Fsitelinks.%0A%7D%0AORDER%20BY%20DESC%20%28%3Fsitelinks%29%0ALIMIT%201000
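
Decoded (Q178561 is battle, P585 is point in time; note this one skips the label service):

```sparql
SELECT ?item ?date ?idcount ?sitelinks WHERE {
  ?item
    wdt:P31 wd:Q178561;          # instance of: battle
    wdt:P585 ?date;              # point in time
    wikibase:identifiers ?idcount;
    wikibase:sitelinks ?sitelinks.
}
ORDER BY DESC (?sitelinks)
LIMIT 1000
```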

Paintings: https://query.wikidata.org/#SELECT%20%3Fitem%20%3Fdate%20%3Fidcount%20%3Fsitelinks%20WHERE%20%7B%0A%20%20%3Fitem%0A%20%20%20%20wdt%3AP31%20wd%3AQ3305213%3B%0A%20%20%20%20wdt%3AP571%20%3Fdate%3B%0A%20%20%20%20wikibase%3Aidentifiers%20%3Fidcount%3B%0A%20%20%20%20wikibase%3Asitelinks%20%3Fsitelinks.%0A%7D%0AORDER%20BY%20DESC%20%28%3Fsitelinks%29%0ALIMIT%201000
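
Decoded (Q3305213 is painting, P571 is inception):

```sparql
SELECT ?item ?date ?idcount ?sitelinks WHERE {
  ?item
    wdt:P31 wd:Q3305213;         # instance of: painting
    wdt:P571 ?date;              # inception
    wikibase:identifiers ?idcount;
    wikibase:sitelinks ?sitelinks.
}
ORDER BY DESC (?sitelinks)
LIMIT 1000
```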

tom-james-watson commented 2 years ago

Potentially useful resource: https://www.wikidata.org/wiki/Wikidata:Request_a_query

tuukka commented 2 years ago

From tom-james-watson/wikitrivia#26:

Here's a query for QLever that returns all suitable Wikidata items and their required attributes, sorted by sitelinks count (page views are not available for queries). You can change "en" to any other language code: https://qlever.cs.uni-freiburg.de/wikidata/OycBUK

tom-james-watson commented 2 years ago

That's great, nice one! I think with that as a base it would be possible to formulate more complex, specific queries that we could then combine to build up an interesting set.

What would be good to avoid is too many "boring" entries like when <administrative region> was created. The difficult problem is filtering out items that have a lot of sitelinks / page views but whose associated date isn't that interesting. E.g. France is going to rank highly, but the founding date of France is a) arguably uninteresting and b) probably debatable.

I think more interesting things to see are things like:

- discoveries and inventions
- the creation or release of famous creative works

I think that's why it may be easier to instead stitch together multiple more "niche" queries, and thereby avoid having too many results that are just death of <famous person> or creation of <city>.

Theo-Strongin commented 2 months ago

Hi Tom, huge fan of your game!

I wrote some Python code to generate cards from the Wikidata Query Service API, focusing solely on events. I've attached the code and the output it currently generates to this message. It creates about 10,000 cards (though I've set a low threshold for the minimum acceptable number of sitelinks per item).

Currently, the code works by sending the API separate queries for individual event categories and combining the outputs. I can modify the code to get different sets of cards, such as discoveries and inventions or creative works, as you suggest above.

I generated the list of categories using a QLever query that finds the most common categories of Wikidata items with the start date (P580) property.
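
The exact query isn't shown here, but a minimal sketch of such a category count, assuming instance of (P31) as the "category" property, might look like:

```sparql
# Sketch: the most common classes (P31 values) among items with a start date (P580).
SELECT ?class (COUNT(?item) AS ?count) WHERE {
  ?item wdt:P580 ?start;
        wdt:P31 ?class.
}
GROUP BY ?class
ORDER BY DESC(?count)
LIMIT 100
```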

Do you have any suggestions for ways I could improve my code?

Attachments: cards.json, generator.txt