jackrusher / mundaneum

A clojure wrapper around WikiData
BSD Zero Clause License
128 stars 16 forks source link
clojure dsl sparql wikidata

Mundaneum

This is a thin Clojure wrapper around the Wikidata project's massive semantic database. It's named after the Mundaneum, which was Paul Otley's mad and wonderful c. 1910 vision for something like the World Wide Web.

(There's a mini-doc about him and it here.)

Coordinates

Recent changes to the group-id policy at Clojars have made publishing artifacts there less appealing to me, so for the moment this library is available to deps.edn users at:

io.github.jackrusher/mundaneum {:git/sha "SHA"}

Motivation

Wikidata is amazing! And it provides API access to all the knowledge it has collected! This is great, but exploratory programmatic access to that data can be fairly painful.

The official Wikidata API Java library offers a document-oriented interface that makes it hard to ask interesting questions. A better way to do most things is with the Wikidata query service, which uses the standard Semantic Web query language, SPARQL.

The SPARQL query service is nice, but because WikiData's data model must cope with (a) items with multiple names in multiple languages, and (b) single names that map to multiple items, they've used a layer of abstraction by which everything in the DB is referred to by an id that looks like P50 (property number 50, meaning "author") or Q6882 (entity number 6882, the author "James Joyce").

For example, to get a selection of works authored by James Joyce, one would issue a query like:

SELECT ?work
WHERE { ?work wdt:P50 wd:Q6882. } 
LIMIT 10

(Users of Datomic will recognize the ?work style of selector, which is not a coincidence as SPARQL and Datomic are both flavors of Datalog.)

The above query is simple enough, except for the non-human-readable identifiers in the WHERE clause, which were both found by manually searching the web interface at Wikidata.

In order to do exploratory programming against this API in a more human-friendly way without leaving my coding environment, I've built this library. The approach I took was:

(wdt :author)
;;=> :wdt/P50
(entity "James Joyce")
;;=> :wd/Q6882

;; the entity function tries to return the most notable entity 
;; that matches, but sometimes that isn't what you want.

(describe (entity "U2"))
;;=> "Irish alternative rock band"

;; not the one I meant, let's try with more info:
(describe (entity "U2" (wdt :part-of) (entity "Berlin U-Bahn")))
;;=> "underground line in Berlin"

This already helps to keep my emacs-driven process running smoothly. The next point of irritation was assembling query strings by hand, like an animal. Luckily, the very well put together Flint library provides an excellent Clojure DSL for the SPARQL query language. Combined with my helper functions, this looks like:

;; what are some works authored by James Joyce?
(query `{:select [?work ?workLabel]
         :where  [[?work ~(wdt :author) ~(entity "James Joyce")]]
         :limit 10}
;; [{:work "Q864141", :workLabel "Eveline"}
;;  {:work "Q861185", :workLabel "A Little Cloud"}
;;  {:work "Q459592", :workLabel "Dubliners"}
;;  {:work "Q682681", :workLabel "Giacomo Joyce"}
;;  {:work "Q764318", :workLabel "Two Gallants"}
;;  {:work "Q429967", :workLabel "Chamber Music"}
;;  {:work "Q465360", :workLabel "A Portrait of the Artist as a Young Man"}
;;  {:work "Q6511", :workLabel "Ulysses"}
;;  {:work "Q866956", :workLabel "An Encounter"}
;;  {:work "Q6507", :workLabel "Finnegans Wake"}] 

This is actually quite similar to the programmatic query interface I created for the first purpose-built TripleStore around 20 years ago.

This code is much easier to understand if you have some familiarity with SPARQL and how it can be used to query Wikidata. I strongly recommend this introduction to get started. I've tried to make sure all the examples are easy to translate to the DSL used here.

Multilingual support

In queries, a language-specific string can be specified with a map containing an ISO language code keyword as key and a string as value, like {:en "human"}. This can also be used to specify the language to use for an entity lookup:

(entity {:de "Mensch"})
;;=> :wd/Q5

The label and describe functions can also take an extra first parameter indicating the language to use:

(label :de :wd/Q5)
;;=> "Mensch"
(describe :fr :wd/Q5)
;;=> "individu appartenant à l’espèce Homo sapiens, la seule espèce restante du genre Homo – distinct de « humain fictif » et de « humain possiblement fictif »"

Note that these calls are all memoized per language, so repeatedly looking up a given entity/label/description causes no additional network traffic.

In addition to these affordances, there is also a dynamic variable mundaneum.query/*default-language* which is an atom containing an ISO language code keyword like :en that controls which language will be used by default for labels and description input/output. If you are planning to enjoy an interactive session in French you could set the default like this:

(reset! *default-language* :fr)

On the other hand, if you want to mix languages freely, you can use a local binding like this:

;; lookup an entity using Thai as the default language, then get the
;; English label for it.
(let [thai-name "กรุงเทพมหานคร"
      id (binding [*default-language* :th]
           (entity thai-name))]
  (str thai-name " is called " (label id) " in English."))
;;=> "กรุงเทพมหานคร is called Bangkok in English."

Although if one is doing something like this, it's probably nicer to use the previously described API:

(describe (entity {:th "กรุงเทพมหานคร"}))
"capital of Thailand"

Learn more

Additional documentation can be found in the Clerk notebooks in the notebooks directory, beginning with basics.clj. If you start your REPL with the :dev alias, you'll already have Clerk loaded. (This will happen automatically if you use cider-jack-in from Emacs via a bit of configuration in this repo's .dir-locals file.)

Enjoy!

License

Copyright © 2016-2022 Jack Rusher. Distributed under the BSD 0-clause license.