Include subtypes in Wikidata types to CSL mapping

nichtich commented 5 years ago

The mapping from Wikidata types to CSL document types lacks a lot of Wikidata types. Wikidata has more than 1000 document types to support (including several false positives), so the full list should be updated automatically and stored in a JSON file.

P.S: This SPARQL query gives the publication type hierarchy from Wikidata. We only need to hard-code mapping to CSL document types for broad concepts so we can derive all subtypes.

larsgw commented 5 years ago

I was thinking (tweet) of making recommended mappings of Wikidata -> {BibTeX, CSL-JSON, ...} available from WikiCite. Maybe that's a good idea to do with this too. I can make a script to generate those mappings based on a mapping of CSL types to the most generic Wikidata equivalent, then assigning other Wikidata types to the nearest assigned parent.

I'll take these as a starting point, add entry-dictionary, which is the only CSL type missing, review if left-over generic types would fit for any of these, and then assign other generics to book, which is, I believe, the default CSL type. Feedback on this process and the starting point would be welcome.
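
A rough sketch of the "nearest assigned parent" idea, under stated assumptions: `seeds` and `parents` are made-up sample data with placeholder ids (not real Wikidata classes), `nearestCslType` is a hypothetical helper name, and the `book` fallback only follows the guess above about the default type. The search goes upwards level by level, so the closest assigned ancestor wins.

```javascript
// Sketch only: placeholder ids, not real Wikidata output.
// seeds: most generic Wikidata class chosen for each CSL type
const seeds = { QA: 'book', QB: 'article' }
// parents: class id -> direct superclasses (subclass of, P279)
const parents = {
  QC: ['QA'],       // direct subclass of the "book" seed
  QD: ['QC', 'QE'], // reaches QA in two steps
  QE: ['QF'],       // nothing assigned above this class
  QG: ['QD', 'QB']  // QB is one step away, QA is three steps away
}

// Breadth-first search upwards: check one level of ancestors at a time.
function nearestCslType (qid, seeds, parents, fallback = 'book') {
  let level = [qid]
  const visited = new Set(level)
  while (level.length) {
    const hit = level.find(id => id in seeds)
    if (hit) return seeds[hit]
    const next = []
    for (const id of level) {
      for (const broader of parents[id] || []) {
        if (!visited.has(broader)) {
          visited.add(broader)
          next.push(broader)
        }
      }
    }
    level = next
  }
  return fallback // no assigned ancestor found
}

console.log(nearestCslType('QD', seeds, parents))
console.log(nearestCslType('QG', seeds, parents))
```

If two assigned ancestors sit at the same distance, this sketch simply takes the first one it meets, so ties would still need a manual rule.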

nichtich commented 5 years ago

Playing around with the existing mapping and subclass hierarchy of publication (Q732577) I realized several Wikidata classes in the mapping were not subclasses of publication and vice versa. The first can be fixed in Wikidata and does not affect citation-js. For the other direction major classes missing from the mapping include:

Q732577 (music releases such as CDs, singles, albums...)
Q207184 (press release)
Q333291 (abstract)
Q737498 (journal)
Q11032 (newspaper)
Q1228945 (working paper)
..

If CSL types had official URIs we could manage the mapping in Wikidata with Property equivalent class (P1709).

larsgw commented 5 years ago

Would Q386724 (work) or Q15401930 (product) be a better root in the meantime?

nichtich commented 5 years ago

At http://wikicite.org/statistics.html I use Q732577 as root. Does it make sense to cite other kinds of entities? Actually this is a classical question in library and information science answered by Suzanne Briet in 1951 (I just created http://www.wikidata.org/entity/Q58378258 for the book). The answer is it depends on context. Nevertheless I would not start with complex cases but stick to Q732577 to begin with.

nichtich commented 5 years ago

While adding mappings to Wikidata I found that many of the mapped classes are indeed not publication types, so Q386724 (work) is a better root.

nichtich commented 5 years ago

First draft to get mappings to CSL publication types and hierarchy of work types and use this to map arbitrary Wikidata class ids given on command line:

const wdk = require('wikidata-sdk')
require('isomorphic-fetch')

const { sparqlQuery } = wdk

function getWikidataTaxonomy(root) {
  const sparql = `SELECT DISTINCT ?item ?broader WHERE {
      ?item wdt:P279+ wd:${root} .
      ?item wdt:P279 ?broader .
  }`

  return fetch(sparqlQuery(sparql))
    .then( response => response.json() )
    .then( wdk.simplify.sparqlResults )
    .then( result => result.reduce(
      (obj, row) => {
        if (obj[row.item]) {
          obj[row.item].push(row.broader)
        } else {
          obj[row.item] = [row.broader]
        }
        return obj
      },{}))
}

function getWikidataMapping(root, prefix) {
  const length = prefix.length
  const sparql = `
    SELECT DISTINCT ?item ?type WHERE  {
      ?item wdt:P279+ wd:${root} .
      ?item wdt:P2888|wdt:P1709 ?type .
      FILTER (SUBSTR(STR(?type),1,${length}) = "${prefix}")
    }`

  return fetch(sparqlQuery(sparql))
    .then( response => response.json() )
    .then( wdk.simplify.sparqlResults )
    .then( result => result.reduce(
      (obj, row) => {
        obj[row.item] = row.type.substring(length)
        return obj
      },{}))
}

const root = 'Q386724' // work
const prefix = 'https://citationstyles.org/ontology/type/'

// construct a lookup function
Promise.all([
  getWikidataMapping(root, prefix),
  getWikidataTaxonomy(root)
]).then( values => {
  const [mapping, taxonomy] = values

  return (qid) => {
    let queue = [qid]
    let visited = {}

    while (queue.length) {
      let id = queue.pop()

      if (id in mapping) {
        return mapping[id]
      } else if (id in taxonomy && !visited[id]) {
        visited[id] = true
        taxonomy[id].forEach( broader => {
            if (!visited[broader]) {
              queue.push(broader)
            }
        } )
      }
    }
  }
})
.then( wikidata2csl => {

  // read Wikidata ids line by line
  require('readline').createInterface({
    input: process.stdin,
    output: process.stdout,
    terminal: false
  }).on('line', function(line){
    console.log(wikidata2csl(line))
  })
})

I'm not sure when it makes sense to pre-calculate a mapping table for each Wikidata class id that can be mapped to a CSL type instead of traversing the graph directly.
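
For comparison, a minimal sketch of the pre-calculated variant. It assumes `mapping` and `taxonomy` have the same shapes as the results of getWikidataMapping and getWikidataTaxonomy above; the ids are placeholders and `resolve` and `buildLookupTable` are hypothetical helper names.

```javascript
// Sketch only: placeholder ids standing in for real query results.
const mapping = { QA: 'book', QB: 'article' }
const taxonomy = { QC: ['QA'], QD: ['QC'], QE: ['QF'], QG: ['QB'] }

// Walk up the hierarchy breadth-first until a mapped class is found.
function resolve (qid, mapping, taxonomy) {
  const queue = [qid]
  const visited = new Set(queue)
  while (queue.length) {
    const id = queue.shift()
    if (id in mapping) return mapping[id]
    for (const broader of taxonomy[id] || []) {
      if (!visited.has(broader)) {
        visited.add(broader)
        queue.push(broader)
      }
    }
  }
  return undefined
}

// Resolve every known class once and keep only the ones that map to a type.
function buildLookupTable (mapping, taxonomy) {
  const table = {}
  const ids = new Set([...Object.keys(mapping), ...Object.keys(taxonomy)])
  for (const qid of ids) {
    const type = resolve(qid, mapping, taxonomy)
    if (type !== undefined) table[qid] = type
  }
  return table
}

const table = buildLookupTable(mapping, taxonomy)
console.log(JSON.stringify(table, null, 2))
```

The resulting object is plain JSON, so it could be written to a file once and loaded at runtime without querying Wikidata, at the cost of going stale until it is rebuilt.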

larsgw commented 5 years ago

What about things like Count of Barcelos, that show up because:

work -> electronic page -> web page -> MediaWiki page -> Wikimedia internal item -> Wikidata internal entity -> Wikidata item -> class or metaclass of Wikidata ontology -> fixed-order metaclass -> first-order metaclass -> rank -> royal or noble rank -> hereditary title -> count -> conde -> Count of Barcelos

86673 results seems a bit much...

But web page is not a subclass of publication, so that would not work either.

nichtich commented 5 years ago

The class hierarchy is a wiki so it will always contain arguable parts. But who cares if the parts you rely on are useful? In the given example I made web page subclass of document and some other modifications so the number of results is in the order of 1000s. I'd expect web site to be more citable by the way.

larsgw commented 5 years ago

Which results? I'm still getting 86 thousand for subclasses of work. web page has "exact match -> csl:webpage", should website have that then?

nichtich commented 5 years ago

The work taxonomy is quite large because it contains all kinds of works such as technical artifacts, coins, etc. If this is a problem we may use a subset of other root items. The current mapping to CSL types is covered by publication (Q732577), intellectual work (Q15621286), written work (Q47461344), legal case (Q2334719), review (Q265158) with ~15000 classes.
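
One way to query several roots at once would be a VALUES clause. The sketch below only builds the query string and does not contact the endpoint; `taxonomyQuery` is a hypothetical helper name, and the root ids are the ones listed above.

```javascript
// Root classes named in the comment above.
const roots = ['Q732577', 'Q15621286', 'Q47461344', 'Q2334719', 'Q265158']

// Build a taxonomy query covering the subclasses of every root.
// Like the single-root draft, this does not return the roots themselves.
function taxonomyQuery (roots) {
  const values = roots.map(id => `wd:${id}`).join(' ')
  return `SELECT DISTINCT ?item ?broader WHERE {
  VALUES ?root { ${values} }
  ?item wdt:P279+ ?root .
  ?item wdt:P279 ?broader .
}`
}

console.log(taxonomyQuery(roots))
```

The string could be passed to sparqlQuery in place of the single-root query in the draft.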

larsgw commented 5 years ago

This is what it reports:

{
  "undefined": 70995,
  "article": 49,
  "review": 6,
  "dataset": 359,
  "report": 41,
  "map": 143,
  "legal_case": 55,
  "webpage": 881,
  "patent": 10,
  "broadcast": 267,
  "book": 237,
  "manuscript": 62,
  "paper-conference": 2,
  "motion_picture": 322,
  "article-newspaper": 23,
  "song": 194,
  "chapter": 10,
  "interview": 16,
  "speech": 53,
  "treaty": 68,
  "bill": 10,
  "review-book": 1,
  "entry-dictionary": 3,
  "post": 2,
  "entry": 2,
  "article-magazine": 1,
  "musical_score": 5,
  "post-weblog": 3,
  "thesis": 32,
  "entry-encyclopedia": 1
}

A lot fewer web pages than I expected, given that my entire terminal scrollback was filled with them. Dropping undefined, this seems like a pretty good set.
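
A tally like this can be produced in a few lines. This is only a sketch with placeholder data, not necessarily how the numbers above were generated; `countByType` is a hypothetical helper and `lookup` stands in for the wikidata2csl function from the draft.

```javascript
// Sketch only: placeholder lookup data instead of real Wikidata results.
const sample = { QA: 'book', QC: 'book', QG: 'article' }
const lookup = id => sample[id]

// Count how many class ids resolve to each CSL type.
function countByType (ids, lookup) {
  const counts = {}
  for (const id of ids) {
    // a class without a mapping ends up under the key "undefined"
    const type = String(lookup(id))
    counts[type] = (counts[type] || 0) + 1
  }
  return counts
}

const counts = countByType(['QA', 'QC', 'QG', 'QZ'], lookup)
console.log(JSON.stringify(counts, null, 2))

// Drop the unmapped classes to keep only the usable part.
const mappedOnly = Object.assign({}, counts)
delete mappedOnly.undefined
```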

larsgw commented 5 years ago

There are some doubles BTW:

Q727715 book,manuscript
Q1050259 book,dataset
Q191072 map,dataset
Q2353983 map,manuscript
Q4202018 interview,article-newspaper
Q59908 article,article-newspaper
Q267628 article,article-newspaper
Q871232 article,article-newspaper
Q3694604 article,article-newspaper
Q3719255 article,article-newspaper
Q19375673 article,article-newspaper
Q19776345 speech,broadcast
Q1164267 book,article
Q6960620 book,article
Q1503133 book,article
Q2438528 article,report
Q26260507 broadcast,article
Q506240 motion_picture,broadcast
Q653916 motion_picture,broadcast
Q20088085 entry-dictionary,webpage
Q20088089 entry-dictionary,webpage
Q1400059 dataset,book
Q17633526 article-newspaper,webpage
Q854995 broadcast,motion_picture
Q20136634 article,webpage
Q11086742 motion_picture,broadcast
Q10885494 article,paper-conference
Q3962157 map,dataset
Q16825889 map,dataset
Q690851 manuscript,book
Q2321734 motion_picture,broadcast
Q26225677 motion_picture,broadcast
Q914229 article,article-newspaper
Q5465451 article,article-newspaper
Q25054829 dataset,webpage
Q59191021 dataset,webpage
Q59248059 dataset,webpage
Q59248072 dataset,webpage
Q2933856 book,manuscript
Q26267864 dataset,webpage
Q457843 dataset,book
Q1371849 dataset,webpage
Q7999883 article,article-newspaper
Q6899707 map,manuscript
Q57987419 interview,article-newspaper
Q57987455 interview,article-newspaper
Q57987589 interview,article-newspaper
Q26225493 book,article
Q2983424 motion_picture,broadcast
Q124922 motion_picture,broadcast
Q4453959 motion_picture,broadcast
Q23368955 motion_picture,broadcast
Q55848868 motion_picture,broadcast
Q17438413 dataset,webpage
Q17146139 map,webpage
Q11396323 motion_picture,broadcast
Q21759196 motion_picture,broadcast
Q7864671 motion_picture,broadcast
Q240862 motion_picture,broadcast
Q914242 motion_picture,broadcast
Q5338721 motion_picture,broadcast
Q26225765 motion_picture,broadcast
Q55936401 motion_picture,broadcast
Q220898 motion_picture,broadcast
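
Doubles like these show up when a class has several parents that lead to different mapped ancestors. The thread does not say how the list was produced; the following is just one possible sketch, with placeholder ids and a hypothetical helper `allCslTypes` that collects every CSL type reachable from a class, stopping each path at its first mapped ancestor.

```javascript
// Sketch only: placeholder ids, not real Wikidata classes.
const mapping = { QA: 'book', QB: 'manuscript' }
const taxonomy = { QC: ['QA', 'QB'], QD: ['QA'] }

// Collect all CSL types reachable via superclasses of qid.
function allCslTypes (qid, mapping, taxonomy) {
  const types = new Set()
  const stack = [qid]
  const visited = new Set(stack)
  while (stack.length) {
    const id = stack.pop()
    if (id in mapping) {
      types.add(mapping[id])
      continue // do not look past a mapped ancestor on this path
    }
    for (const broader of taxonomy[id] || []) {
      if (!visited.has(broader)) {
        visited.add(broader)
        stack.push(broader)
      }
    }
  }
  return [...types].sort()
}

// Report only the classes that resolve to more than one type.
for (const qid of Object.keys(taxonomy)) {
  const types = allCslTypes(qid, mapping, taxonomy)
  if (types.length > 1) console.log(qid, types.join(','))
}
```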

davidar commented 5 years ago

citation-js/citation-js#5 seems to have caused a regression for academic journal article, which was previously mapped to article-journal but is now mapped to just article (in fact nothing seems to map to article-journal anymore)

larsgw commented 5 years ago

Yep, article-journal wasn't mapped in Wikidata yet, but I added it a few days ago. I'll re-build the index (which should include some other improvements like Portuguese noble titles not being a subclass of web page anymore) and include it in the next release.

larsgw / citation.js

Include subtypes in Wikidata types to CSL mapping #166