Closed nichtich closed 5 years ago
I was thinking (tweet) of making recommended mappings of Wikidata -> {BibTeX, CSL-JSON, ...} available from WikiCite. Maybe that's a good idea to do with this too. I can make a script to generate those mappings based on a mapping of CSL types to the most generic Wikidata equivalent, then assigning other Wikidata types to the nearest assigned parent.
I'll take these as a starting point, add entry-dictionary, which is the only CSL type missing, review whether any leftover generic types would fit one of these, and then assign the other generics to book, which is, I believe, the default CSL type. Feedback on this process and the starting point would be welcome.
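A minimal sketch of the "nearest assigned parent" idea, assuming the subclass hierarchy is already available as a plain object. The function name, the shape of `mapping` and `parents`, and the toy ids are made up for illustration; they are not part of citation-js or Wikidata.

```javascript
// Hypothetical helper: return the CSL type of the nearest mapped ancestor.
// `mapping`: { classId: cslType } for the hand-assigned generic classes.
// `parents`: { classId: [direct superclass ids] } (subclass of, P279).
// `fallback`: type used when no ancestor is mapped (assumed to be 'book').
function nearestCslType (qid, mapping, parents, fallback = 'book') {
  const visited = new Set([qid])
  let level = [qid]
  // Breadth-first: check all classes at distance n before distance n + 1
  while (level.length) {
    const next = []
    for (const id of level) {
      if (id in mapping) return mapping[id]
      for (const parent of parents[id] || []) {
        if (!visited.has(parent)) {
          visited.add(parent)
          next.push(parent)
        }
      }
    }
    level = next
  }
  return fallback
}

// Toy data, not real Wikidata ids
const toyMapping = { A: 'book', B: 'article' }
const toyParents = { C: ['B'], D: ['C', 'A'], E: ['X'] }
console.log(nearestCslType('C', toyMapping, toyParents))
console.log(nearestCslType('D', toyMapping, toyParents))
console.log(nearestCslType('E', toyMapping, toyParents))
```

Going level by level means a class with several parents gets the type of the closest mapped one; ties at the same distance are decided by the order of the parent list.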
Playing around with the existing mapping and the subclass hierarchy of publication (Q732577), I realized that several Wikidata classes in the mapping are not subclasses of publication, and vice versa. The first can be fixed in Wikidata and does not affect citation-js. For the other direction, major classes missing from the mapping include:
If CSL types had official URIs we could manage the mapping in Wikidata with Property equivalent class (P1709).
Would Q386724 (work) or Q15401930 (product) be a better root in the meantime?
At http://wikicite.org/statistics.html I use Q732577 as root. Does it make sense to cite other kinds of entities? Actually this is a classic question in library and information science, answered by Suzanne Briet in 1951 (I just created http://www.wikidata.org/entity/Q58378258 for the book). The answer is that it depends on context. Nevertheless, I would not start with complex cases but stick to Q732577 to begin with.
While adding mappings to Wikidata I found that many classes are indeed not publication types, so Q386724 (work) is a better root.
First draft: fetch the mappings to CSL publication types and the hierarchy of work types, then use them to map arbitrary Wikidata class ids read line by line from standard input:
const wdk = require('wikidata-sdk')
require('isomorphic-fetch')

const { sparqlQuery } = wdk

// Fetch all subclasses of `root` as { item: [direct superclasses] }
function getWikidataTaxonomy(root) {
  const sparql = `SELECT DISTINCT ?item ?broader WHERE {
    ?item wdt:P279+ wd:${root} .
    ?item wdt:P279 ?broader .
  }`
  return fetch(sparqlQuery(sparql))
    .then( response => response.json() )
    .then( wdk.simplify.sparqlResults )
    .then( result => result.reduce(
      (obj, row) => {
        if (obj[row.item]) {
          obj[row.item].push(row.broader)
        } else {
          obj[row.item] = [row.broader]
        }
        return obj
      }, {}))
}

// Fetch subclasses of `root` that have an exact match (P2888) or
// equivalent class (P1709) starting with `prefix`, as { item: type }
function getWikidataMapping(root, prefix) {
  const length = prefix.length
  const sparql = `
    SELECT DISTINCT ?item ?type WHERE {
      ?item wdt:P279+ wd:${root} .
      ?item wdt:P2888|wdt:P1709 ?type .
      FILTER (SUBSTR(STR(?type),1,${length}) = "${prefix}")
    }`
  return fetch(sparqlQuery(sparql))
    .then( response => response.json() )
    .then( wdk.simplify.sparqlResults )
    .then( result => result.reduce(
      (obj, row) => {
        // strip the prefix so only the CSL type name remains
        obj[row.item] = row.type.substring(length)
        return obj
      }, {}))
}

const root = 'Q386724' // work
const prefix = 'https://citationstyles.org/ontology/type/'

// construct a lookup function
Promise.all([
  getWikidataMapping(root, prefix),
  getWikidataTaxonomy(root)
]).then( values => {
  const [mapping, taxonomy] = values
  return (qid) => {
    let queue = [qid]
    let visited = {}
    // pop() makes this a depth-first walk: the first mapped ancestor
    // found wins, which is not necessarily the nearest one.
    // Returns undefined if no ancestor is mapped.
    while (queue.length) {
      let id = queue.pop()
      if (id in mapping) {
        return mapping[id]
      } else if (id in taxonomy && !visited[id]) {
        visited[id] = true
        taxonomy[id].forEach( broader => {
          if (!visited[broader]) {
            queue.push(broader)
          }
        } )
      }
    }
  }
})
.then( wikidata2csl => {
  // read Wikidata ids line by line
  require('readline').createInterface({
    input: process.stdin,
    output: process.stdout,
    terminal: false
  }).on('line', function(line){
    // trim so trailing whitespace does not break the lookup
    console.log(wikidata2csl(line.trim()))
  })
})
I'm not sure when it makes sense to pre-calculate a mapping table for every Wikidata class id that can be mapped to a CSL type instead of traversing the graph directly.
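For comparison, a pre-calculated table could be built in one pass by pushing the mapped types down the hierarchy. This is only a sketch under assumptions: `precomputeTable` and the toy ids are hypothetical, and `mapping` and `parents` are assumed to have the same shape as the two objects the draft above fetches.

```javascript
// Hypothetical sketch: build { classId: cslType } for every class that has
// a mapped ancestor, using a multi-source breadth-first walk downwards.
function precomputeTable (mapping, parents) {
  // Invert child -> parents into parent -> children
  const children = {}
  for (const [child, list] of Object.entries(parents)) {
    for (const parent of list) {
      if (!children[parent]) children[parent] = []
      children[parent].push(child)
    }
  }

  // Start from all mapped classes at once; each unmapped class inherits
  // the type of whichever mapped ancestor reaches it first (the nearest)
  const table = { ...mapping }
  let level = Object.keys(mapping)
  while (level.length) {
    const next = []
    for (const id of level) {
      for (const child of children[id] || []) {
        if (!(child in table)) {
          table[child] = table[id]
          next.push(child)
        }
      }
    }
    level = next
  }
  return table
}

// Toy data, not real Wikidata ids
const table = precomputeTable(
  { A: 'book', B: 'article' },
  { C: ['B'], D: ['C', 'A'], E: ['X'] }
)
console.log(JSON.stringify(table))
```

The result is a plain object that can be stored as JSON and looked up without any traversal or network access. The trade-off is that it has to be rebuilt whenever the hierarchy or the mapping in Wikidata changes.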
What about things like Count of Barcelos, that show up because:
work -> electronic page -> web page -> MediaWiki page -> Wikimedia internal item -> Wikidata internal entity -> Wikidata item -> class or metaclass of Wikidata ontology -> fixed-order metaclass -> first-order metaclass -> rank -> royal or noble rank -> hereditary title -> count -> conde -> Count of Barcelos
86673 results seems a bit much...
But web page is not a subclass of publication, so that would not work either.
The class hierarchy is a wiki, so it will always contain arguable parts. But who cares, as long as the parts you rely on are useful? In the given example I made web page a subclass of document and made some other modifications, so the number of results is now on the order of thousands. I'd expect web site to be more citable, by the way.
Which results? I'm still getting 86 thousand for subclasses of work. web page has "exact match -> csl:webpage"; should website have that then?
The work taxonomy is quite large because it contains all kinds of works such as technical artifacts, coins, etc. If this is a problem we may use a subset of other root items. The current mapping to CSL types is covered by publication (Q732577), intellectual work (Q15621286), written work (Q47461344), legal case (Q2334719), review (Q265158) with ~15000 classes.
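Restricting the taxonomy query to several roots could look roughly like this. It is a sketch only: `taxonomyQuery` is a hypothetical helper that just builds the SPARQL string, using `P279*` so that the root classes themselves are included.

```javascript
// Hypothetical helper: build a taxonomy query limited to several roots.
// Only the query string is built here; sending it is left to the caller.
function taxonomyQuery (roots) {
  for (const id of roots) {
    if (!/^Q\d+$/.test(id)) throw new Error(`Not a Wikidata item id: ${id}`)
  }
  const values = roots.map(id => `wd:${id}`).join(' ')
  return `SELECT DISTINCT ?item ?broader WHERE {
  VALUES ?root { ${values} }
  ?item wdt:P279* ?root .
  ?item wdt:P279 ?broader .
}`
}

// The five root classes listed above
console.log(taxonomyQuery([
  'Q732577', // publication
  'Q15621286', // intellectual work
  'Q47461344', // written work
  'Q2334719', // legal case
  'Q265158' // review
]))
```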
This is what it reports:
{
  "undefined": 70995,
  "article": 49,
  "review": 6,
  "dataset": 359,
  "report": 41,
  "map": 143,
  "legal_case": 55,
  "webpage": 881,
  "patent": 10,
  "broadcast": 267,
  "book": 237,
  "manuscript": 62,
  "paper-conference": 2,
  "motion_picture": 322,
  "article-newspaper": 23,
  "song": 194,
  "chapter": 10,
  "interview": 16,
  "speech": 53,
  "treaty": 68,
  "bill": 10,
  "review-book": 1,
  "entry-dictionary": 3,
  "post": 2,
  "entry": 2,
  "article-magazine": 1,
  "musical_score": 5,
  "post-weblog": 3,
  "thesis": 32,
  "entry-encyclopedia": 1
}
A lot fewer web page results than I expected, given that my entire terminal scrollback was filled with them. Dropping undefined, this seems like a pretty good set.
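A tally like the one above can be produced with a few lines. This is a sketch with made-up names and toy data; `lookup` stands for any function that returns a CSL type or undefined for a class id.

```javascript
// Hypothetical helper: count how many class ids resolve to each CSL type.
// Unmapped classes are counted under the key "undefined".
function countByType (ids, lookup) {
  const counts = {}
  for (const id of ids) {
    const type = String(lookup(id))
    counts[type] = (counts[type] || 0) + 1
  }
  return counts
}

// Toy data, not real Wikidata ids
const toyTable = { X1: 'book', X2: 'book', X3: 'webpage' }
const report = countByType(['X1', 'X2', 'X3', 'X4'], id => toyTable[id])
console.log(JSON.stringify(report, null, 2))
```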
There are some doubles BTW:
Q727715 book,manuscript
Q1050259 book,dataset
Q191072 map,dataset
Q2353983 map,manuscript
Q4202018 interview,article-newspaper
Q59908 article,article-newspaper
Q267628 article,article-newspaper
Q871232 article,article-newspaper
Q3694604 article,article-newspaper
Q3719255 article,article-newspaper
Q19375673 article,article-newspaper
Q19776345 speech,broadcast
Q1164267 book,article
Q6960620 book,article
Q1503133 book,article
Q2438528 article,report
Q26260507 broadcast,article
Q506240 motion_picture,broadcast
Q653916 motion_picture,broadcast
Q20088085 entry-dictionary,webpage
Q20088089 entry-dictionary,webpage
Q1400059 dataset,book
Q17633526 article-newspaper,webpage
Q854995 broadcast,motion_picture
Q20136634 article,webpage
Q11086742 motion_picture,broadcast
Q10885494 article,paper-conference
Q3962157 map,dataset
Q16825889 map,dataset
Q690851 manuscript,book
Q2321734 motion_picture,broadcast
Q26225677 motion_picture,broadcast
Q914229 article,article-newspaper
Q5465451 article,article-newspaper
Q25054829 dataset,webpage
Q59191021 dataset,webpage
Q59248059 dataset,webpage
Q59248072 dataset,webpage
Q2933856 book,manuscript
Q26267864 dataset,webpage
Q457843 dataset,book
Q1371849 dataset,webpage
Q7999883 article,article-newspaper
Q6899707 map,manuscript
Q57987419 interview,article-newspaper
Q57987455 interview,article-newspaper
Q57987589 interview,article-newspaper
Q26225493 book,article
Q2983424 motion_picture,broadcast
Q124922 motion_picture,broadcast
Q4453959 motion_picture,broadcast
Q23368955 motion_picture,broadcast
Q55848868 motion_picture,broadcast
Q17438413 dataset,webpage
Q17146139 map,webpage
Q11396323 motion_picture,broadcast
Q21759196 motion_picture,broadcast
Q7864671 motion_picture,broadcast
Q240862 motion_picture,broadcast
Q914242 motion_picture,broadcast
Q5338721 motion_picture,broadcast
Q26225765 motion_picture,broadcast
Q55936401 motion_picture,broadcast
Q220898 motion_picture,broadcast
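One way such doubles can be detected, assuming they come from classes whose different parents lead to different mapped types, is to collect every CSL type reachable from a class instead of stopping at the first hit. The function names and toy ids below are made up for illustration.

```javascript
// Hypothetical helper: collect all CSL types a class can inherit.
// Each branch stops at its first mapped class, so only the closest
// mapped ancestor on every path contributes a type.
function cslTypesOf (qid, mapping, parents) {
  const types = new Set()
  const visited = new Set()
  const stack = [qid]
  while (stack.length) {
    const id = stack.pop()
    if (visited.has(id)) continue
    visited.add(id)
    if (id in mapping) {
      types.add(mapping[id])
    } else {
      stack.push(...(parents[id] || []))
    }
  }
  return [...types]
}

// Hypothetical helper: list classes that inherit more than one type
function findDoubles (mapping, parents) {
  return Object.keys(parents)
    .map(id => [id, cslTypesOf(id, mapping, parents)])
    .filter(([, types]) => types.length > 1)
}

// Toy data, not real Wikidata ids
const doubles = findDoubles(
  { A: 'book', B: 'manuscript' },
  { C: ['A', 'B'], D: ['A'] }
)
for (const [id, types] of doubles) {
  console.log(id, types.join(','))
}
```

Which of the candidate types should win is a separate decision, for example preferring the more specific type or fixing the hierarchy in Wikidata.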
citation-js/citation-js#5 seems to have caused a regression for academic journal article, which was previously mapped to article-journal but is now mapped to just article (in fact nothing seems to map to article-journal anymore)
Yep, article-journal wasn't mapped in Wikidata yet, but I added it a few days ago. I'll re-build the index (which should include some other improvements like Portuguese noble titles not being a subclass of web page anymore) and include it in the next release.
The mapping from Wikidata types to CSL document types lacks a lot of Wikidata types. Wikidata has more than 1000 document types to support (including several false positives), so the full list should be updated automatically and stored in a JSON file.
P.S: This SPARQL query gives the publication type hierarchy from Wikidata. We only need to hard-code mapping to CSL document types for broad concepts so we can derive all subtypes.