mitar / bib2wikidata

Upload citation data to Wikidata

Getting started #1

Open Daniel-Mietchen opened 10 years ago

Daniel-Mietchen commented 10 years ago

csl2wikidata sounds interesting, and I would like to try it out by uploading the references (all, or at least the openly licensed ones) cited in https://en.wikipedia.org/wiki/Malaria to Wikidata.

However, I could not figure out how to get started, so some guidance on that would be appreciated.

mitar commented 10 years ago

It is not yet finished. We stopped because we could not find a good input format. At first we thought of using CSL, but CSL is really just a style-sheet language for citations. It does define an input format, but that format is not well defined, or at least I have not found a nice way (or code) to parse it into a document/object that I could then push into Wikidata. If you have a suggestion, please help.

We are planning to continue working on this at https://wikimania2014.wikimedia.org/wiki/Hackathon/Citathon, but we could discuss now what to use. So, I need a library that takes a bibliographic entry in some format and produces a standardized object I can then use to make API calls to Wikidata.

Here we started discussing what such an input format should be:

https://etherpad.mozilla.org/TCrCIcyEDL

cc @jure

Daniel-Mietchen commented 10 years ago

What I have in mind is the following steps:

  1. generate list of DOIs cited in https://en.wikipedia.org/wiki/Malaria
  2. look up the respective metadata via the CrossRef API
  3. (if necessary) convert the metadata into a format that csl2wikidata can ingest
  4. use the Wikidata API to create Wikidata items for each of these bibliographic items, adding the metadata using the appropriate properties.
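A minimal Python sketch of steps 1, 2 and 4 above, under stated assumptions: the function names are hypothetical, the DOI regex is only a heuristic for `|doi=` parameters in citation templates, and the article wikitext is assumed to have been fetched already. The CrossRef endpoint (`https://api.crossref.org/works/<DOI>`) and Wikidata's `wbcreateclaim` action with property P356 (DOI) are real, but authentication and the edit token are left out.

```python
import json
import re
from urllib.parse import quote

# Heuristic pattern for "|doi=10.xxxx/..." template parameters; real DOIs vary.
DOI_PATTERN = re.compile(r"\bdoi\s*=\s*(10\.\d{4,9}/[^\s|}]+)", re.IGNORECASE)


def extract_dois(wikitext):
    """Step 1: collect the unique DOIs found in the wikitext, in order."""
    seen = []
    for match in DOI_PATTERN.finditer(wikitext):
        doi = match.group(1)
        if doi not in seen:
            seen.append(doi)
    return seen


def crossref_url(doi):
    """Step 2: URL of the CrossRef REST API record for one DOI."""
    return "https://api.crossref.org/works/" + quote(doi, safe="")


def doi_claim_params(item_id, doi):
    """Step 4: request parameters for adding a DOI claim to an item.

    The edit token and login are deliberately omitted from this sketch.
    """
    return {
        "action": "wbcreateclaim",
        "entity": item_id,
        "property": "P356",  # Wikidata property for DOI
        "snaktype": "value",
        "value": json.dumps(doi),  # string values are sent JSON-encoded
        "format": "json",
    }
```

Step 3 is skipped here because it depends on whichever input format the tool ends up accepting.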

I set up a sample item about a research article under https://www.wikidata.org/wiki/Q15625490 . We (pinging @notconfusing and @wrought) want to use the Malaria ones in a demo of OA signalling on Friday morning (cf. https://wikimania2014.wikimedia.org/wiki/Submissions/Marking_open-access_references_cited_on_Wikipedia ). All three of us shall be at the Citathon.

mitar commented 10 years ago

So yes, please help me find a JavaScript library for step 3. :-)

HLHJ-zz commented 10 years ago

Steps 1-3 are already automated in Zotero; it has a function where you paste in a list of DOIs and it checks Crossref and returns all the metadata. Zotero can export (and import) Wikipedia Citation Templates, BibTeX, BibLaTeX, RefWorks, MODS, COinS, Citation Style Language/JSON, Refer/BibIX, RIS, TEI, Evernote, EndNote, Bibliontology RDF, Bookmarks, Unqualified Dublin Core RDF, and Zotero RDF. Mvolz borrows the Zotero JavaScript libraries to do this in Citoid, I believe.

I'd suggest making Zotero do step 4, too. There is interest in the Zotero community in making it interact with Wikidata, as it already interacts with Google Scholar and Crossref (see https://forums.zotero.org/discussion/36151/wikified-copyleft-bibliographic-database/ ). It would also make it easy for anyone already using Zotero to contribute. I have hundreds of papers in mine, many with metadata proofread by me; if I could upload them with a click, I would, regularly.

Since some of Google Scholar's and even the publishers' metadata has errors, it might be necessary to maintain errata (so it can automatically ignore repeated uploads of this data), and skip/manually merge fields that are already uploaded, so as not to overwrite good data with bad.
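One way the errata-and-skip idea could work, as a rough Python sketch. Everything here is an assumption for illustration: `merge_fields` is a hypothetical name, records are flat dictionaries of string values, and errata are plain (field, value) pairs.

```python
def merge_fields(existing, incoming, errata=()):
    """Merge incoming metadata into existing without overwriting anything.

    existing, incoming: dicts mapping field name to a string value.
    errata: collection of (field, value) pairs already known to be wrong.
    Returns (merged, conflicts); conflicts maps a field name to the pair
    (existing value, incoming value) and is meant for manual review.
    """
    merged = dict(existing)
    conflicts = {}
    for name, value in incoming.items():
        if value in ("", None) or (name, value) in errata:
            continue  # nothing to add, or a known-bad value
        current = merged.get(name)
        if current in ("", None):
            merged[name] = value  # fill a gap
        elif current != value:
            conflicts[name] = (current, value)  # keep existing, flag it
    return merged, conflicts
```

Fields that already hold a value are never replaced; a differing incoming value is only reported, so a human can decide whether to merge it or add it to the errata.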

Should the default be to dump the data in a bot's userspace if you don't configure it with your own Wikimedia account details? Should via-a-standard-bot be the only option? The exact UI would hardly matter for a beta version.

Mitar, you asked for fields we might need: there are the ones already generally available, then there are those we might want to add. Apart from the list on the etherpad, which was just a list of CSL fields that corresponded to Wikidata fields, I don't know that we made one. I was writing one a few days ago, which I post below; it's not very well thought out yet, so please comment.

Standard fields

(these are the ones Zotero already has for journal articles, lightly modified for database form)

type = journal article
abstract = (fair use to use in catalogues, by long-standing custom)
doi =
issn = 
volume = 
issue = 
pages = 
author(s) -> link to separate entries, merge manually later
    In an author entry:
    -- last name (problem here with, e.g., Chinese names)
    -- first name
    -- other names (people often publish under different names, should accept all scripts)
    -- institution -> link to
    -- contact info (usually an e-mail in a publication, maybe skip this one on Wikidata)
    -- urls of personal/institutional website(s)
title = 
journal -> link to (should contain journal abbreviation; there are databases of these. The SHERPA ROMEO and/or DOAJ databases might give you their data.)
language =
date = 
series -> fields like title, editor...
url of copy of record (the thing the DOI resolves to) = 
archive->
catalogue->
call number= (presumably for one of the above)
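As a rough illustration, a subset of the fields above could be held in a record like the following Python sketch, filled from a CSL-JSON item such as Zotero exports. The class and function names are hypothetical; the keys read from the input (`title`, `author`, `container-title`, `page`, `DOI`, `ISSN`, `issued`/`date-parts`) are standard CSL-JSON keys. Abstract, series, archive, catalogue and call number are left out to keep it short.

```python
from dataclasses import dataclass, field


@dataclass
class Author:
    family: str = ""  # "last name"; may hold a full literal name instead
    given: str = ""
    other_names: list = field(default_factory=list)


@dataclass
class JournalArticle:
    title: str = ""
    doi: str = ""
    issn: list = field(default_factory=list)
    volume: str = ""
    issue: str = ""
    pages: str = ""
    journal: str = ""
    language: str = ""
    date: str = ""  # e.g. "2014", "2014-08" or "2014-08-03"
    url: str = ""
    authors: list = field(default_factory=list)


def _first(value):
    """Accept either a plain string or a list of strings."""
    if isinstance(value, list):
        return value[0] if value else ""
    return value or ""


def from_csl(item):
    """Build a JournalArticle from one CSL-JSON style dictionary."""
    parts = (item.get("issued", {}).get("date-parts") or [[]])[0]
    date = "-".join(f"{int(p):02d}" for p in parts if p is not None)
    issn = item.get("ISSN") or []
    if isinstance(issn, str):
        issn = [issn]
    authors = [
        Author(family=a.get("family") or a.get("literal", ""),
               given=a.get("given", ""))
        for a in item.get("author", [])
    ]
    return JournalArticle(
        title=_first(item.get("title")),
        doi=item.get("DOI", ""),
        issn=list(issn),
        volume=str(item.get("volume", "")),
        issue=str(item.get("issue", "")),
        pages=item.get("page", ""),
        journal=_first(item.get("container-title")),
        language=item.get("language", ""),
        date=date,
        url=item.get("URL", ""),
        authors=authors,
    )
```

Authors are kept as separate objects rather than strings so that they can later be linked to their own entries and merged manually, as suggested in the list.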

Fields that might need modifying

Fields that might be desirable

There seem to be some publishers who claim that the citations in the bibliography of a scholarly article cannot be reproduced online under fair use. Others disagree. Presumably Wikimedia has professionals who could advise. It would be a really useful field, and can certainly be added for OA articles under CC-0, CC-BY, or CC-BY-NC.

HLHJ-zz commented 10 years ago

A field for linking to open lab notebooks containing the raw data of the study, probably as URL(s), would also be useful.

ghost commented 10 years ago

would it be better to be comprehensive with the choice of fields, as you have tried to do above, or be selective in order to make it easier to start?

if we avoid anything with potential copyright issues to start with - abstract, citations etc. - this would mean less to worry about while getting the project started. it might mean revisiting data at a later point to add more fields but that is probably not a big deal.


mitar commented 10 years ago

I made some progress with an app, but then I started integrating OAuth and am now waiting for a few pull requests to be merged:

mitar commented 10 years ago

We worked a bit more on this at the PLOS citations hackathon event, and here are a few notes.