howisonlab / softcite-dataset

A gold-standard dataset of software mentions in research publications.

Creator as part of the bibliographical reference callout #612

Open kermitt2 opened 5 years ago

kermitt2 commented 5 years ago

Annotations of creator within a bibliographical reference callout seem random:

10.1002%2Fpam.22030.software-mention.xml

Data come from the Integrated Public Use Microdata Series 
(<rs type="software">IPUMS</rs>) database (Ruggles et al., 2010). 

10.1007%2Fs00191-010-0188-y.software-mention.xml

... the estimation method employed here is based on the 'recursive conditioning simulator' 
implemented for <rs type="software">STATA</rs> by Cappellari and Jenkins (2003). 

10.1007%2Fs10290-016-0264-y.software-mention.xml

with a maturity of 10 years found in the 
<rs type="software">Statistical Data Warehouse</rs> 
of the European Central Bank (2014). 

10.1007%2Fs10663-015-9287-1.software-mention.xml

... using <rs type="software">OpenBugs</rs> program of Meyer and Yu (2000).

versus

10.1007%2Fs10683-017-9548-x.software-mention.xml

using the recruitment software <rs id="software-2" type="software">ORSEE</rs> 
(<rs corresp="#software-2" type="creator">Greiner</rs> 
<rs corresp="#software-2" type="version-date">2015</rs>). All sessions were 
programmed with the <rs id="software-3" type="software">z-Tree</rs> 
(<rs corresp="#software-3" type="creator">Fischbacher</rs> 
<rs corresp="#software-3" type="version-date">2007</rs>) software. 

-> tagging version-date using the publication year of the introduced bibliographical reference is not very well-founded imho.

10.1007%2Fs10258-013-0091-1.software-mention.xml

using the package <rs id="software-2" type="software">MulCom</rs> of 
<rs corresp="#software-2" type="creator">Hansen and Lunde</rs> 
(<rs corresp="#software-2" type="version-date">2010</rs>) written in 
<rs id="software-3" type="software">Ox</rs> 
(<rs corresp="#software-3" type="creator">Doornik</rs> 
<rs corresp="#software-3" type="version-date">2006</rs>). 

10.1007%2Fs11166-011-9127-z.software-mention.xml

The experiment was programmed in <rs id="software-1" type="software">Z-tree</rs> 
(<rs corresp="#software-1" type="creator">Fischbacher</rs> 
<rs corresp="#software-1" type="version-date">2007</rs>). 

jameshowison commented 5 years ago

Hmmm, yes, I see the issue here. Certainly the dates in a bibliographical reference callout should not be coded as version-date. So we could programmatically remove any version-date codes that overlap with a bibliographical reference callout (assuming that the TEI XML marks the callouts in some way?).
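
A minimal sketch of that removal, under assumptions not confirmed in this thread: callouts are wrapped in `<ref type="bibr">` (as in Grobid's TEI output), the `<rs>` annotations sit directly inside them, the XML carries no namespace, and each `<rs>` contains only text. The function name `drop_rs_in_callouts` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def unwrap(parent, child):
    """Remove `child` from `parent` but keep its text and tail in place.

    Assumes `child` holds only text (no nested elements).
    """
    index = list(parent).index(child)
    kept = (child.text or "") + (child.tail or "")
    if index == 0:
        parent.text = (parent.text or "") + kept
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + kept
    parent.remove(child)


def drop_rs_in_callouts(root, rs_type):
    """Unwrap every <rs type=rs_type> that is a direct child of a callout."""
    removed = 0
    # Collect the callouts first so the tree is not modified while iterating.
    callouts = [ref for ref in root.iter("ref") if ref.get("type") == "bibr"]
    for ref in callouts:
        for rs in [e for e in list(ref) if e.tag == "rs" and e.get("type") == rs_type]:
            unwrap(ref, rs)
            removed += 1
    return removed


# Hypothetical input: the ORSEE excerpt with an assumed <ref type="bibr"> wrapper.
sample = (
    '<p>using the recruitment software '
    '<rs id="software-2" type="software">ORSEE</rs> '
    '(<ref type="bibr"><rs corresp="#software-2" type="creator">Greiner</rs> '
    '<rs corresp="#software-2" type="version-date">2015</rs></ref>).</p>'
)
root = ET.fromstring(sample)
removed = drop_rs_in_callouts(root, "version-date")
print(removed, ET.tostring(root, encoding="unicode"))
```

The text of the sentence is left untouched; only the `version-date` annotation disappears. Calling the same function with `"creator"` would drop creator tags in the same way.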

OTOH, it is appropriate to code the names in a bibliographical reference callout as creator (on the logic that having their names there gives them credit for the code). So that seems harder to fix. Perhaps programmatically, though, we could identify any bibliographical reference callout in a sentence with a mention (i.e. a software name). Then we could either a) auto-code any names in those callouts as creator, or b) manually review them to determine which are actually the creators of the software.
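
As a rough sketch of the identification step, assuming again that callouts appear as `<ref type="bibr">` and that the XML has no namespace: the helper below (a hypothetical name, `callouts_near_mentions`) pairs software mentions with the callouts that share a text unit. It uses the paragraph as the unit for simplicity; if the files mark sentences with their own element, that tag could be passed as `unit_tag` instead.

```python
import xml.etree.ElementTree as ET


def callouts_near_mentions(root, unit_tag="p"):
    """List (software names, callout texts) for each unit containing both."""
    rows = []
    for unit in root.iter(unit_tag):
        software = [
            "".join(rs.itertext())
            for rs in unit.iter("rs")
            if rs.get("type") == "software"
        ]
        callouts = [
            "".join(ref.itertext())
            for ref in unit.iter("ref")
            if ref.get("type") == "bibr"
        ]
        if software and callouts:
            rows.append((software, callouts))
    return rows


# Hypothetical input: the IPUMS excerpt with an assumed <ref type="bibr"> wrapper,
# plus a paragraph that has no software mention and should be skipped.
sample = (
    "<body>"
    "<p>Data come from the Integrated Public Use Microdata Series "
    '(<rs type="software">IPUMS</rs>) database '
    '(<ref type="bibr">Ruggles et al., 2010</ref>).</p>'
    "<p>All sessions were run in 2015.</p>"
    "</body>"
)
rows = callouts_near_mentions(ET.fromstring(sample))
for software, callouts in rows:
    print(software, callouts)
```

The resulting list could feed either option: auto-coding (a) or a review queue for the coders (b).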

For creator, though, I deeply suspect that without looking at the cited article specifically, a great many of the names in bibliographical reference callouts would not have been marked as creators. I.e., the information needed to decide whether a name in any particular bibliographical reference callout is a creator lies outside the information given to the machine learning system (because we're not actually giving it the text of the cited paper, which the coders read/skimmed to make their decision). So that argues for dropping any creator tags in a bibliographical reference callout.

In terms of what we need for CiteAs etc., I think the creator is low priority (since we can work via the Grobid-recognized bibliographical reference callout), so I think the best route to consistency is to drop the creator tags in a bibliographical reference callout.
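
The excerpts quoted in this issue show no callout markup, so if the files turn out not to mark callouts, one possible fallback is a heuristic that only flags candidates for review. The sketch below treats a creator annotation as callout-like when the next sibling is a `version-date` with the same `corresp` whose text is a bare four-digit year, separated by nothing more than whitespace, "(" or ",". This rule and the function names are assumptions for illustration; it could also catch genuine creator plus release-year annotations, so it flags rather than deletes.

```python
import re
import xml.etree.ElementTree as ET

# A bare publication-style year such as "2010" or "2010a".
YEAR = re.compile(r"^\d{4}[a-z]?$")


def looks_like_callout(creator, following):
    """True when `creator` and the next sibling read like "Author (Year)"."""
    if following is None or following.tag != "rs":
        return False
    if following.get("type") != "version-date":
        return False
    if creator.get("corresp") != following.get("corresp"):
        return False
    between = (creator.tail or "").strip()
    year_text = (following.text or "").strip()
    return between in ("", "(", ",") and bool(YEAR.match(year_text))


def flag_callout_like_creators(root):
    """Return (creator text, year text) pairs that look like callouts."""
    flagged = []
    for parent in root.iter():
        children = list(parent)
        for i, child in enumerate(children):
            if child.tag == "rs" and child.get("type") == "creator":
                following = children[i + 1] if i + 1 < len(children) else None
                if looks_like_callout(child, following):
                    flagged.append((child.text, following.text))
    return flagged


# The MulCom / Ox excerpt from this issue, wrapped in a <p> for parsing.
sample = (
    '<p>using the package <rs id="software-2" type="software">MulCom</rs> of '
    '<rs corresp="#software-2" type="creator">Hansen and Lunde</rs> '
    '(<rs corresp="#software-2" type="version-date">2010</rs>) written in '
    '<rs id="software-3" type="software">Ox</rs> '
    '(<rs corresp="#software-3" type="creator">Doornik</rs> '
    '<rs corresp="#software-3" type="version-date">2006</rs>).</p>'
)
flagged = flag_callout_like_creators(ET.fromstring(sample))
print(flagged)
```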