howisonlab / softcite-dataset

A gold-standard dataset of software mentions in research publications.

Consistency: how much to put into creator? #613

Closed kermitt2 closed 5 years ago

kermitt2 commented 5 years ago

One consistency issue I have observed is the size of the chunk annotated as creator. When the creator is actually the software publisher, its address is included or left out unpredictably.

Examples:

PMC4435018_FJ01,jasleen29,creator,true,"Tom Hall, Ibis Biosciences, Carlsbad, CA"

PMC2927682:

Analysis was conducted using <rs id="software-4" type="software">STATA</rs> 
<rs corresp="#software-4" type="version-number">V.9.2</rs> 
(<rs corresp="#software-4" type="creator">Stata, College Station, Texas, USA</rs>).
... with <rs id="software-2" type="software">Origin</rs> 
<rs corresp="#software-2" type="version-number">7.5</rs> software 
(<rs corresp="#software-2" type="creator">Origin- Lab, Northampton, MA, U.S.A</rs>.) 

versus

PMC2921509:

followed by the Tukey-Kramer post hoc test performed with 
<rs id="software-0" type="software">GraphPad prism</rs> software 
(<rs corresp="#software-0" type="version-number">version 4.0</rs>, 
<rs corresp="#software-0" type="creator">GraphPad Software</rs>, San Diego, CA, USA).
<rs id="software-0" type="software">SPSS</rs> 
<rs corresp="#software-0" type="version-number">ver. 11.0</rs> 
(<rs corresp="#software-0" type="creator">SPSS Inc.</rs>., Chicago, IL, USA) 
was used to evaluate the data.
All the analysis was performed in the <rs id="software-0" type="software">MATLAB</rs> 
environment (<rs corresp="#software-0" type="creator">The MathWorks</rs>, Natick, MA)

jameshowison commented 5 years ago

Thanks. Yes, the address should have been included. Sigh.

We could review them all manually, but there are over 2,500 creator labels. Perhaps there is a reasonable way to look at the text immediately following a creator and check whether the next few words include a geographic entity (i.e., match US state abbreviations or country names). Perhaps a geographic entity recognizer is already available?
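
A minimal sketch of the check described above, assuming the annotations are available as inline `<rs>` markup like the examples in this issue. The function name `trailing_geo_tokens`, the `window` parameter, and the short token lists are hypothetical and deliberately incomplete; a real pass would need full lists of state abbreviations and country names, or an existing geographic entity recognizer.

```python
import re

# Illustrative, incomplete lists (assumptions, not taken from the dataset).
US_STATES = {"CA", "IL", "MA", "TX"}
COUNTRIES = {"USA", "U.S.A", "UK", "Germany", "Japan"}
GEO_TOKENS = US_STATES | COUNTRIES

# Matches one creator annotation, from its opening <rs ...> tag to </rs>.
CREATOR_RE = re.compile(r'<rs\b[^>]*\btype="creator"[^>]*>.*?</rs>', re.DOTALL)


def trailing_geo_tokens(annotated_text, window=6):
    """For each creator span, list the geographic tokens found within the
    next `window` words after its closing </rs> tag."""
    results = []
    for match in CREATOR_RE.finditer(annotated_text):
        tail = annotated_text[match.end():]
        # Drop any following tags so that only plain text is inspected.
        tail = re.sub(r"<[^>]+>", " ", tail)
        words = re.findall(r"[A-Za-z][A-Za-z.]*", tail)[:window]
        hits = [w.rstrip(".") for w in words if w.rstrip(".") in GEO_TOKENS]
        results.append(hits)
    return results


sample = ('(<rs corresp="#software-0" type="creator">GraphPad Software</rs>, '
          'San Diego, CA, USA).')
print(trailing_geo_tokens(sample))  # [['CA', 'USA']]
```

A non-empty list for a creator span suggests that an address follows the annotation and was left outside it. Spans whose address is already inside the annotation return an empty list here and would need a separate check on the annotated text itself.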

jameshowison commented 5 years ago

In general, though, I think the creator label is the least important of all the labels, so perhaps we should prioritize the other labels.

caifand commented 5 years ago

So @kermitt2 does not include the address in creator in the candidate release. May we make it a rule ("do not include the address in your creator annotation") and close this issue now?

jameshowison commented 5 years ago

Yes. The decision was made because addresses can be geolocated as separate entities, so there is no need to include them in creator. Viewed another way, they are metadata about the creator, not part of its name.
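
For anyone wanting to find existing creator labels that break this rule, here is a rough sketch over CSV rows shaped like the `PMC4435018_FJ01` example earlier in this issue. The column order (article, annotator, label, flag, text), the function name `creators_with_address`, and the token list are assumptions for illustration, not the project's actual tooling.

```python
import csv
import io

# Illustrative, incomplete list of geographic tokens (an assumption).
GEO_TOKENS = {"CA", "MA", "IL", "USA", "U.S.A", "Texas"}


def creators_with_address(csv_text):
    """Return creator strings in which a comma-separated part after the
    first is a geographic token, i.e. likely 'no address' rule violations."""
    flagged = []
    for row in csv.reader(io.StringIO(csv_text)):
        # Assumed column order: article, annotator, label, flag, text.
        if len(row) != 5 or row[2] != "creator":
            continue
        parts = [p.strip().rstrip(".") for p in row[4].split(",")]
        # Skip the first part: it is taken to be the creator's own name.
        if any(p in GEO_TOKENS for p in parts[1:]):
            flagged.append(row[4])
    return flagged


rows = 'PMC4435018_FJ01,jasleen29,creator,true,"Tom Hall, Ibis Biosciences, Carlsbad, CA"\n'
print(creators_with_address(rows))
# ['Tom Hall, Ibis Biosciences, Carlsbad, CA']
```

This only flags candidates for review; deciding where the name ends and the address begins (for instance, whether "Ibis Biosciences" belongs to the creator) still needs a human decision.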