howisonlab / softcite-dataset

A gold-standard dataset of software mentions in research publications.

Additional corrections and rules for consistent annotations #637

Open kermitt2 opened 5 years ago

kermitt2 commented 5 years ago

We describe here the additional principles used to make the annotations more uniform in the release candidate.

using the <rs type="software">Ingenuity Pathways Analysis (IPA)</rs> tool <ref type="figure">(Figure 3</ref>b). 
The user interface was linked to a <rs type="software">My Structured Query Language (MySQL)</rs> database 

This annotation pattern is now systematically used (it was the most frequent pattern already in the existing annotations, but not followed in around one third of the cases).
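A minimal sketch of how one might look for the remaining inconsistent cases, assuming the pattern meant here is the one visible in the two examples above: the parenthesised acronym sits inside the same `<rs type="software">` span as the full name. The function name and the regular expression are hypothetical and are not part of the softcite tooling.

```python
import re

# Hypothetical check: a software span immediately followed by a parenthesised
# acronym that was left OUTSIDE the <rs> element.
OUTSIDE_ACRONYM = re.compile(
    r'<rs\b[^>]*\btype="software"[^>]*>([^<]+)</rs>\s*\(([A-Z][A-Za-z0-9]{1,9})\)'
)


def acronyms_outside_span(fragment: str) -> list:
    """Return (software name, acronym) pairs where the acronym is outside the span."""
    return [(m.group(1), m.group(2)) for m in OUTSIDE_ACRONYM.finditer(fragment)]


# The first string follows the pattern shown above; the second is a made-up
# variant with the acronym left outside the span.
consistent = 'using the <rs type="software">Ingenuity Pathways Analysis (IPA)</rs> tool'
inconsistent = 'using the <rs type="software">Ingenuity Pathways Analysis</rs> (IPA) tool'

print(acronyms_outside_span(consistent))
print(acronyms_outside_span(inconsistent))
```

A regular expression is only a rough filter here: it works on the serialized text and will miss cases where other markup sits between the span and the acronym.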

For instance, in 10.1111/j.1467-8381.2009.02008.x, the annotation

The CGE model we describe in this section is the <rs id="software-1" type="software">GTAP6inGAMS</rs> model developed by <ref type="bibr">Rutherford (2005)</ref>.

was corrected to:

The CGE model we describe in this section is the GTAP6inGAMS model developed by <ref type="bibr">Rutherford (2005)</ref>.

The software that uses the model is called GAMS.

using the <rs type="software">R package proCIs</rs>

The reasons are that the existing annotations were not uniform, and that this pattern facilitates further disambiguation/matching of the software package.

Convertible tablets, for which <rs type="software">Windows</rs> was the dominant operating system, are being overwhelmed by the new popularity of slate tablets. 
Data analysis and model fitting were performed using <rs type="software">custom scripts</rs> written in <rs id="software-1" type="software">Igor Pro</rs> <rs corresp="#software-1" type="version">6</rs> (<rs corresp="#software-1" type="creator">WaveMetrics</rs>).</p>

Second, since <rs type="software">Matlab</rs> <rs id="software-0" type="software">routines</rs> applying Bayesian methods to the spatial lag, spatial error and spatial ...

We considered it worthwhile to detect them, even if they won't be matched against existing software in the populated knowledge base.

For instance, in PMC4927739, cathodoluminescence (CL) is a physical reaction used as a measurement method. CL is wrongly annotated as software in the following:

Start the <rs type="software">CL</rs> control program 

The software is not named; it controls the cathodoluminescence in order to make a measurement.

The same applies to cross-correlation electron backscatter diffraction (ccEBSD), wrongly annotated here:

Open the <rs type="software">EBSD</rs> control software and load the calibration file for the chosen WD. 8. Set up the measurement in the <rs type="software">EBSD</rs> control software according to the operating manual

The software is not named EBSD. There is a software mention ("control software"), but it remains unnamed.

Some wrong annotations of methods as software are obvious (PMC1933234):

Several software were used by different authors to identify these particular repeats but usually a manual discard of background was necessary, and generally some <rs type="software">CRISPR</rs> clusters were missed or neglected, especially the shortest one (less than three motifs).

However, exhaustiveness was hard to achieve in a reasonable time and there are still some databases and ambiguous cases annotated as software in the current version.

<rs corresp="#PMC0000000-software-1" type="creator">Microsoft</rs> <rs type="software" xml:id="PMC0000000-software-1">Excel</rs>
<p>Observed heterozygosity was estimated in Microsoft <rs type="software" xml:id="PMC4103605-software-13">Excel</rs> (<rs corresp="#PMC4103605-software-13" type="creator">Microsoft Corporation</rs>, Redmond, Washington, USA).</p>

This annotation pattern should now be used systematically (previously, in these cases, which of the two "Microsoft" mentions was annotated was essentially random).
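As a minimal sketch of how the `corresp` links in such annotations can be resolved, the snippet below parses the second example with Python's standard `xml.etree.ElementTree` and groups each attribute span (creator, version, ...) under the software span it points to. The function name `link_attributes` and the returned structure are hypothetical and are not part of the softcite tooling.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def link_attributes(fragment: str):
    """Group attribute spans under the software span their `corresp` points to.

    Returns (software, dangling): `software` maps each software id to its name
    and a list of (type, text) attribute pairs; `dangling` lists `corresp`
    values that do not match any software id in the fragment.
    """
    root = ET.fromstring(fragment)
    software = {}
    for rs in root.iter("rs"):
        if rs.get("type") == "software":
            # The examples in this thread use both xml:id and plain id.
            key = rs.get(XML_ID) or rs.get("id")
            if key is not None:
                software[key] = {"name": rs.text, "attributes": []}
    dangling = []
    for rs in root.iter("rs"):
        target = rs.get("corresp")
        if target is None:
            continue
        key = target.lstrip("#")
        if key in software:
            software[key]["attributes"].append((rs.get("type"), rs.text))
        else:
            dangling.append(target)
    return software, dangling


sample = (
    '<p>Observed heterozygosity was estimated in Microsoft '
    '<rs type="software" xml:id="PMC4103605-software-13">Excel</rs> '
    '(<rs corresp="#PMC4103605-software-13" type="creator">Microsoft Corporation</rs>, '
    'Redmond, Washington, USA).</p>'
)

software, dangling = link_attributes(sample)
print(software)
print(dangling)
```

With this pattern each software mention carries at most one linked creator span, so a second, unlinked "Microsoft" in the same sentence stays plain text.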

caifand commented 5 years ago

Regarding the model example, I personally believe GTAP6inGAMS is actually a software package. There does exist a model called GTAP6; GTAP6inGAMS is the implementation of the GTAP6 model in the GAMS environment... Here's their package web page

But truly, a model is often a boundary case of software instantiation. You need to do a lot of background research to get it right, and perhaps read the context carefully to figure out whether, in EACH sentence, it denotes the abstract numeric/statistical model or the implementation in code...

caifand commented 5 years ago

It seems to me that we can keep annotating operating systems the way you handled them in the corpus file, i.e., make it an explicit rule that we only annotate an OS when it is mentioned as an individual case.

Another question: do we want to keep annotating non-named software/scripts? Currently they are not consistently annotated in the dataset. When annotating those in the future, we could leave the software_name of these instances empty but still annotate them as software mentions.

kermitt2 commented 5 years ago

I left non-named software/scripts unannotated because we can't match them against an existing software database, we can't deduplicate them across mentions in several documents, and people can't cite them... so I think it's close to impossible to exploit them, and there is no credit that could be given for such non-named software.