howisonlab / softcite-dataset

A gold-standard dataset of software mentions in research publications.

Remaining issues/lack of annotation consistency #638

Open kermitt2 opened 5 years ago

kermitt2 commented 5 years ago

Here are some remaining issues we observed in the current annotation scheme:

calculates the area on the binary images that can be produced by <rs type="software">MathWorks</rs>.

MathWorks is the company developing MATLAB (the correct name was introduced at the beginning of the paper, but a strange shift in referring expressions happened in the middle of the paper!). It is hard to decide how to annotate this case.

signals detected using <rs type="software" xml:id="PMC2649809-software-35">Affymetrix microarray suite</rs> version <rs corresp="#PMC2649809-software-35" type="version">5</rs> software (MAS5) for each probe were averaged over 21 caudate nucleus.

We leave "MAS5" (the acronym of "Affymetrix microarray suite") unannotated, although it could be valuable for disambiguation. Currently, software names are always annotated as a single continuous chunk.

As an improvement, we could use non-continuous software name annotation like this:

signals detected using <rs type="software" xml:id="PMC2649809-software-35">Affymetrix microarray suite</rs> version <rs corresp="#PMC2649809-software-35" type="version">5</rs> software (<rs corresp="#PMC2649809-software-35" type="software">MAS5</rs>) for each probe were averaged over 21 caudate nucleus.
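
To illustrate how a consumer of the corpus might read such non-continuous annotations, here is a minimal sketch that groups `<rs>` elements by the entity they point to. It assumes the name variant carries a `corresp` attribute as in the proposal above; the helper name `group_mentions` and the wrapping `<p>` element are only for illustration, and the elements are matched without the TEI namespace that a full corpus file would probably declare.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The proposed annotation from above, wrapped in a <p> so it parses on its own.
SNIPPET = (
    '<p>signals detected using '
    '<rs type="software" xml:id="PMC2649809-software-35">Affymetrix microarray suite</rs>'
    ' version <rs corresp="#PMC2649809-software-35" type="version">5</rs>'
    ' software (<rs corresp="#PMC2649809-software-35" type="software">MAS5</rs>)'
    ' for each probe were averaged over 21 caudate nucleus.</p>'
)


def group_mentions(root):
    """Map each software xml:id to the annotated strings, keyed by rs type."""
    entities = defaultdict(lambda: defaultdict(list))
    for rs in root.iter("rs"):
        # An <rs> either carries the entity's xml:id or points at it via corresp.
        key = rs.get(XML_ID) or (rs.get("corresp") or "").lstrip("#")
        if key:
            entities[key][rs.get("type")].append("".join(rs.itertext()).strip())
    return {key: dict(by_type) for key, by_type in entities.items()}


groups = group_mentions(ET.fromstring(SNIPPET))
print(groups)
```

With this grouping, "MAS5" ends up as a second name variant of the same software entity rather than as a separate, unlinked mention.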

We used the <rs type="software">MATLAB</rs> command <rs type="software">fmin- search</rs> with multiple starting points to compute the maximum likelihood estimate for this value.

Thus, linear regression with robust standard errors using the <rs type="software">STATA</rs> command "cluster (cluster variable)"was used-which relaxes the independence assumption and requires only that the observations should be independent across the clusters (STATA 2013).

We have seen such commands annotated in several ways: sometimes as a separate software entity (as in the first example above), sometimes merged with the framework into a single annotation, and sometimes left out so that only the framework is annotated (as in the second example above). This case is not frequent, and we have not yet fixed an annotation rule for it.

<rs corresp="#PMC0000000-software-1" type="creator">Microsoft</rs> <rs type="software" xml:id="PMC0000000-software-1">Excel</rs>

However, we have not yet handled the "GraphPad Prism" case, where the software is actually named Prism and its publisher is GraphPad, so it should normally be annotated like the "Microsoft Excel" case.

<rs id="software-1" type="software">GraphPad Prism</rs> <rs corresp="#software-1" type="version">5</rs> software (<rs corresp="#software-1" type="creator">GraphPad Software, Inc</rs>., La Jolla, CA, USA).

Similarly, "Lotus Notes" is always identified as such, and not as "Notes" from Lotus Inc. (it is now called IBM Notes, but that's another story). So inconsistencies remain here for the moment.

caifand commented 5 years ago

Several responses here:

kermitt2 commented 5 years ago

Hi @caifand

About the first point, I went through the creators: out of 1120 creator annotations, only 15 are "person" creators (1.3%). I marked them with an attribute @subtype="person" in the "packaged" format (https://github.com/Impactstory/software-mentions/blob/master/resources/dataset/software/corpus/all.clean.tei.xml). I could run entity-fishing on the "software publishers" and try to link them with Wikidata (it might be richer than arcGIS and will take just a few seconds).

caifand commented 5 years ago

Cool, thanks! By the way, how do you work with tei xml? In python?

kermitt2 commented 5 years ago

Yes, Python has nice libraries for reading and manipulating XML (much easier to use than the Java ones, I think). For instance, ElementTree is part of the standard Python library, and lxml is an extra dependency but is more complete.
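
As a rough sketch of what that looks like with ElementTree, the snippet below counts creator annotations and how many of them carry `subtype="person"`. The sample document is made up for illustration (it is not taken from the corpus), the helper names are hypothetical, and the tag matching ignores namespaces so that it works whether or not the file declares the TEI namespace.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Made-up sample, only to show the shape of the annotations.
SAMPLE = f"""<TEI xmlns="{TEI_NS}"><text><body>
<p><rs type="creator" corresp="#s1">Microsoft</rs> <rs type="software" xml:id="s1">Excel</rs></p>
<p><rs type="software" xml:id="s2">ToolX</rs> (<rs type="creator" subtype="person" corresp="#s2">J. Doe</rs>)</p>
</body></text></TEI>"""


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def creator_stats(root):
    """Return (number of creator annotations, number marked subtype='person')."""
    creators = [
        el for el in root.iter()
        if isinstance(el.tag, str)
        and local_name(el.tag) == "rs"
        and el.get("type") == "creator"
    ]
    persons = [el for el in creators if el.get("subtype") == "person"]
    return len(creators), len(persons)


# For a file on disk, ET.parse(path).getroot() would replace ET.fromstring(...).
total, persons = creator_stats(ET.fromstring(SAMPLE))
print(f"{persons} of {total} creators are persons ({persons / total:.1%})")
```

The same loop should work unchanged with lxml, since `lxml.etree` follows the ElementTree API for `iter()` and `get()`.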

That said, working with XML in general remains painful by design ;) But when it comes to representing a complete structured document, XML can't really be avoided imho.