howisonlab / softcite-dataset

A gold-standard dataset of software mentions in research publications.
adding missing annotations to corpus #670

Open caifand opened 4 years ago

caifand commented 4 years ago

data/corpus/softcite_corpus.tei.xml contains the following articles that have no software annotations in them:

article_with_no_mention_in_softcite_corpus_tei_xml.csv

10.20955%2Fr%2F2018.1-16
10.1257%2Fjep.4.1.99
PMC4863732
10.1002%2Fpam.22030
10.1257%2Fjep.24.4.187
PMC1538888
PMC4644012
10.1007%2Fs10290-016-0264-y
10.1111%2Fcaje.12091
10.1111%2Fjems.12230
10.1257%2F089533002320951064
PMC3371863
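A check of this kind can be sketched roughly as follows. This is a minimal illustration, not the script that produced the list: it assumes each article is a `<TEI>` element whose `<fileDesc>` carries an `xml:id`, and that software mentions are marked as `<rs type="software">`. Those element and attribute names, and the tiny sample corpus, are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical miniature corpus standing in for data/corpus/softcite_corpus.tei.xml.
# The layout (fileDesc xml:id, rs type="software") is an assumption for this sketch.
SAMPLE = """<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <TEI>
    <teiHeader><fileDesc xml:id="doc-with-mention"/></teiHeader>
    <text><body><p>We used <rs type="software">ExampleTool</rs> here.</p></body></text>
  </TEI>
  <TEI>
    <teiHeader><fileDesc xml:id="doc-without-mention"/></teiHeader>
    <text><body><p>Source: Datastream, own calculations</p></body></text>
  </TEI>
</teiCorpus>"""


def articles_without_software(root):
    """Return the IDs of <TEI> entries that contain no <rs type="software">."""
    missing = []
    for tei in root.iter(f"{{{TEI_NS}}}TEI"):
        has_software = any(
            rs.get("type") == "software" for rs in tei.iter(f"{{{TEI_NS}}}rs")
        )
        if not has_software:
            file_desc = tei.find(f".//{{{TEI_NS}}}fileDesc")
            doc_id = file_desc.get(f"{{{XML_NS}}}id") if file_desc is not None else None
            missing.append(doc_id)
    return missing


print(articles_without_software(ET.fromstring(SAMPLE)))  # ['doc-without-mention']
```

To run it against the real file, `ET.parse("data/corpus/softcite_corpus.tei.xml").getroot()` could replace `ET.fromstring(SAMPLE)`, provided the corpus actually follows the assumed layout.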

Here is a list of the manual annotations in the articles listed above, extracted from our RDF data, to be added to the TEI XML corpus. I've noticed that quite a few of them are annotated as software entities with a low certainty score.

In particular, the articles 10.1257%2F089533002320951064 and PMC4863732 (currently in softcite_corpus.tei.xml) only have annotations that are not software entities (marked as web_platform during our review). So we can actually remove these two articles from softcite_corpus.tei.xml, and perhaps from the other corpora as well.

@kermitt2 Could you help take a look? Thanks!

kermitt2 commented 4 years ago

Hi @caifand !

Thank you for tracking all these removed annotations. I did indeed remove them all during my full corpus review. However, I left the unannotated texts in the TEI XML file, as a way to keep some interesting negative examples.

Here is a quick review of these cases to explain why I removed them:

"The Ethereum network is currently the leader in the field of smart contracts."

Ethereum is a cryptocurrency, like Bitcoin, so I considered that it is similarly not software. Ethereum is a complete infrastructure with virtual machines, a network, etc. Since Bitcoin is not annotated, it would be inconsistent to annotate Ethereum as software.

However, particular software within these infrastructures is annotated: see document 10.20955%2Fr.2018.1-16, where "Bitcoin wallet", which is actually software used to store Bitcoins, is annotated.

"It is written in BASIC, a close analogue to FORTRAN."

As a general rule, I did not annotate programming languages per se (written in BASIC, in FORTRAN, ...), only software tools for a programming language (like a C compiler, a Java virtual machine, etc.).

PubMed MEDLINE is a database/web platform (it refers to the data and the service)

"Data come from the Integrated Public Use Microdata Series (IPUMS) database"

Integrated Public Use Microdata Series (IPUMS) is a database and an online platform.

"Two kinds of web applications have a presence in the market. Some depart- wo kinds of web applications have a presence in the market. Some departments are in institutions that use a university-wide platform,"

Weird stuff... It talks about web platforms in a very generic way; in particular, it does not refer to a specific piece of software.

"already in SCOP90 (SCOP version 1.55,<90%sequence identity non-redundant set)"

What was annotated is "SCOP", which is a database: the Structural Classification of Proteins (SCOP) database.

"The Gram-negative coccobacilli were initially identified as Pasturella pneumotropica by the VITEK 2 system, software version 06.01 (BioMerieux, France) using the GN card, with bionumber 0001010210040001 and an excellent identification (probabil-ity99%)."

This one is tough, I remember! What is annotated is not software; it's a medical/laboratory device that contains software (https://www.biomerieux-usa.com/clinical/vitek-2-healthcare). Device names of this kind are not annotated in the rest of the corpus, so I did the same. The issue with this particular snippet is that the mention of the device includes a software version number (implicitly, the version of the software running on the device).

But other mentions of the same device are not annotated:

"Analysis performed using VITEK 2 (BioMerieux, France) and agar dilution as per the Clinical and Laboratory Standards Institute (1)"

So I prioritized consistency with the rest of the corpus.
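A consistency check along these lines could be sketched as below. It is only an illustration under stated assumptions: the snippet structure (`text` plus a list of `annotated` names) is hypothetical, the texts are shortened from the two VITEK 2 examples above, and matching is plain substring search rather than anything the corpus tooling is known to do.

```python
def unannotated_occurrences(snippets):
    """Return (snippet index, name) pairs where a name that is annotated
    in at least one snippet appears in another snippet without annotation."""
    annotated_names = {name for s in snippets for name in s["annotated"]}
    found = []
    for i, s in enumerate(snippets):
        for name in sorted(annotated_names):
            if name in s["text"] and name not in s["annotated"]:
                found.append((i, name))
    return found


# Hypothetical records: the first keeps the original annotation, the second has none.
snippets = [
    {
        "text": "identified by the VITEK 2 system, software version 06.01",
        "annotated": ["VITEK 2"],
    },
    {
        "text": "Analysis performed using VITEK 2 (BioMerieux, France)",
        "annotated": [],
    },
]

print(unannotated_occurrences(snippets))  # [(1, 'VITEK 2')]
```

Any pair it reports is a place where the corpus treats the same name two different ways, which is the situation that led to removing the annotation here.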

"Source: Datastream, own calculations"

Datastream is a database. There is no mention of how the "own calculations" were made on the data by the author.

"Specifically, the data are CIF imports measured in US$, taken from International Financial Statistics’ Direction of Trade CD-ROM, deflated by U.S. CPI for All Urban Consumers (CPI-U), all items, 1982 to 1984 = 100."

"Direction of Trade CD-ROM" -> not a software

"This information, communicated via WOM or eWOM ,1"

The footnote indicates that this is not software but a generic name for social web platforms talking about commercial products.

1 WOM refers to product-related commentary shared between friends, family, neighbors, etc. Moreover, advances in information technology and the digital revolution both facilitate and amplify the exchange of information on products via social networking sites and other online fora, such as Facebook, Twitter, forums, blogs, etc., referred to as eWOM.

Elsevier "ScienceDirect" is a web platform.

"Table 7 summarizes the TAIR 9 annotations (TAIR, 2009) for allthree groups of a total of 226 predicatedASM regions."

I considered TAIR to be a database (https://www.arabidopsis.org/index.jsp). There are tools/software for exploiting the TAIR database (see the tools on the website), so there are grounds to distinguish them from the database itself.
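The decisions in this review can be summarized as a small lookup. This is a sketch only: the category labels are shorthand invented for the example, not labels taken from the corpus, and the mapping covers just the names discussed in this comment.

```python
# Category assigned to each reviewed name, following the explanations above.
# The label strings themselves are hypothetical shorthand.
REVIEWED = {
    "Ethereum": "infrastructure",
    "Bitcoin wallet": "software",
    "BASIC": "programming_language",
    "FORTRAN": "programming_language",
    "PubMed MEDLINE": "web_platform",  # described as a database/web platform
    "IPUMS": "database",               # also described as an online platform
    "SCOP": "database",
    "VITEK 2": "device",
    "Datastream": "database",
    "ScienceDirect": "web_platform",
    "TAIR": "database",
}


def should_annotate(name):
    """Only entries reviewed as actual software keep a software annotation."""
    return REVIEWED[name] == "software"


kept = [name for name in REVIEWED if should_annotate(name)]
print(kept)  # ['Bitcoin wallet']
```

Under this reading, only "Bitcoin wallet" keeps its annotation; everything else is removed as infrastructure, a programming language, a database, a web platform, or a device.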