howisonlab / softcite-dataset

A gold-standard dataset of software mentions in research publications.

Scope for "software" #511

Open kermitt2 opened 5 years ago

kermitt2 commented 5 years ago

Looking at PMC1747177, in softcite_code_applied.csv we have GenomeRNAi as software name:

selection,coder,code,was_code_present,code_label
PMC1747177_MS02,mjsong1201,software_was_used,true,
PMC1747177_MS02,mjsong1201,software_name,true,GenomeRNAi
PMC1747177_MS02,mjsong1201,version_number,false,
PMC1747177_MS02,mjsong1201,version_date,false,
PMC1747177_MS02,mjsong1201,url,false,
PMC1747177_MS02,mjsong1201,creator,false,

GenomeRNAi is a database, so naively I would say it is not software. I have also seen programming languages (Perl, Java, ...) annotated as software. What is the policy for defining software?
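For anyone checking cases like this one, here is a minimal sketch of how the long-format rows could be grouped into one record per mention. It assumes only the five columns shown in the excerpt above; the function name `group_mentions` is hypothetical and not part of the repository.

```python
import csv
import io
from collections import defaultdict

# Rows copied from the excerpt above (long format: one row per code).
SAMPLE = """selection,coder,code,was_code_present,code_label
PMC1747177_MS02,mjsong1201,software_was_used,true,
PMC1747177_MS02,mjsong1201,software_name,true,GenomeRNAi
PMC1747177_MS02,mjsong1201,version_number,false,
PMC1747177_MS02,mjsong1201,version_date,false,
PMC1747177_MS02,mjsong1201,url,false,
PMC1747177_MS02,mjsong1201,creator,false,
"""


def group_mentions(csv_text):
    """Collect the codes marked present for each (selection, coder) pair.

    Hypothetical helper: returns {(selection, coder): {code: code_label}}.
    """
    mentions = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Keep only codes the coder marked as present.
        if row["was_code_present"] == "true":
            key = (row["selection"], row["coder"])
            mentions[key][row["code"]] = row["code_label"]
    return dict(mentions)


mentions = group_mentions(SAMPLE)
print(mentions)
```

Grouping this way makes it easy to list every `software_name` label per paper and review the ones that look like databases or languages.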

jameshowison commented 5 years ago

Here's the guidance through the coding scheme:

https://howisonlab.github.io/softcite-dataset/coding-scheme.html

Students are encouraged to Google the name and so on to make a call. And yes, there are going to be times when that call isn't right :) I wonder how much it's going to matter from the perspective of the machine learning; clearly something in the text made the student select this mention. I wonder if identifying all the mentions (including those other than software) makes the most sense. There is also the certainty field. My thinking is that we turn that into a standardized score for each coder (some say 8 when they are really sure, some say 10) and experiment with dropping the lowest-certainty codes for each coder.
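A rough sketch of the per-coder standardization idea, assuming the certainty values can be loaded as `(coder, mention_id, certainty)` tuples. That input shape, the function names, the use of a z-score, and the `min_z` cutoff are all assumptions for illustration, not something already in the dataset tooling.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_certainty(ratings):
    """Convert raw certainty into a z-score relative to each coder's own habits.

    ratings: list of (coder, mention_id, certainty) tuples (assumed shape).
    Returns {(coder, mention_id): z_score}.
    """
    by_coder = defaultdict(list)
    for coder, _, certainty in ratings:
        by_coder[coder].append(certainty)

    # Mean and population standard deviation per coder.
    stats = {coder: (mean(vals), pstdev(vals)) for coder, vals in by_coder.items()}

    z_scores = {}
    for coder, mention_id, certainty in ratings:
        mu, sigma = stats[coder]
        # A coder who always gives the same score carries no relative signal.
        z_scores[(coder, mention_id)] = 0.0 if sigma == 0 else (certainty - mu) / sigma
    return z_scores


def keep_confident(z_scores, min_z=-1.0):
    """Keep codes whose standardized certainty is at or above min_z."""
    return {key for key, z in z_scores.items() if z >= min_z}


# Toy data: coder_a says 8 when sure, coder_b says 10 when sure.
ratings = [
    ("coder_a", "m1", 8), ("coder_a", "m2", 8), ("coder_a", "m3", 5),
    ("coder_b", "m1", 10), ("coder_b", "m2", 10), ("coder_b", "m3", 7),
]
z = standardize_certainty(ratings)
kept = keep_confident(z)
print(sorted(kept))
```

In the toy data the two coders end up with identical standardized scores for the same mentions even though their raw scales differ, and only each coder's least certain code falls below the cutoff. The cutoff itself would be something to tune experimentally.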

kermitt2 commented 5 years ago

The problem of databases coded as software likely has a pretty big impact on the learning process, because databases are coded inconsistently. The same database will be coded as software in one paper by one annotator and not coded at all in another paper. These examples will add a lot of noise to the learned textual context of "software", which makes it harder to discriminate and generalize these contexts.

For instance, the SCOP database is coded as software in PMC1538888 but it is never coded in PMC1636350.

From what I saw, usually databases are not coded as software and it is probably just a few cases. Here are some cases I spotted so far:

SCOP in PMC1538888
GenomeRNAi in PMC1747177
ASPicDB, ASD, ASAP, ASTALAVISTA, H-DBAS, PDB, the Protein Data Bank in PMC3013677
UniprotKB/Swissprot in PMC3013677
COSMIC in PMC3630384
piRNABank, GEO in PMC4066774
IPI mouse protein database in PMC4959023
PubMed database, Ensemble genome database, lncRNAdb, PomBase, FlyBase, TAIR, DOT in PMC5210605
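One way to surface more cases like these automatically is to look for names that are coded in one document but appear uncoded in the text of another. The sketch below assumes the annotations are available as a mapping from document id to the set of coded names, and the full texts as a mapping from document id to a string. Both shapes, the function name, and the toy documents are hypothetical.

```python
import re


def find_inconsistent(annotations, texts):
    """Report names coded in some documents but present and uncoded in others.

    annotations: {doc_id: set of names coded as software} (assumed shape)
    texts:       {doc_id: full text of the document}      (assumed shape)
    """
    all_names = set().union(*annotations.values()) if annotations else set()
    report = {}
    for name in sorted(all_names):
        # Match the name as a whole token, not as part of a longer word.
        pattern = re.compile(r"(?<!\w)" + re.escape(name) + r"(?!\w)")
        coded = sorted(d for d, names in annotations.items() if name in names)
        uncoded = sorted(
            d for d, text in texts.items()
            if name not in annotations.get(d, set()) and pattern.search(text)
        )
        if uncoded:
            report[name] = {"coded_in": coded, "uncoded_in": uncoded}
    return report


# Toy documents with invented sentences, for illustration only.
annotations = {"doc_A": {"SCOP"}, "doc_B": set()}
texts = {
    "doc_A": "Domains were taken from the SCOP database.",
    "doc_B": "We compared the folds against SCOP.",
}
report = find_inconsistent(annotations, texts)
print(report)
```

Exact string matching will miss variants such as "the Protein Data Bank" versus "PDB", so the report is a starting point for manual review rather than a complete list.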

kermitt2 commented 5 years ago

Are programming languages software? Annotations for them are inconsistent, for instance:

Sometimes the annotation is not very precise, I think:

PMC1635254:

the array platform sequences and wrote the <rs type="software">Perl</rs> scripts used

The software here would be the scripts written in Perl, not Perl the programming language. In the other documents, the mentions of "Perl scripts" are never annotated as a whole or in part.

Programming languages are probably an attribute of a piece of software rather than software per se? That is the kind of attribute information Wikidata holds for a software entity. So, just as we annotate attributes like url or creator, we could have an attribute type for programming language?

jameshowison commented 5 years ago

Yes, I think we can adjust the coding scheme to help a little here. @caifand could you make these changes to the coding scheme and check them in?

  1. One difficult judgement is whether databases should be coded as software. The relevant distinction is whether the text refers to a data collection/dataset (i.e., the data in the database) or to the software that provides access to a dataset. If it is clear that they are referring to the data inside, then mark that as "Dataset" and don't code further. If it is unclear, then mark that as software but consider lower certainty.

  2. Programming languages are themselves software, as well as something used to create software. So if a mention is of a script written in a language, e.g., "We used Perl scripts to ...", then that should be coded as two mentions: one for "Perl" and one for the scripts. The one for the scripts, however, should be coded with no software_name.
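Read as decision rules, the two clarifications above might look like the sketch below. The function names, the dictionary keys, and the three-way `refers_to` input are hypothetical; only the labels "Dataset" and "software" and the `software_name` code come from the thread.

```python
def code_database_mention(refers_to):
    """Rule 1: decide how to code a mention that names a database.

    refers_to: the coder's reading of the text, one of
               "data", "software", or "unclear" (hypothetical values).
    """
    if refers_to == "data":
        # Clearly the data inside: mark as Dataset and stop.
        return {"label": "Dataset", "code_further": False, "lower_certainty": False}
    if refers_to == "software":
        # Clearly the software that provides access to the data.
        return {"label": "software", "code_further": True, "lower_certainty": False}
    # Unclear: code as software, but consider a lower certainty score.
    return {"label": "software", "code_further": True, "lower_certainty": True}


def code_script_mention(language):
    """Rule 2: 'We used Perl scripts to ...' yields two mentions."""
    return [
        {"label": "software", "software_name": language},  # the language itself
        {"label": "software", "software_name": None},      # the scripts, unnamed
    ]


print(code_database_mention("unclear"))
print(code_script_mention("Perl"))
```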

caifand commented 5 years ago

Hi @jameshowison, I've updated the coding scheme but have a question regarding your first clarification. If we suggest people code unclear mentions of databases as "software" with a low certainty score, does that imply that we prefer more "software" entries with varied certainty scores in our training set over entries of other types? Is this meant as a general rule for all confusing mention types, like something that seems to be a piece of hardware, or an algorithm, etc.?