howisonlab / softcite-dataset

A gold-standard dataset of software mentions in research publications.

Consistency: annotate implementation & environment as separate entities? #642

Open caifand opened 4 years ago

caifand commented 4 years ago

This example is from #637

The 95% CIs of the difference of percentage changes were evaluated using the <rs type="software">R package proCIs</rs>.

Here "R package proCIs" is annotated as a single entity. In most other cases, however, the package and the environment are annotated separately. For instance:

Hierarchical clustering and heatmap plots were generated with <rs type="software" xml:id="PMC4478705-software-3">R</rs> (<rs corresp="#PMC4478705-software-3" type="creator">R Development Core Team</rs>, <rs corresp="#PMC4478705-software-3" type="version">2012</rs>) using the library '<rs type="software">seriation</rs>'

<rs type="software">Monmlp</rs> is the implementation of ANN in <rs type="software">R</rs>.

Thus, <rs type="software">rgp</rs> is an implementation of GP methods in the <rs type="software">R</rs> environment. <ref type="bibr">29</ref> Package <rs type="software">rgp</rs> results are simple representations of the problem without being exposed to a priori information.

The package <rs type="software">fscaret</rs> allows semiautomatic feature selection, working as a wrapper for the caret package in <rs type="software">R</rs>.

(In the final example, the package name caret is missing an annotation :)

To me it's reasonable to annotate the software environment and the package separately.
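As a rough illustration of what separate annotation looks like downstream, here is a minimal Python sketch. It assumes the annotated sentence can be wrapped in a dummy root element so that it parses as XML; the helper name `group_mentions` and the `<s>` wrapper are hypothetical and not part of the softcite tooling.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:id attribute under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def group_mentions(fragment):
    """Return one record per software <rs>, with attributes linked via corresp."""
    root = ET.fromstring("<s>" + fragment + "</s>")  # dummy wrapper (assumption)
    mentions = []
    by_id = {}
    # First pass: every type="software" span becomes its own entity.
    for rs in root.iter("rs"):
        if rs.get("type") == "software":
            record = {"name": rs.text, "attributes": {}}
            mentions.append(record)
            ident = rs.get(XML_ID) or rs.get("id")
            if ident:
                by_id[ident] = record
    # Second pass: attach creator/version spans to the entity they point at.
    for rs in root.iter("rs"):
        target = rs.get("corresp")
        if target and target.lstrip("#") in by_id:
            by_id[target.lstrip("#")]["attributes"][rs.get("type")] = rs.text
    return mentions

sentence = (
    'Hierarchical clustering and heatmap plots were generated with '
    '<rs type="software" xml:id="PMC4478705-software-3">R</rs> '
    '(<rs corresp="#PMC4478705-software-3" type="creator">R Development Core Team</rs>, '
    '<rs corresp="#PMC4478705-software-3" type="version">2012</rs>) '
    "using the library '<rs type=\"software\">seriation</rs>'"
)

for mention in group_mentions(sentence):
    print(mention)
```

With separate annotation, R and seriation come out as two records, and the creator and version stay attached to R only. If "R package proCIs" is annotated as one span, the environment cannot be recovered this way.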

caifand commented 4 years ago

A more granular case is commands or functions within a programming environment. They are close to individually authored scripts. They are not consistently annotated in the dataset at the moment (although the supporting language environment often is). Some existing examples:

We used the <rs type="software">MATLAB</rs> command <rs type="software">fmin- search</rs> with multiple starting points to compute the maximum likelihood estimate for this value.
 linear regression with robust standard errors using the <rs type="software">STATA</rs> command "cluster (cluster variable)"was used-which relaxes the independence assumption and requires only that the observations should be independent across the clusters (STATA 2013)

Would we want to leave them to crowd judgment?

caifand commented 4 years ago

Similarly, the concern about annotating programming languages may be addressed under this category of issues because:

  1. They are not consistently annotated in the current dataset.
  2. @kermitt2 keeps the ones that refer to an implementation framework in the dataset and excludes the ones that do not serve this function (if I understand correctly).

Then what about Java in this case?

<p>The Java GUI interface of <rs type="software">FastPval</rs> is shown in <ref type="figure">Supplementary Figure S</ref>2a-c. In the 'Method' field, the user can either choose '<rs type="software">FastPval</rs>' or the traditional 'Exact' method to calculate P-values.

Thinking about future annotation, the way we currently include these as valid annotations is still open to subjective interpretation: do people understand the programming language as some sort of framework? They first need to interpret the function of the programming language as implied by the textual context. We can, though, give some examples to prompt such understanding.

caifand commented 4 years ago

The same issue arises for mentions of unnamed "chunks of code" implemented in a certain software environment (borrowed from #637):

Data analysis and model fitting were performed using <rs type="software">custom scripts</rs> written in <rs id="software-1" type="software">Igor Pro</rs> <rs corresp="#software-1" type="version">6</rs> (<rs corresp="#software-1" type="creator">WaveMetrics</rs>).</p>

Second, since <rs type="software">Matlab</rs> <rs id="software-0" type="software">routines</rs> applying Bayesian methods to the spatial lag, spatial error and spatial ...

jameshowison commented 4 years ago

I think the principle applied here should be socio-technical :) Ultimately we are interested in improving credit for software contributions, including motivating sharing and coalescence.

So I see three general categories (which have different names in different ecosystems).

  1. "Included code" Code that is part of some other code, always distributed with it. e.g., the print function in R or Python.
  2. "Distributed code" Code that is separately distributed (e.g., readr in R). Note that this might be via a package manager, a download, or even code printed in an appendix (that could be copied into a file and used by a potential user).
  3. "Personal code" Code that is separate from its platform but not yet distributed (e.g., personal scripts).

I propose that we do not annotate "Included code" as software_name, but we do code "Distributed code" and "Personal code".
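The proposed rule can be written down as a small sketch. This is only an illustration in Python; the enum `CodeCategory` and the function `annotate_as_software` are hypothetical names, not part of any existing annotation scheme or tool.

```python
from enum import Enum

class CodeCategory(Enum):
    INCLUDED = "included code"        # always shipped inside other code, e.g. print in R
    DISTRIBUTED = "distributed code"  # separately distributed, e.g. readr in R
    PERSONAL = "personal code"        # separate from its platform, not yet distributed

def annotate_as_software(category):
    """Proposed rule: skip included code, annotate the other two categories."""
    return category is not CodeCategory.INCLUDED

for category in CodeCategory:
    print(category.value, "->", annotate_as_software(category))
```

The hard part in practice is deciding which category a mention belongs to from the text alone; the sketch only encodes the decision once that judgment has been made.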

jameshowison commented 4 years ago

And programming languages or frameworks should be coded (since they are distributed and should be credited).

kermitt2 commented 4 years ago

The reasons why I separated a programming language introduced as an aspect of the implementation of a mentioned software from a programming language mentioned on its own as a framework are actually very practical:

caifand commented 4 years ago

@kermitt2 Per your second point above, would it make sense to annotate the programming language and the software as two entities? What would be the concern? e.g., technically it would be knotty to have the attribute of one entity as another entity in the serialized corpus?

jameshowison commented 4 years ago

Fan, could you dig out the sociotechnical definition that we came up with surrounding distributed code? Perhaps after the paper is in for review we can improve this situation: for the TagWorks coding, we could add a question about "software framework or language", which would produce a new annotation linked to a specific software mention.

So the first example would become something like:

Data analysis and model fitting were performed using <rs id="software-1" type="software" sub-type="unnamed">custom scripts</rs> written in <rs id="software-1" type="framework">Igor Pro</rs> <rs corresp="#software-1" type="framework-version">6</rs> (<rs corresp="#software-1" type="creator">WaveMetrics</rs>).</p>

Ug. That example shows how complicated this is, since we instantly have framework-version and framework-creator ... Perhaps we should rather implement this as "the code that is actually shared", which is Igor Pro and not the custom scripts. Of course that doesn't really help when there is

Perhaps another approach is to say: software explicitly_depends_on software, so that both the "custom scripts" (software-1) and Igor Pro (software-2) are software (and potentially have all the annotations associated with software), but software-1 also has the attribute explicitly_depends_on="software-2".
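A minimal sketch of how such a relation could be read back out, in Python. The serialization below is an assumption made for illustration: the `explicitly_depends_on` attribute, the ids, and the helper `dependency_pairs` are hypothetical and do not exist in the current corpus or tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical serialization of the proposal: both spans are type="software",
# and the dependent one carries an explicitly_depends_on attribute (assumption).
fragment = (
    'Data analysis and model fitting were performed using '
    '<rs id="software-1" type="software" explicitly_depends_on="software-2">custom scripts</rs> '
    'written in <rs id="software-2" type="software">Igor Pro</rs> '
    '<rs corresp="#software-2" type="version">6</rs> '
    '(<rs corresp="#software-2" type="creator">WaveMetrics</rs>).'
)

def dependency_pairs(fragment):
    """Return (dependent, dependency) name pairs found in one annotated sentence."""
    root = ET.fromstring("<s>" + fragment + "</s>")  # dummy wrapper (assumption)
    # Map each id to the text of the span that carries it.
    names = {rs.get("id"): rs.text for rs in root.iter("rs") if rs.get("id")}
    pairs = []
    for rs in root.iter("rs"):
        dep = rs.get("explicitly_depends_on")
        if dep in names:
            pairs.append((rs.text, names[dep]))
    return pairs

print(dependency_pairs(fragment))
```

In this form version and creator stay ordinary attributes of Igor Pro, so no framework-version or framework-creator types are needed.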
