Pull together the normalization information in a dataset

sdruskat commented 3 years ago

What do we have?

A set of normalized software mentions (#1)

The issue

We need some sort of dataset to count mentions according to #2.

What do we really need?

[ ] A dataset to count the mentions on

There are several ways this could look:

A list of all normalized mentions with info on which paper they appeared in
An enriched version of CORD-19 with annotations of the normalized mention per software mention per paper, i.e., the information that Software1 and Software2 are both mentioned in this paper, even if Software1 was actually mentioned as "software one" or "SW 1", and perhaps the count of each mention per paper
A new dataset which reuses information from CORD-19 but presents it in a cleaned-up fashion, and possibly some other format
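
To make the first two options concrete, here is a minimal sketch of counting normalized mentions per paper. The records, the mapping table, and the function names are all made up for illustration; the real input would come from the normalized mentions in #1, whose format is not specified here.

```python
from collections import Counter

# Hypothetical raw records: (paper_id, mention exactly as it appeared in the paper).
raw_mentions = [
    ("paper-A", "Software1"),
    ("paper-A", "software one"),
    ("paper-A", "Software2"),
    ("paper-B", "SW 1"),
]

# Hypothetical normalization table: surface form -> normalized name.
normalization = {
    "Software1": "Software1",
    "software one": "Software1",
    "SW 1": "Software1",
    "Software2": "Software2",
}


def count_mentions(records, mapping):
    """Count occurrences of each normalized name within each paper."""
    per_paper = Counter()
    for paper_id, surface in records:
        # Unknown surface forms are kept as-is rather than dropped.
        normalized = mapping.get(surface, surface)
        per_paper[(paper_id, normalized)] += 1
    return per_paper


def papers_per_software(per_paper):
    """Count in how many distinct papers each normalized name appears."""
    return Counter(name for (_paper_id, name) in per_paper)


per_paper = count_mentions(raw_mentions, normalization)
overall = papers_per_software(per_paper)
```

Keying the counter on (paper, normalized name) keeps both views available: the per-paper count of each mention, and the number of papers mentioning each software.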

How can we achieve this?

Ideas welcome (Jupyter Notebook perhaps?)

olexandr-konovalov commented 3 years ago

+1 for Jupyter. We can have a fully automated and reproducible analysis which downloads the CSV file (or uses a refined dataset kept in the repository) and can be re-run on Binder: https://github.com/rse-standrewscs/python-binder-template

olexandr-konovalov commented 3 years ago

Still, some code should be in .py files, which are easier to keep under version control, test, etc.
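
As a rough sketch of what that split could look like: a small function that would live in a .py module, plus a test for it, with the notebook only importing and calling the function. The module name, the function name, and the CSV column names are assumptions for illustration, not the actual layout of the data.

```python
# mentions.py (hypothetical module name)
import csv
import io
from collections import Counter


def count_normalized_mentions(csv_text):
    """Count rows per normalized name in CSV text.

    Assumes a header row containing a 'normalized' column (hypothetical name).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["normalized"] for row in reader)


# test_mentions.py (hypothetical), written in the style pytest discovers
def test_count_normalized_mentions():
    sample = (
        "paper_id,mention,normalized\n"
        "A,software one,Software1\n"
        "A,SW 1,Software1\n"
        "B,Software2,Software2\n"
    )
    counts = count_normalized_mentions(sample)
    assert dict(counts) == {"Software1": 2, "Software2": 1}
```

Taking the CSV as text rather than a file path keeps the function testable without any download step.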

Obligatory reading is https://doi.org/10.1371/journal.pcbi.1007007

There is also a tool for diffing and merging Jupyter notebooks: https://nbdime.readthedocs.io/

softwaresaved / habeas-corpus