softwaresaved / habeas-corpus

A corpus of research software used in COVID-19 research.
MIT License

Normalize software mentions (a.k.a. data cleaning 🧹) #1

Open sdruskat opened 3 years ago

sdruskat commented 3 years ago

What do we have?

The software packages that have been used for COVID-19 research are contained in the CORD-19 dataset.

The issue

Many of the mentions are different names for the same thing, e.g. 'Statistical Package for Social Sciences (SPSS)', 'SPSS', and 'SPSS Statistics', which appear in different cells of column I.

What do we really need?

How can we achieve this?

Ideas for how exactly we can achieve this

olexandr-konovalov commented 3 years ago

In a hurry, we may do it once in a way that includes manual edits, but ideally this should be automated, so we can run the same procedure for the next version of the dataset.

For example, a Python script that you can re-run multiple times: it should report the entries it can't handle, you adapt it, and then re-run it until it passes with no anomalies left.
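One possible shape for such a re-runnable script, as a minimal sketch only. The alias table, the function name `normalize_mentions`, and the sample mentions are illustrative assumptions and are not taken from the repository.

```python
from collections import Counter

# Hypothetical alias table: lower-cased variant -> canonical name.
# Extend it after each run until nothing is reported as unhandled.
ALIASES = {
    "spss": "SPSS",
    "spss statistics": "SPSS",
    "statistical package for social sciences (spss)": "SPSS",
    "matlab": "MATLAB",
}


def normalize_mentions(mentions):
    """Return (canonical_counts, unhandled_counts) for an iterable of raw mentions."""
    canonical = Counter()
    unhandled = Counter()
    for raw in mentions:
        key = " ".join(raw.lower().split())  # lower-case and collapse whitespace
        if key in ALIASES:
            canonical[ALIASES[key]] += 1
        else:
            unhandled[raw] += 1  # keep the raw spelling so it is easy to find
    return canonical, unhandled


if __name__ == "__main__":
    sample = ["SPSS", "SPSS Statistics", "Matlab", "GraphPad Prism"]
    counts, leftovers = normalize_mentions(sample)
    print("normalized:", dict(counts))
    # Anything listed here needs a new ALIASES entry or an explicit decision.
    for name, n in leftovers.most_common():
        print(f"unhandled: {name!r} ({n})")
```

The loop is then: run, read the unhandled list, add entries to the table, and run again until the list is empty.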

olexandr-konovalov commented 3 years ago

This should also take into account different capitalisations (e.g. Matlab/MATLAB)
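A small sketch of how capitalisation variants could be folded together, assuming the mentions are available as a plain list of strings. The function name `merge_capitalisations` and the choice to label each group with its most frequent spelling are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def merge_capitalisations(mentions):
    """Group mentions that differ only by case and label each group
    with its most common spelling."""
    groups = defaultdict(Counter)
    for raw in mentions:
        groups[raw.casefold()][raw] += 1  # casefold() is a case-insensitive key
    merged = {}
    for spellings in groups.values():
        label = spellings.most_common(1)[0][0]
        merged[label] = sum(spellings.values())
    return merged


print(merge_capitalisations(["MATLAB", "Matlab", "MATLAB", "matlab", "Stata"]))
# {'MATLAB': 4, 'Stata': 1}
```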

orchid00 commented 3 years ago

Hello! @alex-konovalov @sdruskat I did a fair clean, but not all of the clean (there's so many options + typos) 🤯. Top 10:

1. Statistical Package for the Social Sciences (SPSS): 10308
2. R Programming Language (R): 7521
3. GraphPad Prism: 4089
4. Excel: 3800
5. stata: 3271
6. sas: 2448
7. blast: 2233
8. graphpad: 2143
9. matlab: 1780
10. googlescholar: 1688

I started with this file: CORD19_software_popularity.csv. (I should probably have asked which file to start with. Is this file one entry per paper? Probably not.) It should be easy to use another file with names in a column. The file had 102644 rows to start with and the clean file has 84661 🎉

I did this in R, so I have a script to share. Where do you want this?

Examples to continue the clean:

There are many mentions of R packages, RStudio, and Bioconductor that I did not merge into R Programming Language. Should these be merged? For the same reason, I did not merge python with biopython and other python libraries. At this point I need someone to look at the result and say which of these might be good to merge too.

There are still minor rows, with 1 to 6 occurrences each, that I haven't merged yet. I am aware of them, but writing out all the cases takes a while. There might be similar cases with languages I do not know. I did make everything lower case and remove extra spaces and special characters.
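For anyone who wants to reproduce that last step outside R, here is a rough Python sketch. It is not the R script itself: the function name `clean_key` is hypothetical, and replacing special characters with a space (rather than deleting them) is an assumption that may differ from what the script does.

```python
import re


def clean_key(mention):
    """Lower-case, turn special characters into spaces, and collapse whitespace."""
    text = mention.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation and symbols -> space
    text = text.replace("_", " ")          # \w keeps underscores; treat them as separators
    return " ".join(text.split())          # trim and collapse runs of whitespace


print(clean_key("  GraphPad   Prism (v8.0) "))  # graphpad prism v8 0
print(clean_key("SPSS, Inc."))                  # spss inc
```

Keys built this way are only meant for grouping; the display name for each group still has to be chosen separately.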

[Image: Rplot01]

orchid00 commented 3 years ago

Looking at this again, I noticed that GraphPad Prism, Graphpad and Prism are three distinct items, but maybe they should also be merged.
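If these do get merged, the counts could be combined along these lines. This is only a sketch: the merge table and the function name `merge_counts` are hypothetical, and the example uses just the two counts that appear in the top 10 above, since no count is listed there for Prism on its own.

```python
from collections import Counter

# Hypothetical merge table: variant (lower-cased) -> merged name.
MERGE = {
    "graphpad": "GraphPad Prism",
    "prism": "GraphPad Prism",
}


def merge_counts(counts, merge=MERGE):
    """Sum per-name counts after mapping each variant to its merged name."""
    merged = Counter()
    for name, n in counts.items():
        merged[merge.get(name.lower(), name)] += n
    return merged


# Counts for the two variants that appear in the top 10 above.
top = {"GraphPad Prism": 4089, "graphpad": 2143}
print(merge_counts(top))  # Counter({'GraphPad Prism': 6232})
```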