softwaresaved / habeas-corpus

A corpus of research software used in COVID-19 research.
MIT License
5 stars, 4 forks

Create a utility that extracts the licenses based on a code repo URL #10

Closed (npch closed this issue 3 years ago)

npch commented 3 years ago

We should be able to get the license directly from the GitHub API.

ha0ye commented 3 years ago

I can handle this; I have past experience interfacing with the GitHub API in R.

This is now in the branch "license-from-github-url".

Specific implementation:

  1. read in data/output/CORD19_sampled_with_repos.csv
  2. look up the GitHub license (bonus: other API info, e.g. contributors list, references)
  3. write out data/output/CORD19_sampled_with_repos_with_github-metadatada.csv
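
For illustration, the three steps above could be sketched as follows. This is a Python sketch, not the R script in the branch. It assumes the GitHub REST endpoint `GET /repos/{owner}/{repo}/license`; the column names `repo_url` and `github_license` and all function names are hypothetical, since the thread does not state the CSV layout.

```python
import csv
import json
import urllib.error
import urllib.request
from urllib.parse import urlparse

API_ROOT = "https://api.github.com"


def license_api_url(repo_url):
    """Map https://github.com/OWNER/REPO to the REST endpoint for its license."""
    parts = [p for p in urlparse(repo_url).path.split("/") if p]
    if len(parts) < 2:
        return None  # URL does not point at a specific repository
    owner, repo = parts[0], parts[1]
    if repo.endswith(".git"):
        repo = repo[:-4]
    return f"{API_ROOT}/repos/{owner}/{repo}/license"


def fetch_spdx_id(repo_url):
    """Return the SPDX identifier GitHub reports, or None if unavailable."""
    api_url = license_api_url(repo_url)
    if api_url is None:
        return None
    request = urllib.request.Request(
        api_url, headers={"Accept": "application/vnd.github+json"}
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            payload = json.load(response)
    except urllib.error.HTTPError:
        return None  # e.g. 404 when GitHub detects no license file
    return (payload.get("license") or {}).get("spdx_id")


def add_licenses(in_path, out_path, url_column="repo_url", lookup=fetch_spdx_id):
    """Read the input CSV, add a license column, and write the output CSV."""
    with open(in_path, newline="", encoding="utf-8") as src:
        rows = list(csv.DictReader(src))
    if not rows:
        return
    for row in rows:
        # `github_license` is a hypothetical output column name.
        row["github_license"] = lookup(row[url_column])
    with open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

The `lookup` parameter lets the CSV handling be exercised without network access. Unauthenticated API requests are rate-limited, so a real run over many repositories would likely need an access token.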

ha0ye commented 3 years ago

Basic functionality is set now, and the output data is updated.

There are some minor wrinkles to resolve; e.g., in one case the link pointed to the organization's GitHub page rather than to a specific repo.
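
One way to flag such cases before calling the API is to classify each link by its path. The sketch below is Python rather than the R used in the branch, and the function name and return labels are hypothetical. It assumes URLs include a scheme such as `https://`.

```python
from urllib.parse import urlparse


def classify_github_url(url):
    """Return 'repo', 'owner', or 'other' for a URL string."""
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    if host not in {"github.com", "www.github.com"}:
        return "other"
    # Non-empty path segments: ['OWNER'] or ['OWNER', 'REPO', ...]
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) >= 2:
        return "repo"
    if len(parts) == 1:
        return "owner"  # organization or user page, no specific repo
    return "other"


print(classify_github_url("https://github.com/softwaresaved/habeas-corpus"))
print(classify_github_url("https://github.com/softwaresaved"))
```

Rows classified as `owner` could then be set aside for manual review instead of being sent to the license lookup. The check is purely syntactic, so special paths on github.com that are not repositories would still need separate handling.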

sdruskat commented 3 years ago

Great. To close this issue, I think we should just merge the license info back into a version of the dataset file, e.g., one based on CORD19_software_popularity_sampled_QA_DOI.csv?
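
A minimal sketch of such a merge, in Python for illustration: it does a left join of the license column onto the dataset rows. The join key `repo_url` and the column name `github_license` are assumptions, since the thread does not say which columns the two files share.

```python
def merge_license_column(dataset_rows, metadata_rows,
                         key="repo_url", column="github_license"):
    """Left join: copy `column` from metadata_rows onto dataset_rows by `key`.

    Both arguments are lists of dicts, e.g. as produced by csv.DictReader.
    Dataset rows with no match keep an empty string in `column`.
    """
    by_key = {row[key]: row.get(column, "") for row in metadata_rows}
    return [{**row, column: by_key.get(row[key], "")} for row in dataset_rows]


# Hypothetical rows, for illustration only.
dataset = [{"repo_url": "https://github.com/a/b", "doi": "x"},
           {"repo_url": "https://github.com/c/d", "doi": "y"}]
metadata = [{"repo_url": "https://github.com/a/b", "github_license": "MIT"}]
merged = merge_license_column(dataset, metadata)
```

A left join keeps every row of the dataset file, so entries without a detected license remain visible rather than being dropped.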

ha0ye commented 3 years ago

@sdruskat I'll make a new PR to run the new file through my script.