pharmaR / riskscore

A data package for cataloging `riskmetric` results across public repositories

Initial Discussion #1

Open AARON-CLARK opened 1 year ago

AARON-CLARK commented 1 year ago

So it doesn't get lost, I thought I'd tie this repo to its original brainstorming session: https://github.com/pharmaR/pharmaR/issues/19

If @dgkf & @emilliman5 aren't opposed, I'm happy to open a PR that builds out the package infrastructure a bit and populates it with some initial work I did to get us started. Some items may just serve as placeholders until they're replaced with something more permanent. Let me know.

dgkf commented 1 year ago

Totally fine by me!

The one place where I would like to have a bit of input before we introduce anything into the main branch would be regarding the data file format, just to make sure we're aligned in terms of expectations for file size and what data is included. I'd like to avoid a situation where the repo becomes burdensome to work with due to having lots of large files in its trunk.

emilliman5 commented 1 year ago

> The one place where I would like to have a bit of input before we introduce anything into the main branch would be regarding the data file format, just to make sure we're aligned in terms of expectations for file size and what data is included.

My proposal for the published data object:

| Column | Type |
| --- | --- |
| package name | character |
| version | character |
| pkg score | numeric |
| metric score 1 | numeric |
| ... | numeric |
| metric score n | numeric |

We also need to record the date scores were generated, version of riskmetric used and its dependencies.

1. This could be a second data.frame.
2. We could set this information as attributes on the above data.frame.
3. We could create fields and add them to the above data.frame.

My first preference is option 3. I think keeping all the information together is simplest; however, this could restrict us if we want to change what metadata we capture in the future.
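For comparison, here is a minimal base R sketch of the three options. All package names, scores, the column name `metric_one`, the date, and the version string `"0.0.0"` are made-up placeholders, not values from this thread.

```r
# Hypothetical scores table following the proposed columns (values are made up).
scores <- data.frame(
  package    = c("pkgA", "pkgB"),
  version    = c("1.0.0", "0.2.1"),
  pkg_score  = c(0.42, 0.77),
  metric_one = c(0.5, 0.9),
  stringsAsFactors = FALSE
)

# Hypothetical run metadata; both values are placeholders.
run_date <- as.Date("2020-01-01")
riskmetric_version <- "0.0.0"

# Option 1: a second data.frame with one row per scoring run.
run_info <- data.frame(
  run_date           = run_date,
  riskmetric_version = riskmetric_version,
  stringsAsFactors   = FALSE
)

# Option 2: attributes attached to the scores data.frame.
scores_attr <- scores
attr(scores_attr, "run_date") <- run_date
attr(scores_attr, "riskmetric_version") <- riskmetric_version

# Option 3: extra columns, with the single value recycled to every row.
scores_cols <- scores
scores_cols$run_date <- run_date
scores_cols$riskmetric_version <- riskmetric_version
```

One trade-off worth testing before committing to option 2: attributes may not survive every transformation applied to a data.frame, whereas columns travel with the rows.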

As to size, the data.frame described above is < 1 MB as a compressed RDS.

dgkf commented 1 year ago

> We also need to record the date scores were generated, version of riskmetric used and its dependencies.

If these are RData objects, I think that character vectors get run-length-encoded, so having columns for riskmetric version and date, even if they're repeated in every row, would add very little to the file size, since it's effectively just recording that a single string applies to all rows.
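The size question can be checked empirically. The sketch below writes a made-up table with and without two constant metadata columns and compares the file sizes. It relies on `saveRDS()` applying gzip compression by default; it does not verify the run-length-encoding explanation itself, and the row count and values are arbitrary assumptions.

```r
set.seed(1)
n <- 10000  # arbitrary row count for illustration

# Made-up scores table.
scores <- data.frame(
  package   = sprintf("pkg%05d", seq_len(n)),
  pkg_score = runif(n),
  stringsAsFactors = FALSE
)

# Same table with two constant metadata columns repeated in every row.
with_meta <- scores
with_meta$riskmetric_version <- "0.0.0"    # placeholder version string
with_meta$run_date <- "2020-01-01"         # placeholder date

f_plain <- tempfile(fileext = ".rds")
f_meta  <- tempfile(fileext = ".rds")
saveRDS(scores, f_plain)     # compress = TRUE (gzip) is the default
saveRDS(with_meta, f_meta)

size_plain <- file.size(f_plain)
size_meta  <- file.size(f_meta)
cat("without metadata columns:", size_plain, "bytes\n")
cat("with metadata columns:   ", size_meta, "bytes\n")
```

Running the same comparison with `compress = FALSE` would show how much of the saving comes from compression rather than from the serialization format.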

If we run with a more agnostic format like csv, then I'd lean toward a separate dataset that can be merged in.
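A minimal sketch of the CSV variant, assuming a hypothetical `run_id` key that links each score row to one row of run metadata. The key name, file layout, and all values are illustrative assumptions.

```r
# Made-up scores, each row tagged with the run that produced it.
scores <- data.frame(
  run_id    = c(1L, 1L),
  package   = c("pkgA", "pkgB"),
  pkg_score = c(0.42, 0.77),
  stringsAsFactors = FALSE
)

# Made-up run metadata, one row per scoring run.
runs <- data.frame(
  run_id             = 1L,
  run_date           = "2020-01-01",
  riskmetric_version = "0.0.0",
  stringsAsFactors   = FALSE
)

# Write both tables as plain CSV files.
scores_file <- tempfile(fileext = ".csv")
runs_file   <- tempfile(fileext = ".csv")
write.csv(scores, scores_file, row.names = FALSE)
write.csv(runs, runs_file, row.names = FALSE)

# Read them back and merge the metadata in on the shared key.
combined <- merge(
  read.csv(scores_file, stringsAsFactors = FALSE),
  read.csv(runs_file, stringsAsFactors = FALSE),
  by = "run_id"
)
```

Keeping the metadata in its own file means the scores CSV does not repeat the version and date on every row, at the cost of one join for consumers.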

emilliman5 commented 1 year ago

I was leaning towards R data objects. I prefer RDS over RDA, but if we want to make an actual package then I think we have to use RDA to be able to do something like `data(riskscores)`. Are we agreed on making an actual package, or do we just want a data folder with the last x past results and a scripts folder with some code to generate a new set of results?
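For reference, the practical difference between the two formats in base R is sketched below. The object contents are made up, and temporary files stand in for wherever the data would actually live.

```r
# Made-up object; the name matches the data(riskscores) example above.
riskscores <- data.frame(package = "pkgA", pkg_score = 0.42,
                         stringsAsFactors = FALSE)

# RDS stores a single unnamed object; the reader picks the name on load.
rds_file <- tempfile(fileext = ".rds")
saveRDS(riskscores, rds_file)
from_rds <- readRDS(rds_file)

# RDA stores one or more objects together with their names.
rda_file <- tempfile(fileext = ".rda")
save(riskscores, file = rda_file)

# load() restores objects under their saved names and returns those names.
env <- new.env()
loaded_names <- load(rda_file, envir = env)
from_rda <- env$riskscores
```

In a package, an `.rda` file placed in the `data/` directory is what `data(riskscores)` would pick up. `?data` lists the other file types it recognizes, which is worth checking before ruling out alternatives.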

dgkf commented 1 year ago

I would vote for a package, since it would also make it easier for organizations to reuse our scripts for running riskmetric against an internal repo and make it easier to do a comparison of internal packages.

At least in its limited scope, I think the package itself would probably just export a function or two for automating riskmetric and some light data processing, but even at that point I think it's easiest to structure as a package.