Open AARON-CLARK opened 1 year ago
Totally fine by me!
The one place where I would like to have a bit of input before we introduce anything into the main branch would be regarding the data file format, just to make sure we're aligned in terms of expectations for file size and what data is included. I'd like to avoid a situation where the repo becomes burdensome to work with due to having lots of large files in its trunk.
> The one place where I would like to have a bit of input before we introduce anything into the main branch would be regarding the data file format, just to make sure we're aligned in terms of expectations for file size and what data is included.
My proposal for the published data object:
| Column | Type |
|---|---|
| package name | character |
| version | character |
| pkg score | numeric |
| metric score 1 | numeric |
| ... | numeric |
| metric score n | numeric |
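
To make the proposed layout concrete, here is a minimal sketch of such a data.frame. The column names are just the table entries turned into syntactic R names, and the package names, versions and scores are made-up placeholders, not real riskmetric output.

```r
# Hypothetical rows: names, versions and scores are invented for illustration.
riskscores <- data.frame(
  package_name   = c("pkgA", "pkgB"),
  version        = c("1.0.0", "0.2.1"),
  pkg_score      = c(0.42, 0.77),
  metric_score_1 = c(0.50, 0.90),
  metric_score_n = c(0.30, 0.60),
  stringsAsFactors = FALSE  # keep the character columns as character
)

# Show the column types to compare against the table above.
str(riskscores)
```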
We also need to record the date scores were generated, version of riskmetric used and its dependencies.
My first preference is #3. I think that keeping all information together is simplest; however, this could restrict us if we want to change what metadata we want/need to capture in the future.
As to size, the data.frame described above is < 1 MB as a compressed RDS.
> We also need to record the date scores were generated, version of riskmetric used and its dependencies.
If these are RData objects, I think that character vectors get run-length-encoded, so having columns for `riskmetric version` and `date`, even if it's repeated in every row, would add very little to the file size, since it's effectively just recording that the single string applies to all rows.
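
Whatever the exact mechanism, the size claim is easy to check empirically, since `saveRDS()` compresses with gzip by default and repeated strings compress well. A rough sketch with made-up data (10,000 hypothetical packages, placeholder version and date strings):

```r
# Made-up data: hypothetical package names with random scores.
set.seed(1)
n <- 10000
scores <- data.frame(
  package_name = paste0("pkg", seq_len(n)),
  version      = "1.0.0",
  pkg_score    = runif(n),
  stringsAsFactors = FALSE
)

# Same data with the metadata repeated on every row (placeholder values).
with_meta <- scores
with_meta$riskmetric_version <- "0.1.0"
with_meta$date <- "2020-01-01"

f_plain <- tempfile(fileext = ".rds")
f_meta  <- tempfile(fileext = ".rds")
saveRDS(scores, f_plain)     # compress = TRUE (gzip) is the default
saveRDS(with_meta, f_meta)

size_plain <- file.size(f_plain)
size_meta  <- file.size(f_meta)
cat("without metadata columns:", size_plain, "bytes\n")
cat("with metadata columns:   ", size_meta, "bytes\n")
cat("relative increase:", round((size_meta - size_plain) / size_plain, 3), "\n")
```

The exact numbers will depend on the real data, so this is only a way to sanity-check the assumption before settling on a format.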
If we run with a more agnostic format like csv, then I'd lean toward a separate dataset that can be merged in.
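
For the csv route, a sketch of what "a separate dataset that can be merged in" might look like. The `run_id` key column is an assumption of mine (nothing in the thread specifies how the two files would be linked), and all values are placeholders.

```r
# Hypothetical key column `run_id` links each score row to one metadata row.
scores <- data.frame(
  run_id       = c(1L, 1L, 2L),
  package_name = c("pkgA", "pkgB", "pkgA"),
  pkg_score    = c(0.42, 0.77, 0.40),
  stringsAsFactors = FALSE
)
metadata <- data.frame(
  run_id             = c(1L, 2L),
  date               = c("2020-01-01", "2020-02-01"),
  riskmetric_version = c("0.1.0", "0.1.1"),
  stringsAsFactors = FALSE
)

# Write both tables as plain csv files, as they might sit in the repo.
scores_file   <- tempfile(fileext = ".csv")
metadata_file <- tempfile(fileext = ".csv")
write.csv(scores, scores_file, row.names = FALSE)
write.csv(metadata, metadata_file, row.names = FALSE)

# Read them back and attach the run metadata to every score row.
merged <- merge(
  read.csv(scores_file, stringsAsFactors = FALSE),
  read.csv(metadata_file, stringsAsFactors = FALSE),
  by = "run_id"
)
print(merged)
```

This keeps the metadata stored once per run instead of once per row, at the cost of a join whenever someone needs both.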
I was leaning towards R data objects. I prefer RDS over RDA, but if we want to make an actual package then I think we have to use RDA to be able to do something like `data(riskscores)`. Are we agreed to make an actual package, or do we just want a data folder with the last x number of past results and a scripts folder with some code to generate a new set of results?
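
For anyone weighing the two formats, a small sketch of the practical difference, using a made-up one-row `riskscores` object. As far as I know, `data()` documents support for `.R`, `.RData`/`.rda` and `.tab`/`.txt`/`.csv` files but not `.rds`, which is why the package route points towards RDA.

```r
# Placeholder object standing in for the published scores.
riskscores <- data.frame(
  package_name = "pkgA",
  pkg_score    = 0.42,
  stringsAsFactors = FALSE
)

# RDS stores a single unnamed object; the reader picks the variable name.
rds_file <- tempfile(fileext = ".rds")
saveRDS(riskscores, rds_file)
scores_from_rds <- readRDS(rds_file)

# RDA stores named objects; load() restores them under their original names
# and returns those names as a character vector.
rda_file <- tempfile(fileext = ".rda")
save(riskscores, file = rda_file)
env <- new.env()
loaded_names <- load(rda_file, envir = env)
print(loaded_names)
```

In a package, an `.rda` file under `data/` is what makes a call like `data(riskscores)` work.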
I would vote for a package, since it would also make it easier for organizations to reuse our scripts for running `riskmetric` against an internal repo and make it easier to do a comparison of internal packages.
At least in its limited scope, I think the package itself would probably just export a function or two for automating `riskmetric` and some light data processing, but even at that point I think it's easiest to structure as a package.
So it doesn't get lost, just thought I'd tie this repo to its original brainstorming session: https://github.com/pharmaR/pharmaR/issues/19
If @dgkf & @emilliman5 aren't opposed, I'm happy to open a PR that builds out the package infrastructure a bit and populates it with some initial work I did to get us started. Some items may just serve as placeholders until they get replaced with something more permanent. Let me know.