epijim opened this issue 3 years ago
FYI @dgkf - your thoughts? I see this as a really useful way to contextualise riskmetric (e.g. show the plot and display the centile).
@timtreis - have you productionised this already? If not, is it ok if we move it into a GitHub Actions cron job?
Code we used in visR to add the riskmetric score to the README:

```r
# Compute visR's overall riskmetric score and round it for display
riskmetric_score <- "visR" %>%
  riskmetric::pkg_ref() %>%
  dplyr::as_tibble() %>%
  riskmetric::pkg_assess() %>%
  riskmetric::pkg_score() %>%
  dplyr::pull("pkg_score") %>%
  round(2)
```
That's absolutely awesome @timtreis! I hadn't seen this analysis yet. Please post it to the pharmaR/riskmetric repo! Happy to dig into the details over there, but I'll hold back from derailing this thread.
In general, I'm reluctant to put too much emphasis on these metrics, because I think it's better to make sure the data used in any report is concrete and deterministic, and to leave the fuzzy metrics for organizational decision-making about which concrete assessment results are acceptable. These metrics generally require a bit of interpretation and exploration to contextualize them for each package. It could be nice to make them easily accessible (e.g. a badge or vignette for a pkgdown page), but I wouldn't suggest including them in a report - it opens up a lot of questions that are very difficult to automate and that boil down to human decision making.
Ok - I posted Tim's full message as I think he was being very cautious (more than I, 😄 ).
At a simple level, I guess for v1.0 we just take the output of riskmetric::pkg_score() (as a tibble), pivot it long, and print it as a table - unless riskmetric has some pretty-printing functions built in.
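A minimal sketch of that simple table, assuming the pkg_score() output is a one-row tibble whose numeric columns are the individual metric scores (the exact column layout may differ between riskmetric versions):

```r
library(dplyr)
library(tidyr)

# Score the package once (same pipeline as the visR README snippet above)
scores <- "visR" %>%
  riskmetric::pkg_ref() %>%
  dplyr::as_tibble() %>%
  riskmetric::pkg_assess() %>%
  riskmetric::pkg_score()

# Pivot the numeric metric columns long and print them as a two-column table
scores %>%
  dplyr::select(where(is.numeric)) %>%
  tidyr::pivot_longer(dplyr::everything(),
                      names_to = "metric", values_to = "score") %>%
  knitr::kable(digits = 2)
```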
I still like the pre-rendered riskmetric scores idea as a lower priority - if we can go through the full motions (e.g. download all of CRAN... 😟?) but acknowledge its limitations. E.g. state that we apply equal weight to each metric assessed, that the metrics are all 0-1 (binary or continuous - if that's what it does 😕, I haven't read the docs in detail), and that the mean is only a rough indicator. And knowing how that rough indicator sits among 5,000 other 'decent' packages (being on CRAN usually implies a package is decent) is maybe useful?
I just assigned myself to this - I'm happy to add that really simple step (print out how it does in each metric as a table).
Depending on any follow-on comments, I guess we can keep this open and rename it after that simple table is up - or make a new issue for the cooler ideas @timtreis was looking into (and @dgkf has been giving great feedback on).
> state, applying equal weight to each metric assessed, and they are all 0-1 binary or continuous
There are some default weights for the default summary function, but we can easily apply a uniform weighting - it can be specified as an argument. I think the defaults give extra weight to coverage and R CMD check, but they were never really vetted as the best representation of summarizing risk - just provided as an example of how one might consider scoring.
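For the equal-weighting point, one option that sidesteps the default summary entirely is to average the per-metric scores ourselves. This is just a sketch, assuming the numeric columns of the pkg_score() tibble (other than pkg_score itself) are the individual 0-1 metric scores:

```r
library(dplyr)

scores <- "visR" %>%
  riskmetric::pkg_ref() %>%
  dplyr::as_tibble() %>%
  riskmetric::pkg_assess() %>%
  riskmetric::pkg_score()

# Uniformly weighted mean of the individual metric scores
uniform_score <- scores %>%
  dplyr::select(where(is.numeric), -dplyr::any_of("pkg_score")) %>%
  rowMeans(na.rm = TRUE) %>%
  round(2)
```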
One of the things we'd love to do is to run riskmetric against a large cohort of popular packages and ask savvy R users in the industry to anonymously rate a cohort of packages, then use those "truth" values to train a consensus scoring algorithm. This is a longer-term goal that has been kicking around since the start of riskmetric. At that point, I'd feel much more comfortable endorsing a singular score, as long as we can point to the methodology of how it was derived.
That's very cool. I wonder how we'd structure it beyond maybe a discrete choice experiment (e.g. throw up 'which of these two do you trust more?'). If you gave me packages and asked for them to be subjectively ranked in order of trust, or given a 'grade', I'd struggle (maybe I'm not savvy though 😆). I think there are enough pharmas validating packages internally that you could get maybe 3-4 companies to provide a binary 'we trusted this package as-is' vs 'we required remediation/additional testing' (if you blinded the company), but I guess that's only enough data points to do something descriptive.
Different approach, but we could also commission a CRO to independently grade the packages? Then use that ranking to evaluate and refine the riskmetric score.
For savvy-user ranking - I feel like a lot of the signal from human rankings will be captured by GitHub org and maintainer name (and also email domain - e.g. @rstudio.com). I'd totally be like: well, if it's in the r-lib, tidyverse, rstudio, or ropensci orgs etc., it's assumed good (+ base R), and anything in a personal GitHub org is worrying (although some great packages are in personal orgs).
Ok - renamed, and executing the simpler topic in #18
Just to chime in - back when I made that figure (and the disclaimer), there seems to have been a bug in riskmetric which (correct me if I'm wrong) @dgkf has already fixed. Currently I'm running a cron job every 3 h on my personal machine that queries the riskmetric score for visR. That, for example, nicely captures the steep increase in issues we opened while preparing for the workshops.
Theoretically, this could be extended to an arbitrary number of packages and maybe done for all (or a random sample) of CRAN as well (on a schedule that they don't identify as DDoSing). As a start, I could extend this to a small cohort that covers all of Roche/openpharma and the classics?
But apart from the CRAN scraping (very easy), the riskmetric query with install and subsequent uninstall (maybe no longer needed), and the plotting, it's not yet fully productionized.
Another thought would be to integrate this into some central https://github.com/openpharma workflow which hosts the data for the entire openpharma cohort as a CSV in the repo. Then the member packages could pull the data from there, highlight themselves, and paste the figure into their docs?
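A rough sketch of what that central workflow could look like - score a cohort on a schedule and append the results to a CSV kept in the repo. The package list, file name, and column layout here are all assumptions:

```r
library(dplyr)
library(purrr)

# Placeholder cohort - in practice this would be the openpharma package list
cohort <- c("visR", "riskmetric")

scores <- purrr::map_dfr(cohort, function(pkg) {
  pkg %>%
    riskmetric::pkg_ref() %>%
    dplyr::as_tibble() %>%
    riskmetric::pkg_assess() %>%
    riskmetric::pkg_score() %>%
    dplyr::transmute(package   = pkg,
                     date      = Sys.Date(),
                     pkg_score = round(pkg_score, 2))
})

# Append to (or create) the CSV that member packages would pull from
out <- "riskmetric_scores.csv"
write.table(scores, out, sep = ",", row.names = FALSE,
            col.names = !file.exists(out), append = file.exists(out))
```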
For the "trusted expert cohort rating" idea, one could set up a website that presents the raters with pairwise comparisons and the options
"I trust A more", "I trust B more", "I don't know both packages or can't decide"
and then deconvolute this afterwards.
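One simple way to deconvolute those pairwise votes into a ranking is an Elo-style update (a Bradley-Terry model would be another option). Everything below - the votes data frame, its columns, and the package names - is hypothetical:

```r
# Turn pairwise "I trust A more" votes into per-package ratings (Elo-style)
elo_ratings <- function(votes, k = 32) {
  pkgs <- unique(c(votes$a, votes$b))
  ratings <- setNames(rep(1500, length(pkgs)), pkgs)
  for (i in seq_len(nrow(votes))) {
    a <- votes$a[i]
    b <- votes$b[i]
    expected_a <- 1 / (1 + 10^((ratings[b] - ratings[a]) / 400))
    outcome_a <- switch(votes$winner[i], a = 1, b = 0, tie = 0.5)
    ratings[a] <- ratings[a] + k * (outcome_a - expected_a)
    ratings[b] <- ratings[b] + k * ((1 - outcome_a) - (1 - expected_a))
  }
  sort(ratings, decreasing = TRUE)
}

# Made-up example votes: winner is "a", "b", or "tie"
votes <- data.frame(
  a      = c("pkgA", "pkgA", "pkgB"),
  b      = c("pkgB", "pkgC", "pkgC"),
  winner = c("a", "a", "tie")
)
elo_ratings(votes)
```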
Postponing until after 1.0.
Re: what @timtreis described - can we get this data building (maybe once a month or once a quarter?) and available on GitHub as a file? Then we can add to the template a plot showing this package's riskmetric score (calculated now) against that pre-calculated score data for all the other packages, to help people understand how this package compares to all of CRAN.
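A sketch of what that template step could look like, assuming the pre-built file is a CSV with package and pkg_score columns (the file name and layout are assumptions):

```r
library(dplyr)
library(ggplot2)

# Pre-calculated scores for the rest of the cohort / CRAN (built on a schedule)
precalc <- read.csv("riskmetric_scores.csv")

# Score this package now
this_score <- "visR" %>%
  riskmetric::pkg_ref() %>%
  dplyr::as_tibble() %>%
  riskmetric::pkg_assess() %>%
  riskmetric::pkg_score() %>%
  dplyr::pull("pkg_score")

# Where does it sit among the pre-calculated scores?
centile <- round(100 * mean(precalc$pkg_score <= this_score, na.rm = TRUE))

ggplot(precalc, aes(x = pkg_score)) +
  geom_histogram(bins = 30) +
  geom_vline(xintercept = this_score, linetype = "dashed") +
  labs(x = "riskmetric pkg_score", y = "number of packages",
       title = sprintf("visR: %.2f (centile %.0f)", this_score, centile))
```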