insightsengineering / thevalidatoR

Github Action that generates R Package Validation documentation 🏁
https://github.com/marketplace/actions/r-package-validation-report
MIT License

Can (or should) we use the riskmetric score in the template? #17

Open epijim opened 2 years ago

epijim commented 2 years ago

@timtreis did the analysis below. Can we get this data built regularly (maybe once a month or once a quarter?) and made available on GitHub as a file? Then we could add to the template a plot showing this package's riskmetric score (calculated now) against the pre-calculated scores for all the other packages, to help people understand how this package compares to the rest of CRAN. A rough sketch of such a job follows the figure.

[image: plot of riskmetric scores across CRAN packages]
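
A sketch of what that periodic scoring job could look like, reusing the riskmetric pipeline posted later in this thread - the cohort, the throttling, and the csv layout are all illustrative assumptions, not a working implementation:

# Sketch only - scores a cohort of packages and writes a csv the template
# could later plot against. The cohort and output layout are placeholders.
library(magrittr)

score_packages <- function(pkgs) {
  pkgs %>%
    riskmetric::pkg_ref() %>%
    dplyr::as_tibble() %>%
    riskmetric::pkg_assess() %>%
    riskmetric::pkg_score()
}

scores <- score_packages(c("visR", "ggplot2", "dplyr"))  # placeholder cohort

# Drop any non-atomic bookkeeping columns before writing (layout assumed)
flat <- scores[vapply(scores, is.atomic, logical(1))]
write.csv(flat, "riskmetric_scores.csv", row.names = FALSE)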

epijim commented 2 years ago

FYI @dgkf - your thoughts? I see this as a really useful way to contextualise riskmetric (e.g. show the plot and display the centile).

@timtreis - have you productionised this already? If not, is it OK if we move it into a GitHub Actions cron job?

epijim commented 2 years ago

Code we used in visR to add the riskmetric score to the readme:

# Compute a single rounded riskmetric score for visR
# (the %>% pipe requires magrittr or dplyr to be attached)
library(magrittr)

riskmetric_score <- "visR" %>%
  riskmetric::pkg_ref() %>%        # resolve the package reference
  dplyr::as_tibble() %>%
  riskmetric::pkg_assess() %>%     # run the individual metric assessments
  riskmetric::pkg_score() %>%      # score each assessment
  dplyr::pull("pkg_score") %>%     # extract the summary score column
  round(2)

dgkf commented 2 years ago

That's absolutely awesome @timtreis! I hadn't seen this analysis yet. Please post it to the pharmaR/riskmetric repo! Happy to dig into the details over there, but I'll hold back from derailing this thread.

In general, I'm reluctant to put too much emphasis on these metrics, because:

  1. The metrics are fuzzy - some can change day-to-day, and their interpretation can be quite subjective.
  2. I'm cautious about embedding too many difficult-to-obtain metrics, especially ones that don't translate well to other software, so as to avoid establishing them as expectations.
  3. The summary score is a simple weighted mean, with the weights only provided as an example (see the sketch after this list). If we do decide to run with embedding riskmetric info, I'd suggest incorporating just the various metric data, not the summary (contributions are welcome to help make this summarizing function more data-driven, but we haven't tried to tackle that yet).
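
To make item 3 concrete: the summary is just a weighted mean of the per-metric scores. A minimal illustration with made-up metric names and weights (riskmetric's actual defaults differ and were never vetted):

# Made-up names and weights, purely to show the arithmetic of the summary
metric_scores <- c(has_vignettes = 1, has_news = 1, covr_coverage = 0.85, r_cmd_check = 0.9)
weights       <- c(has_vignettes = 1, has_news = 1, covr_coverage = 2,    r_cmd_check = 2)
summary_score <- sum(weights * metric_scores) / sum(weights)
round(summary_score, 2)  # 0.92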

I think it's better to make sure the data used in any report is concrete and deterministic, and to leave the fuzzy metrics to organizational decision-making about which concrete assessment results are acceptable. Generally, these metrics require a bit of interpretation and exploration to make sense of in the context of each package. It could be nice to make them easily accessible (e.g. a badge or a vignette on a pkgdown page), but I wouldn't suggest including them in a report - that opens up a lot of questions that are very difficult to automate and that boil down to human decision-making.

epijim commented 2 years ago

Ok - I posted Tim's full message above as I think he was being very cautious (more than I am, 😄).

At a simple level, I guess for v1.0 we just take the output of riskmetric::pkg_score() (as a tibble), pivot it long, and print it as a table - unless riskmetric has some pretty-printing functions built in.
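
A minimal sketch of that v1.0 table, assuming pkg_score() returns a one-row tibble with one numeric column per metric (column layout not verified against the current riskmetric release):

library(dplyr)
library(riskmetric)

"visR" %>%
  pkg_ref() %>%
  as_tibble() %>%
  pkg_assess() %>%
  pkg_score() %>%
  select(where(is.numeric)) %>%   # keep only the metric scores (assumed layout)
  tidyr::pivot_longer(everything(), names_to = "metric", values_to = "score") %>%
  knitr::kable()                  # print as a markdown table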

I still like the idea of pre-rendering riskmetric scores, as a lower priority - if we can go through the full motions (e.g. download all of CRAN... 😟 ?) while acknowledging its limitations. E.g. state up front: applying equal weight to each metric assessed, and assuming they are all 0-1 binary or continuous (if that's how it works 😕 - I haven't read the docs in detail), this mean is only a rough indicator. And knowing where that rough indicator sits among 5,000 other 'decent' packages (CRAN acceptance usually implies a package is decent) is maybe useful?
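
The centile itself is then a one-liner against whatever pre-rendered score file exists - object names here are hypothetical:

# Hypothetical: `cran_scores` is the pre-rendered vector of CRAN summary
# scores, `riskmetric_score` the value computed for this package.
centile <- 100 * mean(cran_scores <= riskmetric_score, na.rm = TRUE)
message("scores higher than ", round(centile), "% of sampled CRAN packages")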

epijim commented 2 years ago

I just assigned myself to this - I'm happy to add that really simple step (printing out how the package does on each metric as a table).

Depending on any follow-on comments, I guess we can keep this open and rename it after that simple table is up - or make a new issue for the cooler ideas @timtreis was looking into (and @dgkf gave great feedback on).

dgkf commented 2 years ago

> state up front: applying equal weight to each metric assessed, and assuming they are all 0-1 binary or continuous

There are some default weights for the default summary function, but a uniform weighting can easily be applied - it can be specified as an argument. I think the defaults give extra weight to coverage and R CMD check, but they were never really vetted as the best representation of risk - they were just provided as an example of how one might consider scoring.
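
For what it's worth, a uniform weighting is just the plain mean of the per-metric scores - reusing the made-up numbers from the earlier sketch (the argument-based route would be the proper way once the current signature is checked):

# Uniform weights reduce the weighted mean to a plain mean
metric_scores <- c(has_vignettes = 1, has_news = 1, covr_coverage = 0.85, r_cmd_check = 0.9)
round(mean(metric_scores), 2)  # 0.94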

One of the things we'd love to do is to run riskmetric against a large cohort of popular packages and ask savvy R users in the industry to anonymously rate a cohort of packages, then use those "truth" values to train a consensus scoring algorithm. This is a longer-term goal that has been kicking around since the start of riskmetric. At that point, I'd feel much more comfortable endorsing a singular score, as long as we can point to the methodology of how it was derived.

epijim commented 2 years ago

> One of the things we'd love to do is to run riskmetric against a large cohort of popular packages and ask savvy R users in the industry to anonymously rate a cohort of packages, then use those "truth" values to train a consensus scoring algorithm. This is a longer-term goal that has been kicking around since the start of riskmetric. At that point, I'd feel much more comfortable endorsing a singular score, as long as we can point to the methodology of how it was derived.

That's very cool. I wonder how we'd structure it beyond maybe a discrete choice experiment (e.g. asking 'which of these two do you trust more')? If you gave me packages and asked me to subjectively rank them in order of trust, or to give each a 'grade', I'd struggle (maybe I'm not savvy though 😆). I think enough pharmas are validating packages internally that you could get maybe 3-4 companies to provide a binary 'we trusted this package as-is' vs 'we required remediation/additional testing' (if you blinded the company), but I guess that's only enough data points to do something descriptive.

A different approach: we could also commission a CRO to independently grade the packages, then use that ranking to evaluate and refine the riskmetric score.

epijim commented 2 years ago

For savvy-user ranking - I feel like a lot of the signal in human rankings would be captured by GitHub org and maintainer name (and also email domain - e.g. @rstudio.com). I'd totally reason: if it's in the r-lib, tidyverse, rstudio, or ropensci orgs etc., it's assumed good (+ base R), and anything in a personal GitHub org is worrying (although some great packages live in personal orgs).

epijim commented 2 years ago

Ok - renamed, and tackling the simpler task in #18.

timtreis commented 2 years ago

Just to chime in: back when I made that figure (and the disclaimer), there seemed to be a bug in riskmetric which (correct me if I'm wrong) @dgkf has already fixed. Currently I'm running a cronjob every 3 h on my personal machine that queries the riskmetric score for visR. That, for example, nicely captures the steep increase in issues we opened while preparing for the workshops.

[image: visR riskmetric score over time]

Theoretically, this could be extended to an arbitrary number of packages, and maybe done for all of CRAN (or a random sample) as well, on a schedule that CRAN doesn't identify as a DDoS. As a start, I could extend this to a small cohort that covers all of Roche/openpharma plus the classics?

But apart from the CRAN scraping (very easy), the riskmetric query - with install and subsequent uninstall (maybe no longer needed) - and the plotting are not yet fully productionized.

timtreis commented 2 years ago

Another thought would be to integrate this into some central https://github.com/openpharma workflow that scores the entire openpharma cohort and hosts the data as a csv in the repo. Then the member packages could pull the data from there, highlight themselves, and paste the figure into their docs?
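
Roughly, a member package's doc build could then do something like this - the URL and column names are placeholders, since no such hosted file exists yet:

# Placeholder URL and column names - illustrative only
scores <- read.csv("https://raw.githubusercontent.com/openpharma/<repo>/main/riskmetric_scores.csv")

hist(scores$pkg_score, breaks = 30,
     main = "riskmetric scores, openpharma cohort", xlab = "pkg_score")
abline(v = scores$pkg_score[scores$package == "visR"], col = "red", lwd = 2)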

timtreis commented 2 years ago

For the "trusted expert cohort rating" idea, one could set up a website that presents the raters with pairwise comparisons and the options

"I trust A more", "I trust B more", "I don't know both packages or can't decide"

and then deconvolute this afterwards.
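
As a strawman for that deconvolution, even a simple win-rate over the pairwise votes gives a first ranking (Bradley-Terry or Elo models would be the more principled options) - the vote data here is invented:

# Invented vote data: each row is one pairwise comparison and its winner
votes <- data.frame(
  a      = c("pkgA", "pkgA", "pkgB"),
  b      = c("pkgB", "pkgC", "pkgC"),
  winner = c("pkgA", "pkgA", "pkgB")
)

pkgs     <- union(votes$a, votes$b)
wins     <- table(factor(votes$winner, levels = pkgs))
played   <- table(factor(c(votes$a, votes$b), levels = pkgs))
win_rate <- setNames(as.numeric(wins) / as.numeric(played), pkgs)
sort(win_rate, decreasing = TRUE)  # crude trust ranking: pkgA > pkgB > pkgC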

epijim commented 2 years ago

Postponing until after 1.0.