pharmaR / riskmetric

Metrics to evaluate the risk of R packages
https://pharmar.github.io/riskmetric/

re-refine weighting strategy #310

Open pawelru opened 10 months ago

pawelru commented 10 months ago

Based on the available documentation (as well as my own experiments), I understand that each metric component has the same weight. Of course this can be overridden by end users, but I suspect most users will just stick with the default. This has an indirect consequence: metrics that are "close" to each other can strengthen one aspect through an "overcrowding" effect, at the cost of lowering the impact of all the others.

Let me give an example that obviously does not exist in the codebase but illustrates nicely what I mean. Assume that I currently have 20 criteria, and I add 80 new criteria that analyse download counts over 80 different time windows. Download metrics are important risk metrics and everyone agrees they should be included. But if I do this, my final risk metric will be driven mostly by download values.
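To make the arithmetic concrete, here is a purely illustrative base R sketch (not riskmetric code) of how equal weighting lets a cluster of near-duplicate metrics dominate the aggregate:

```r
set.seed(1)
original  <- runif(20)          # 20 unrelated criteria, scores in [0, 1]
downloads <- rep(runif(1), 80)  # 80 near-duplicate download criteria

mean(original)                  # roughly what the overall score looks like today
mean(c(original, downloads))    # after adding the download cluster:
                                # 80 of the 100 equal weights now belong
                                # to a single aspect, so it dominates
```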

The above example is very unrealistic, but it reflects what is happening right now, obviously to a lesser extent. Currently I can name a few clusters of "similar" metrics:

And don't get me wrong: I'm not questioning the existence of the checks, only their weight values.

If you ask me for a suggestion, I don't really have a good one. This would probably require something like a PCA on metrics already calculated for a set of packages. But that implies hardcoded weight values, a fairly painful process for adding new metrics, etc.
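As a rough sketch of that idea, assuming a hypothetical `scores` matrix of already-computed metric values (packages in rows, metrics in columns) and using only base R's prcomp(); the loading-based weighting below is just one debatable choice, not a riskmetric feature:

```r
# Hypothetical input: one row per package, one column per metric, values in [0, 1].
set.seed(42)
scores <- matrix(runif(500), nrow = 100, ncol = 5,
                 dimnames = list(NULL, paste0("metric_", 1:5)))

# PCA on the metric scores; correlated metrics load onto shared components.
pca <- prcomp(scores, center = TRUE, scale. = TRUE)

# One possible way to derive weights: give each metric credit proportional
# to its loading on the first principal component, then normalise.
w <- abs(pca$rotation[, 1])
w <- w / sum(w)
w
```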

pawelru commented 10 months ago

Yet another way (probably more feasible to implement, but it requires more thought as well as common agreement) is to predefine a set of categories (such as documentation, metadata, static code analysis, dependencies, adoption, etc.), assign each category a weight, and then link each current risk criterion to a category. As a consequence, in the above example, my 80 new download criteria would all fall into a single, say, "adoption" category and would not lower the impact of the existing criteria in other categories.
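A minimal sketch of that two-level weighting, with hypothetical metric and category names (this is not current riskmetric behaviour):

```r
# Hypothetical per-metric scores and a metric -> category mapping.
scores <- c(has_vignettes = 1, has_news = 0, downloads_1yr = 0.9,
            downloads_6mo = 0.8, downloads_1mo = 0.7)
category <- c(has_vignettes = "documentation", has_news = "documentation",
              downloads_1yr = "adoption", downloads_6mo = "adoption",
              downloads_1mo = "adoption")

# Equal weight per category, split equally among the metrics inside it,
# so adding more "adoption" metrics never inflates the adoption share.
cat_weight <- 1 / length(unique(category))
weights <- as.numeric(cat_weight / table(category)[category])

sum(weights * scores)   # overall score: 50% documentation, 50% adoption
```

Splitting weight first across categories and only then across the metrics inside each one means a new metric dilutes only its own category, which is exactly the property the 80-downloads example lacks.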

dgkf commented 10 months ago

You're totally right to identify this as an issue. For as long as the project has existed, we've struggled to settle on the "right" algorithm.

This might be a bit counterintuitive, but in the absence of a consensus I think having a sloppy algorithm is a feature. Precisely because it probably doesn't match your intuition, it has the intended and desirable side effect of prompting users to look closely at the metrics themselves.

That said, we have a lot of ongoing work to drive us toward something more mature.

  1. We have the riskscore package, which @AARON-CLARK has already used to do some CRAN-wide, data-driven assessment of what "good" looks like.
  2. In the repositories workstream, we're piloting the use of custom package filters to embed scoring criteria as part of fetching available.packages() (a rough sketch of that hook follows below).

Both have the shared goal of making these criteria something that we can form consensus around, and from there we can make a more sensible default.
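For reference, base R's available.packages() already accepts user-supplied filter functions, which is the hook that kind of piloting builds on. Below is a minimal sketch with a hypothetical risk_scores lookup; it is not the actual pilot code:

```r
# Hypothetical lookup of pre-computed package risk scores (0 = low risk).
risk_scores <- c(pkgA = 0.2, pkgB = 0.9)

# A custom filter: drop rows of the available-packages matrix whose
# pre-computed risk exceeds a threshold; packages without a score are kept.
low_risk_filter <- function(db) {
  risk <- risk_scores[db[, "Package"]]
  db[is.na(risk) | risk <= 0.5, , drop = FALSE]
}

# available.packages() accepts user-defined filter functions; add = TRUE
# appends them to the default filters instead of replacing them.
av <- available.packages(filters = list(add = TRUE, low_risk_filter))
```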

emilliman5 commented 10 months ago

Thank you for raising this point. I think we do need to put more effort into this aspect of our pipeline (we have mostly focused on the front half: caching metadata and assessments). A first step is to educate users on custom weighting (e.g. with a vignette) and maybe to add some functionality that makes creating custom weights easier. We can also surface messages about weighting more prominently to prompt users into a deeper dive.
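To sketch what such a vignette might show: run the existing pkg_ref()/pkg_assess()/pkg_score() pipeline, then apply user-chosen weights by hand. The weight vector, the manual aggregation, and the assumption that each metric ends up as a numeric column of the score tibble are illustrative, not an existing riskmetric interface:

```r
library(riskmetric)
suppressPackageStartupMessages(library(dplyr))

# Existing riskmetric pipeline: reference a package, assess it, score the assessments.
scored <- pkg_ref("riskmetric") %>%
  pkg_assess() %>%
  pkg_score()

# Assumption: each metric is one numeric column of the resulting tibble.
metric_cols <- names(scored)[vapply(scored, is.numeric, logical(1))]

# Illustrative custom weights: down-weight one metric, up-weight another,
# default everything else to 1.
custom_weights <- c(downloads_1yr = 0.25, has_vignettes = 2)
w <- ifelse(metric_cols %in% names(custom_weights),
            custom_weights[metric_cols], 1)

# Manual weighted mean of the per-metric scores for the first (only) package.
vals <- unlist(scored[1, metric_cols])
keep <- !is.na(vals)
sum(w[keep] * vals[keep]) / sum(w[keep])
```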

I agree with @dgkf that keeping the algorithm sloppy in the absence of consensus is a good thing, for now. But with riskscore we can do/provide some interesting analysis of the metrics and of how different weighting schemes affect a risk score.

Here is a bit of a todo on this topic: