pharmaR / riskmetric

Metrics to evaluate the risk of R packages
https://pharmar.github.io/riskmetric/

re-refine weighting strategy #310

Open pawelru opened 10 months ago

pawelru commented 10 months ago

Based on the available documentation (as well as my own experiments), I understand that each metric component has the same weight. Of course this can be overridden by end users, but I suspect most users will just stick with the default. This has an indirect consequence: metrics that are "close" to each other can strengthen one aspect through an "overcrowding" effect, at the cost of lowering the impact of all the others.

Let me give an example that obviously does not exist in the codebase but illustrates nicely what I mean. Assume that I currently have 20 criteria, and I add 80 new criteria that analyse download counts over 80 different time windows. Download metrics are important risk metrics and everyone agrees they should be included. But if I do this, my final risk metric will be driven mostly by download values.
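To make the arithmetic concrete, here is a purely illustrative base R sketch (not riskmetric code) of how equal weighting lets a cluster of near-duplicate metrics dominate the aggregate:

```r
set.seed(1)
original  <- runif(20)          # 20 unrelated criteria, scores in [0, 1]
downloads <- rep(runif(1), 80)  # 80 near-duplicate download criteria

mean(original)                  # roughly what the overall score looks like today
mean(c(original, downloads))    # after adding the download cluster:
                                # 80 of the 100 equal weights now belong
                                # to a single aspect, so it dominates
```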

The above example is very unrealistic, but it reflects what is happening right now, obviously to a lesser extent. Currently I can name a few clusters of "similar" metrics:

And don't get me wrong: I'm not questioning the existence of the checks, only their weight values.

If you ask me for a suggestion, I don't really have a good one. This would probably require something like a PCA on metrics already calculated for a set of packages. But that implies hardcoded weight values, a fairly painful process for adding new metrics, etc.
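As a rough sketch of that idea, assuming a hypothetical `scores` matrix of already-computed metric values (packages in rows, metrics in columns) and using only base R's prcomp(); the loading-based weighting below is just one debatable choice, not a riskmetric feature:

```r
# Hypothetical input: one row per package, one column per metric, values in [0, 1].
set.seed(42)
scores <- matrix(runif(500), nrow = 100, ncol = 5,
                 dimnames = list(NULL, paste0("metric_", 1:5)))

# PCA on the metric scores; correlated metrics load onto shared components.
pca <- prcomp(scores, center = TRUE, scale. = TRUE)

# One possible way to derive weights: give each metric credit proportional
# to its loading on the first principal component, then normalise.
w <- abs(pca$rotation[, 1])
w <- w / sum(w)
w
```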

pawelru commented 10 months ago

Yet another way (probably more feasible to implement, but it requires more thought as well as common agreement) is to predefine a set of categories (such as documentation, metadata, static code analysis, dependencies, adoption, etc.), assign each category a weight, and then link each current risk criterion to a category. As a consequence, in the above example, my 80 new download criteria would all fall into a single, say, "adoption" category and would not lower the impact of the existing criteria in other categories.
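A minimal sketch of that two-level weighting, with hypothetical metric and category names (this is not current riskmetric behaviour):

```r
# Hypothetical per-metric scores and a metric -> category mapping.
scores <- c(has_vignettes = 1, has_news = 0, downloads_1yr = 0.9,
            downloads_6mo = 0.8, downloads_1mo = 0.7)
category <- c(has_vignettes = "documentation", has_news = "documentation",
              downloads_1yr = "adoption", downloads_6mo = "adoption",
              downloads_1mo = "adoption")

# Equal weight per category, split equally among the metrics inside it,
# so adding more "adoption" metrics never inflates the adoption share.
cat_weight <- 1 / length(unique(category))
weights <- as.numeric(cat_weight / table(category)[category])

sum(weights * scores)   # overall score: 50% documentation, 50% adoption
```

Splitting weight first across categories and only then across the metrics inside each one means a new metric dilutes only its own category, which is exactly the property the 80-downloads example lacks.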

dgkf commented 10 months ago

You're totally right to identify this as an issue. For as long as the project has existed, we've struggled to settle on the "right" algorithm.

This might be a bit counterintuitive, but in the absence of a consensus I think having a sloppy algorithm is a feature. Precisely because it probably doesn't match your intuition, it has the intended and desirable side effect of prompting users to look closely at the metrics themselves.

That said, we have a lot of ongoing work to drive us toward something more mature.

  1. We have the riskscore package, which @AARON-CLARK has already used to do some CRAN-wide, data-driven assessment of what "good" looks like.
  2. In the repositories workstream, we're piloting the use of custom package filters to embed scoring criteria as part of fetching available.packages() (a rough sketch of that hook follows below).

Both have the shared goal of making these criteria something that we can form consensus around, and from there we can make a more sensible default.
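For reference, base R's available.packages() already accepts user-supplied filter functions, which is the hook that kind of piloting builds on. Below is a minimal sketch with a hypothetical risk_scores lookup; it is not the actual pilot code:

```r
# Hypothetical lookup of pre-computed package risk scores (0 = low risk).
risk_scores <- c(pkgA = 0.2, pkgB = 0.9)

# A custom filter: drop rows of the available-packages matrix whose
# pre-computed risk exceeds a threshold; packages without a score are kept.
low_risk_filter <- function(db) {
  risk <- risk_scores[db[, "Package"]]
  db[is.na(risk) | risk <= 0.5, , drop = FALSE]
}

# available.packages() accepts user-defined filter functions; add = TRUE
# appends them to the default filters instead of replacing them.
av <- available.packages(filters = list(add = TRUE, low_risk_filter))
```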

emilliman5 commented 10 months ago

Thank you for raising this point. I think we do need to put more effort into this aspect of our pipeline (we have mostly focused on the front half: caching metadata and assessments). A first step is to educate users on custom weighting (e.g. with a vignette) and maybe to add some functionality that makes creating custom weights easier. We can also surface messages about weighting more prominently to prompt users into a deeper dive.
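To sketch what such a vignette might show: run the existing pkg_ref()/pkg_assess()/pkg_score() pipeline, then apply user-chosen weights by hand. The weight vector, the manual aggregation, and the assumption that each metric ends up as a numeric column of the score tibble are illustrative, not an existing riskmetric interface:

```r
library(riskmetric)
suppressPackageStartupMessages(library(dplyr))

# Existing riskmetric pipeline: reference a package, assess it, score the assessments.
scored <- pkg_ref("riskmetric") %>%
  pkg_assess() %>%
  pkg_score()

# Assumption: each metric is one numeric column of the resulting tibble.
metric_cols <- names(scored)[vapply(scored, is.numeric, logical(1))]

# Illustrative custom weights: down-weight one metric, up-weight another,
# default everything else to 1.
custom_weights <- c(downloads_1yr = 0.25, has_vignettes = 2)
w <- ifelse(metric_cols %in% names(custom_weights),
            custom_weights[metric_cols], 1)

# Manual weighted mean of the per-metric scores for the first (only) package.
vals <- unlist(scored[1, metric_cols])
keep <- !is.na(vals)
sum(w[keep] * vals[keep]) / sum(w[keep])
```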

I agree with @dgkf that keeping the algorithm sloppy in the absence of consensus is a good thing, for now. But with riskscore we can do/provide some interesting analysis of the metrics and of how different weighting schemes affect a risk score.

Here is a bit of a todo on this topic: