Consider all protein-coding genes in the vicinity of a credible set to extract colocalisation neighbourhood features

As a developer, I want to include all protein-coding genes in colocalisation neighbourhood features, assuming an H4/CLPP of 0 for non-colocalised genes. This will create a more accurate baseline and boost feature importance for features in the neighbourhood.

Background

Colocalisation neighbourhood features set a baseline by averaging the H4/CLPP of a credible set across genes. Currently, only genes present in the colocalisation results are included in this average.

To better represent the average value in the region (which is then subtracted from the local metric), we want to expand the set of considered genes to all protein-coding genes in the vicinity, assuming an H4/CLPP of 0 for those without a colocalisation result. This will push down the average across genes, so that differences between the local and the neighbourhood metrics are not so extreme. We therefore expect to see a boost in the importance of these features.
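A minimal sketch of the proposed change in plain Python, to make the arithmetic concrete. It is not the actual feature code: the function name `neighbourhood_mean`, the gene identifiers, and the scores are all made up for illustration, and it assumes that genes with a colocalisation result stay in the average even if they are not in the protein-coding set.

```python
from statistics import fmean


def neighbourhood_mean(coloc_scores, genes_in_window=None):
    """Average H4/CLPP across genes for one credible set.

    coloc_scores: gene id -> H4/CLPP taken from the colocalisation results.
    genes_in_window: protein-coding genes in the vicinity (hypothetical input).
        When provided, genes without a colocalisation result count as 0.
    """
    genes = set(coloc_scores)
    if genes_in_window is not None:
        # Assumption: keep every colocalised gene and add the remaining
        # protein-coding genes with an implicit score of 0.
        genes |= set(genes_in_window)
    if not genes:
        return 0.0
    return fmean([coloc_scores.get(gene, 0.0) for gene in genes])


# Illustrative values only.
scores = {"GENE_A": 0.9, "GENE_B": 0.5}
window = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}

current_baseline = neighbourhood_mean(scores)           # mean of 0.9 and 0.5
proposed_baseline = neighbourhood_mean(scores, window)  # mean of 0.9, 0.5, 0, 0

# Neighbourhood feature for GENE_A: local metric minus the regional average.
current_feature = scores["GENE_A"] - current_baseline
proposed_feature = scores["GENE_A"] - proposed_baseline
```

With these numbers the baseline drops from about 0.70 to about 0.35 once the two non-colocalised genes are counted as 0.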

Tasks

[ ] Update the logic in the neighbourhood features
[ ] Measure effects on feature importance
[ ] Measure effects on average metrics

The implementation should be straightforward thanks to the work in https://github.com/opentargets/issues/issues/3552, since gene_index is already a dependency of the colocalisation features.

Consider all protein coding genes in the vicinity of a credible set to extract colocalisation neighbourhood features #3563