Closed ireneisdoomed closed 1 month ago
As we discussed, this is just a change in the denominator used to calculate the mean. Suppose a region has 10 protein-coding genes, but coloc results exist for only two of them, e.g. h4_1 = 0.3 and h4_2 = 0.4. Currently the mean is calculated as E = (0.3 + 0.4) / 2 = 0.35, but it should be E = (0.3 + 0.4) / (number of protein-coding genes) = (0.3 + 0.4) / 10 = 0.07.
As a developer, I want to include all protein-coding genes in colocalisation neighbourhood features, assuming an H4/CLPP of 0 for non-colocalised genes. This should create a more accurate baseline and is expected to boost the importance of the neighbourhood features.
Background
Colocalisation neighbourhood features calculate the average H4/CLPP for a credible set across genes to set a baseline. Currently, only genes that are present in the colocalisation results are considered when computing this average.
To create a more accurate representation of the average value in the region (which is then subtracted from the local metric), we want to expand the set of considered genes to all protein-coding genes. We will assume that the H4/CLPP of genes without a colocalisation result is 0. This will push down the average metric for all genes, so that differences between the local and the neighbourhood metrics are not so extreme. We therefore expect to see a boost in the importance of these features.
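To make the change in denominator concrete, here is a minimal sketch in plain Python. It is only an illustration: the function names, the gene identifiers and the dictionary-based inputs are hypothetical, and the real feature code presumably works on the colocalisation and gene index datasets rather than on Python dicts.

```python
from statistics import fmean


def mean_over_colocalised_genes(coloc_scores: dict[str, float]) -> float:
    """Current behaviour: average only over genes that have a coloc result."""
    return fmean(coloc_scores.values()) if coloc_scores else 0.0


def mean_over_protein_coding_genes(
    coloc_scores: dict[str, float], protein_coding_genes: list[str]
) -> float:
    """Proposed behaviour: average over all protein-coding genes in the region.

    Genes without a coloc result are assumed to have an H4/CLPP of 0.
    """
    if not protein_coding_genes:
        return 0.0
    total = sum(coloc_scores.get(gene, 0.0) for gene in protein_coding_genes)
    return total / len(protein_coding_genes)


# Hypothetical region with 10 protein-coding genes, two of which colocalise.
genes = [f"GENE{i}" for i in range(1, 11)]
scores = {"GENE1": 0.3, "GENE2": 0.4}

current = mean_over_colocalised_genes(scores)
proposed = mean_over_protein_coding_genes(scores, genes)

print(f"current mean:  {current:.2f}")
print(f"proposed mean: {proposed:.2f}")
```

With the numbers from the example in this issue, the current mean is (0.3 + 0.4) / 2 and the proposed mean is (0.3 + 0.4) / 10.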
Tasks
The implementation should not be difficult thanks to the work in https://github.com/opentargets/issues/issues/3552, as `gene_index` is already a dependency of the colocalisation features.