opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Consider all protein coding genes in the vicinity of a credible set to extract colocalisation neighbourhood features #3563

Closed ireneisdoomed closed 1 month ago

ireneisdoomed commented 1 month ago

As a developer, I want to include all protein-coding genes in the colocalisation neighbourhood features, assuming an H4/CLPP of 0 for non-colocalised genes. This should give a more accurate baseline and increase the importance of the neighbourhood features.

Background

Colocalisation neighbourhood features calculate the average H4/CLPP for a credible set across genes to set a baseline. Currently, only genes that are present in the colocalisation results are considered when computing this average.

To get a more accurate picture of the average value in the region (which is then subtracted from the local metric), we want to expand the set of considered genes to all protein-coding genes in the vicinity of the credible set. We will assume that the H4/CLPP of genes without a colocalisation result is 0. This will push down the average metric for all genes, so that differences between the local and the neighbourhood metrics are not so extreme. We therefore expect to see a boost in the importance of these features.
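A minimal sketch of the proposed calculation, assuming the feature is the local H4/CLPP minus the neighbourhood mean as described above. The function name, the plain-dict inputs, and the gene IDs are hypothetical and only for illustration; this is not the pipeline's actual implementation.

```python
def neighbourhood_features(coloc_scores, protein_coding_genes):
    """Local score minus neighbourhood mean, for one credible set.

    coloc_scores: dict of gene ID -> H4/CLPP for genes with a coloc result.
    protein_coding_genes: gene IDs of all protein-coding genes in the vicinity
        (hypothetical input; assumed to come from the gene index).
    """
    genes = list(protein_coding_genes)
    if not genes:
        return {}
    # Genes without a colocalisation result are assumed to have a score of 0.
    filled = {gene: coloc_scores.get(gene, 0.0) for gene in genes}
    # The mean is taken over every protein-coding gene, not only colocalised ones.
    neighbourhood_mean = sum(filled.values()) / len(filled)
    return {gene: score - neighbourhood_mean for gene, score in filled.items()}


# Illustrative input: 10 protein-coding genes, 2 of them with a coloc result.
genes_in_vicinity = [f"GENE{i}" for i in range(1, 11)]
features = neighbourhood_features({"GENE1": 0.3, "GENE2": 0.4}, genes_in_vicinity)
```

In this sketch, genes that appear in `coloc_scores` but are not in `protein_coding_genes` are ignored, which is one possible reading of "all protein-coding genes" and may differ from the final design.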

Tasks

The implementation should be straightforward thanks to the work in https://github.com/opentargets/issues/issues/3552, as gene_index is already a dependency of the colocalisation features.

addramir commented 1 month ago

As we discussed, this is just a change in the denominator used to calculate the mean. Suppose you have 10 protein-coding genes, but coloc results for only two of them, e.g. h4_1=0.3, h4_2=0.4. Currently, the mean is calculated as E = (0.3+0.4)/2 = 0.35, but it should be E = (0.3+0.4)/(number of protein-coding genes) = (0.3+0.4)/10 = 0.07.
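The numbers in this example can be checked with a few lines of Python. The sketch below also shows that dividing by the number of protein-coding genes gives the same mean as explicitly assigning an H4 of 0 to the genes without a coloc result, so the two descriptions in this thread are equivalent. Variable names are illustrative only.

```python
import math

h4_with_coloc = [0.3, 0.4]  # the two coloc results from the example
n_protein_coding = 10       # protein-coding genes in the vicinity

# Current behaviour: divide by the number of genes that have a coloc result.
current_mean = sum(h4_with_coloc) / len(h4_with_coloc)

# Proposed behaviour: divide by the number of protein-coding genes.
proposed_mean = sum(h4_with_coloc) / n_protein_coding

# Equivalent view: give the remaining 8 genes an H4 of 0 and average over all 10.
zero_filled = h4_with_coloc + [0.0] * (n_protein_coding - len(h4_with_coloc))
zero_filled_mean = sum(zero_filled) / len(zero_filled)

assert math.isclose(proposed_mean, zero_filled_mean)
```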