molgenis / vibe

Variant Interpretation using Biomedical literature Evidence
GNU Lesser General Public License v3.0

Prevent gene overrepresentation in results #35

Open joerivandervelde opened 4 years ago

joerivandervelde commented 4 years ago

In VIBE r2.0, a small number of genes appear in the top of the result lists very often, at least in the Trujillano et al. benchmark set of 305 cases. For instance, at a cutoff of the first 10 hits, NCBI gene 4204 (MECP2) occurs 226x and NCBI gene 8085 (KMT2D) occurs 219x. Strategies to fix this could include different weighting, different sources, or a post-processing step in which we first ascertain all gene ranks for all HPO terms separately and then correct for overinflation. But it's best to tackle this problem as close to the source as possible.
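
To make the post-processing option concrete, here is a rough sketch of one way it could work. It is not VIBE code, and VIBE does not necessarily score genes this way. The assumptions: we already have a ranked gene list per HPO term, we count how many of those lists contain each gene within the first `k` hits, and we down-weight frequently occurring genes with an IDF-style factor (a technique borrowed from text retrieval). The function names, the term labels, `gene_x`, and the scores are all hypothetical toy values.

```python
import math
from collections import Counter


def top_k_frequency(rankings_per_term, k=10):
    """Count, per gene, in how many per-term rankings it appears in the first k hits."""
    counts = Counter()
    for ranked_genes in rankings_per_term.values():
        # set() so a gene is counted at most once per ranking
        counts.update(set(ranked_genes[:k]))
    return counts


def rerank(scored_genes, frequency, n_lists):
    """Multiply each score by an IDF-style weight and sort by the adjusted score.

    The weight is 1.0 for a gene that is in the top k of every list and grows
    as the gene becomes rarer, so ubiquitous genes lose ground to specific ones.
    """
    adjusted = []
    for gene, score in scored_genes:
        weight = math.log((1 + n_lists) / (1 + frequency.get(gene, 0))) + 1
        adjusted.append((gene, score * weight))
    return sorted(adjusted, key=lambda item: item[1], reverse=True)


# Toy input (hypothetical): three per-term rankings of NCBI gene ids.
per_term = {
    "term_a": ["4204", "8085", "gene_y"],
    "term_b": ["4204", "8085", "gene_z"],
    "term_c": ["4204", "gene_x", "8085"],
}
freq = top_k_frequency(per_term, k=2)

# Toy scores (hypothetical) for one query; gene 4204 starts out on top.
query_scores = [("4204", 1.0), ("gene_x", 0.8)]
reranked = rerank(query_scores, freq, n_lists=len(per_term))
print(reranked)
```

In this toy input, gene 4204 is in the top 2 for every term, so its weight stays at 1.0 and the rarer `gene_x` overtakes it after reranking. This only treats the symptom; as noted above, fixing the weighting or the sources is probably the better place to start.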