opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Optimise feature matrix management to accelerate L2G Training and Prediction #3252

Open ireneisdoomed opened 3 months ago

ireneisdoomed commented 3 months ago

As a developer, I want to optimise feature annotation processing because it will reduce computation time during the L2G training and prediction phases.

Background

L2G at the moment annotates all features of the input credible sets at execution time. We defined it like this because of

However, although sensible, this approach makes it inconvenient to train L2G under different scenarios. Most of the step's computation time goes into feature annotation, so every L2G training run, in which we annotate all credible sets, takes about 25 minutes. Prediction is affected too: in that step only the credible sets for which we want L2G scores are annotated, yet extracting predictions for 30 loci still took an unreasonably long time.
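To make the idea concrete, here is a minimal sketch of annotating each credible set once and reusing the result for both training and prediction. It is plain Python and not gentropy code: `FeatureMatrixCache`, `toy_annotate`, the `distance_feature` name, and the credible set IDs are all hypothetical and only illustrate the reuse pattern.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical type: one feature row maps feature names to values.
FeatureRow = Dict[str, float]


class FeatureMatrixCache:
    """Annotate each credible set at most once and reuse the stored row."""

    def __init__(self, annotate: Callable[[str], FeatureRow]) -> None:
        self._annotate = annotate  # the expensive annotation function
        self._rows: Dict[str, FeatureRow] = {}
        self.annotation_calls = 0  # counter kept for illustration only

    def get(self, credible_set_ids: Iterable[str]) -> List[FeatureRow]:
        rows = []
        for cs_id in credible_set_ids:
            if cs_id not in self._rows:
                # Only credible sets not seen before are annotated.
                self._rows[cs_id] = self._annotate(cs_id)
                self.annotation_calls += 1
            rows.append(self._rows[cs_id])
        return rows


def toy_annotate(cs_id: str) -> FeatureRow:
    # Stand-in for the real (slow) feature annotation logic.
    return {"distance_feature": float(len(cs_id))}


cache = FeatureMatrixCache(toy_annotate)
training_rows = cache.get(["cs1", "cs2", "cs3"])  # annotates three sets
prediction_rows = cache.get(["cs2", "cs3"])       # reuses the cached rows
```

With this pattern the second call does no annotation work, so the cost is paid once per credible set rather than once per run. Whether the stored matrix lives in memory or on disk is a separate design choice that this sketch does not address.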

Tasks

ireneisdoomed commented 3 months ago

This PR is relevant to the issue described here: https://github.com/opentargets/gentropy/pull/544

From my tests, I concluded that most of the computation time goes into feature extraction (the generation of the long dataframe). There is a lot of logic in that part, so any improvement there will speed up the whole process.
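For anyone who wants to reproduce this kind of breakdown, a rough sketch of per-stage timing follows. The stage names and the toy workloads are hypothetical placeholders, not the actual L2G step; the only real API used is the standard library's `time.perf_counter` and `contextlib.contextmanager`.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator

# Accumulated wall-clock seconds per stage name.
timings: Dict[str, float] = {}


@contextmanager
def timed(stage: str) -> Iterator[None]:
    """Add the elapsed time of the wrapped block to timings[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[stage] = timings.get(stage, 0.0) + elapsed


# Placeholder for building the long (credible set, feature, value) table.
with timed("feature_extraction"):
    long_rows = [(cs, f, 0.0) for cs in range(1000) for f in range(10)]

# Placeholder for whatever consumes the features afterwards.
with timed("model_fit"):
    total = sum(value for _, _, value in long_rows)

# Name of the stage that took the longest in this run.
slowest = max(timings, key=timings.get)
```

Note that with lazily evaluated engines such as Spark, a stage is only timed meaningfully if the wrapped block triggers an action, otherwise the cost shows up in a later stage.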