ireneisdoomed opened 3 months ago
This PR is relevant to the issue described in https://github.com/opentargets/gentropy/pull/544
After my tests, I concluded that the majority of the computation time is spent in feature extraction (the generation of the long dataframe). There is a lot of logic in that part, so any improvement there will speed up the whole process.
As a developer, I want to optimise feature annotation processing because it will reduce computation time during both the L2G training and prediction phases.
Background
At the moment, L2G annotates all features of the input credible sets at execution time. We defined it like this because of
However, although sensible, in practice this approach makes L2G training under different scenarios inconvenient. Most of the step's computation time goes into feature annotation, so every single L2G training run, in which we annotate all credible sets, takes about 25 minutes. Prediction is affected too: in that step only the credible sets for which we want to extract L2G scores are annotated, yet I still experienced unreasonably long times when extracting predictions for just 30 loci.
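One way to avoid paying the annotation cost on every run is to compute the long feature dataframe once, persist it, and have training and prediction reuse the cached rows. The sketch below is a hypothetical illustration of that idea using pandas (gentropy itself works with Spark dataframes, and the column names and `annotate_features` logic here are placeholders, not the real implementation):

```python
import pandas as pd


def annotate_features(credible_sets: pd.DataFrame) -> pd.DataFrame:
    """Stand-in for the expensive step: build the long
    (studyLocusId, featureName, featureValue) table.

    The real feature logic in gentropy is far more involved; this
    placeholder just derives one toy feature per credible set.
    """
    rows = [
        {
            "studyLocusId": cs["studyLocusId"],
            "featureName": "distanceTssMean",  # hypothetical feature name
            "featureValue": cs["tssDistance"] / 1000.0,
        }
        for _, cs in credible_sets.iterrows()
    ]
    return pd.DataFrame(rows)


class FeatureCache:
    """Pay the annotation cost once, then serve cached rows.

    Repeated training runs (or small prediction requests, e.g. 30 loci)
    filter the precomputed long dataframe instead of re-annotating.
    """

    def __init__(self) -> None:
        self._cache: pd.DataFrame | None = None

    def get(self, credible_sets: pd.DataFrame) -> pd.DataFrame:
        if self._cache is None:
            # Expensive: runs only on the first call.
            self._cache = annotate_features(credible_sets)
        wanted = credible_sets["studyLocusId"]
        return self._cache[self._cache["studyLocusId"].isin(wanted)]
```

In a Spark setting the same pattern would correspond to writing the long dataframe to storage (or persisting it) after the first annotation pass and reading it back for subsequent training or prediction runs.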
Tasks