ireneisdoomed opened 3 months ago
This PR is relevant to the issue described in https://github.com/opentargets/gentropy/pull/544
After my tests, I concluded that the majority of the computation time is spent in feature extraction (the generation of the long dataframe). There is a lot of logic in that part, so any improvement there will speed up the whole process.
As a developer, I want to optimise feature annotation processing because it will reduce computation time during both the L2G training and prediction phases.
Background
At the moment, L2G annotates all features of the input credible sets at execution time. We defined it like this because of
However, although sensible, in practice this approach makes L2G training under different scenarios inconvenient. Most of the step's computation time goes into feature annotation, so every single L2G training run, in which we annotate all credible sets, takes about 25 minutes. Prediction is affected too: in that step only the credible sets for which we want to extract L2G scores are annotated, yet I still experienced unreasonably long times when extracting predictions for just 30 loci.
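One way to avoid paying the annotation cost on every run is to compute the long feature dataframe once, persist it, and have training and prediction reuse the cached rows. The sketch below is a hypothetical illustration of that idea using pandas (gentropy itself works with Spark dataframes, and the column names and `annotate_features` logic here are placeholders, not the real implementation):

```python
import pandas as pd


def annotate_features(credible_sets: pd.DataFrame) -> pd.DataFrame:
    """Stand-in for the expensive step: build the long
    (studyLocusId, featureName, featureValue) table.

    The real feature logic in gentropy is far more involved; this
    placeholder just derives one toy feature per credible set.
    """
    rows = [
        {
            "studyLocusId": cs["studyLocusId"],
            "featureName": "distanceTssMean",  # hypothetical feature name
            "featureValue": cs["tssDistance"] / 1000.0,
        }
        for _, cs in credible_sets.iterrows()
    ]
    return pd.DataFrame(rows)


class FeatureCache:
    """Pay the annotation cost once, then serve cached rows.

    Repeated training runs (or small prediction requests, e.g. 30 loci)
    filter the precomputed long dataframe instead of re-annotating.
    """

    def __init__(self) -> None:
        self._cache: pd.DataFrame | None = None

    def get(self, credible_sets: pd.DataFrame) -> pd.DataFrame:
        if self._cache is None:
            # Expensive: runs only on the first call.
            self._cache = annotate_features(credible_sets)
        wanted = credible_sets["studyLocusId"]
        return self._cache[self._cache["studyLocusId"].isin(wanted)]
```

In a Spark setting the same pattern would correspond to writing the long dataframe to storage (or persisting it) after the first annotation pass and reading it back for subsequent training or prediction runs.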
Tasks