Open addramir opened 1 month ago
Based on a conversation with @addramir:
We agreed that the gold standard (GS) will be defined manually once we have the final feature matrix (FM) and effector gene list (EGL): we will assign credible sets to the EGL and define the gold standard positives (GSP) and negatives (GSN). This process should be part of the ETL, but because time is limited we have to do it manually for this release. After that we will have time to automate it and make it part of the ETL.
First iteration of a dynamic gold standard list.
Positives were generated manually by @addramir from a list of effector genes derived from the gold standard annotation in production. The 232 positives were expanded with negatives following the same approach we currently use in Gentropy (no cherry-picking).
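The expansion step can be sketched roughly as below. This is a minimal illustration, not the Gentropy implementation: it assumes that every other candidate gene at a locus with a curated positive is treated as a negative, and the function name, identifiers, and example pairs are all hypothetical.

```python
def expand_with_negatives(positives, candidates):
    """Label candidate (locus, gene) pairs at loci that have a curated positive.

    positives:  set of (locus_id, gene_id) pairs curated as effector genes.
    candidates: iterable of (locus_id, gene_id) pairs, e.g. feature matrix rows.
    Assumption: any non-positive gene at a locus with a positive is a negative.
    """
    positive_loci = {locus for locus, _ in positives}
    labelled = []
    for locus, gene in candidates:
        if locus not in positive_loci:
            continue  # loci without a curated positive are left out entirely
        label = "positive" if (locus, gene) in positives else "negative"
        labelled.append((locus, gene, label))
    return labelled


# Hypothetical example: one curated positive and three candidate pairs.
positives = {("locus1", "GENE_A")}
candidates = [("locus1", "GENE_A"), ("locus1", "GENE_B"), ("locus2", "GENE_C")]
gold_standard = expand_with_negatives(positives, candidates)
print(gold_standard)
```

Because negatives are taken from all remaining candidates rather than hand-picked, the negative set grows much faster than the positive set.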
W&B Run: https://wandb.ai/open-targets/gentropy-locus-to-gene/runs/yyfh2qok?nw=nwuseropentargets
The confusion matrix shows less bias towards the negative set, but precision is still poor, probably because the training set is small.
Predictions are in gs://ot-team/irene/dynamic_l2g/25-10_manual_egl_yakov
New L2G model trained using:
+---------------+------+
|goldStandardSet| count|
+---------------+------+
|       positive|  8536|
|       negative|197832|
+---------------+------+
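For reference, the class balance implied by the counts above can be worked out in a few lines. The variable names are illustrative only. Whether the L2G training applies any class weighting is not stated in this thread.

```python
# Counts taken from the goldStandardSet table above.
n_positive = 8536
n_negative = 197832

total = n_positive + n_negative
positive_fraction = n_positive / total
negatives_per_positive = n_negative / n_positive

print(f"total pairs: {total}")
print(f"positive fraction: {positive_fraction:.4f}")
print(f"negatives per positive: {negatives_per_positive:.2f}")
```

With roughly 23 negatives per positive, precision and recall on the positive class are more informative than overall accuracy.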
W&B run: https://wandb.ai/open-targets/gentropy-locus-to-gene/runs/pvuqm9br?nw=nwuseropentargets
Metrics and confusion matrix look really good. The feature importances are especially interesting: the model is moving away from distance.
Model is uploaded here: https://huggingface.co/opentargets/locus_to_gene_egl
The feature matrix and model are also saved here: gs://ot-team/irene/dynamic_l2g/
cc @addramir Instead of using associations with high L2G for training, could we use them as the testing dataset of our model trained on the manual EGL?
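One way this check could look is sketched below. It is a sketch under assumptions: the 0.8 cutoff for "high L2G", the 0.5 decision threshold, the function name, and all example scores are hypothetical. Since the test labels come from an earlier model rather than from curation, the result measures agreement with that model, not accuracy.

```python
def recall_on_pseudo_positives(pseudo_positives, new_scores, threshold=0.5):
    """Fraction of pseudo-positive pairs that the new model scores >= threshold.

    Pairs missing from new_scores count as not recovered (score 0.0).
    """
    if not pseudo_positives:
        raise ValueError("pseudo_positives is empty")
    hits = sum(
        1 for pair in pseudo_positives if new_scores.get(pair, 0.0) >= threshold
    )
    return hits / len(pseudo_positives)


# Hypothetical scores keyed by (locus_id, gene_id).
old_scores = {("l1", "A"): 0.95, ("l1", "B"): 0.10, ("l2", "C"): 0.85, ("l3", "D"): 0.90}
new_scores = {("l1", "A"): 0.80, ("l1", "B"): 0.20, ("l2", "C"): 0.30, ("l3", "D"): 0.70}

# Associations with a high score from the earlier model form the test set.
pseudo_positives = {pair for pair, score in old_scores.items() if score >= 0.8}
recall = recall_on_pseudo_positives(pseudo_positives, new_scores)
print(f"recovered {recall:.2%} of high-L2G associations")
```

For the comparison to be meaningful, loci used to train the EGL model would need to be excluded from this test set.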
@xyg123 please add details.
Related to an old issue: https://github.com/opentargets/issues/issues/3526