Closed addramir closed 1 week ago
The draft plan:
1. Take Eric Fauman's list (we are not using it for training).
2. Select the best CS as the gold positive and assign gold negatives, similar to what we do for training.
3. Use it to validate the model, e.g. count FP, FN, TP, and TN at l2g >= 0.5, and compare the result with the distance-only approach (closest gene by TSS) and the holistic approach.
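A minimal sketch of step 3, assuming each validation record carries an L2G score, a distance to the TSS, and a gold label. The field names (`locus`, `gene`, `l2g`, `tss_distance`, `gold`) and the toy records are hypothetical and only illustrate the comparison; they are not the actual schema of the validation set, and the holistic approach is not covered here.

```python
def confusion_counts(rows, threshold=0.5):
    """Count TP/FP/FN/TN for (score, is_gold_positive) pairs at a threshold."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for score, is_positive in rows:
        predicted = score >= threshold
        if predicted and is_positive:
            counts["TP"] += 1
        elif predicted and not is_positive:
            counts["FP"] += 1
        elif not predicted and is_positive:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


def closest_tss_rows(records):
    """Distance baseline: per locus, only the gene closest to the TSS scores 1.0."""
    best = {}
    for r in records:
        current = best.get(r["locus"])
        if current is None or r["tss_distance"] < current["tss_distance"]:
            best[r["locus"]] = r
    return [(1.0 if best[r["locus"]] is r else 0.0, r["gold"]) for r in records]


# Hypothetical toy records: two loci, one gold positive each.
records = [
    {"locus": "cs1", "gene": "A", "l2g": 0.8, "tss_distance": 50_000, "gold": True},
    {"locus": "cs1", "gene": "B", "l2g": 0.3, "tss_distance": 10_000, "gold": False},
    {"locus": "cs2", "gene": "C", "l2g": 0.4, "tss_distance": 5_000, "gold": True},
    {"locus": "cs2", "gene": "D", "l2g": 0.2, "tss_distance": 90_000, "gold": False},
]

l2g_counts = confusion_counts([(r["l2g"], r["gold"]) for r in records])
distance_counts = confusion_counts(closest_tss_rows(records))
print("L2G >= 0.5:", l2g_counts)
print("closest TSS:", distance_counts)
```

Both methods go through the same `confusion_counts` function, so the resulting counts are directly comparable.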
This is v5.1_full for cross validation: gs://genetics-portal-dev-analysis/yt4/20241024_EGL_playground/training_set/v5_1_full.json
This is v5.1_validation for validation (== full - 5.1_training): gs://genetics-portal-dev-analysis/yt4/20241024_EGL_playground/training_set/v5_1_validation.json
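The "full minus training" relationship can be sketched as a set difference on gene-EFO pairs. This is only an illustration: the key names `geneId` and `efoId` and the in-memory records are assumptions, not the actual layout of the JSON files above.

```python
def pair_key(record):
    """Identify a record by its (gene, EFO) pair; key names are hypothetical."""
    return (record["geneId"], record["efoId"])


def held_out(full, training):
    """Return the records in `full` whose gene-EFO pair never appears in `training`."""
    training_keys = {pair_key(r) for r in training}
    return [r for r in full if pair_key(r) not in training_keys]


# Hypothetical toy data standing in for the full and training sets.
full_set = [
    {"geneId": "G1", "efoId": "EFO_1"},
    {"geneId": "G2", "efoId": "EFO_1"},
    {"geneId": "G3", "efoId": "EFO_2"},
]
training_set = [{"geneId": "G1", "efoId": "EFO_1"}]

validation_set = held_out(full_set, training_set)
print(validation_set)
```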
Related to https://github.com/opentargets/issues/issues/3500. As discussed before, we should select a list of gene-EFO pairs for additional validation of the resulting L2G scores. These out-of-sample effector genes will not participate in training the model. The current idea is to use Eric Fauman's list of genes, since we only use our old curated list and ChEMBL for training.