opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Implement dynamically updating gold standard list #3526

Open addramir opened 1 month ago

addramir commented 1 month ago

@xyg123 please add details.

Related to old issue - https://github.com/opentargets/issues/issues/3526

ireneisdoomed commented 1 month ago

Based on a conversation with @addramir :

We have agreed that defining the gold standard (GS) will be done manually once we have the final feature matrix (FM) and effector gene list (EGL): we will assign credible sets to the EGL and define gold standard positives (GSP) and negatives (GSN). This process should eventually be part of the ETL, but given the limited time we will do it manually for this release and automate it afterwards.
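The assignment step above can be sketched in plain Python. This is a minimal, illustrative take on "assign credible sets to the EGL and define GSP and GSN": all names (`assign_gold_standard`, `locus_to_genes`) are hypothetical, not the actual gentropy implementation.

```python
def assign_gold_standard(
    locus_to_genes: dict[str, list[str]], egl: set[str]
) -> list[tuple[str, str, str]]:
    """Label (studyLocusId, geneId) pairs as positive/negative against the EGL.

    Illustrative sketch only: the real process is manual and operates on
    credible sets, not a precomputed locus-to-genes mapping.
    """
    labels = []
    for study_locus_id, genes in locus_to_genes.items():
        # only credible sets that hit at least one effector gene are informative
        if not egl.intersection(genes):
            continue
        for gene_id in genes:
            label = "positive" if gene_id in egl else "negative"
            labels.append((study_locus_id, gene_id, label))
    return labels


labels = assign_gold_standard(
    {"sl1": ["ENSG_A", "ENSG_B"], "sl2": ["ENSG_C"]},
    egl={"ENSG_A"},
)
# sl1/ENSG_A -> positive, sl1/ENSG_B -> negative, sl2 is dropped entirely
```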

ireneisdoomed commented 3 weeks ago

25/10_manual_egl_yakov

First iteration of a dynamic gold standard list.

Positives were generated manually by @addramir from a list of effector genes derived from the gold standard annotation in production. The 232 positives were expanded with negatives following the same approach currently used in Gentropy (no cherry-picking).

W&B Run: https://wandb.ai/open-targets/gentropy-locus-to-gene/runs/yyfh2qok?nw=nwuseropentargets

The confusion matrix shows less bias towards the negative set, but precision is still poor, probably due to the small training set.
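As a reminder of how low precision shows up in a 2x2 confusion matrix: with few positives, even a modest number of false positives drags precision down while recall can stay reasonable. The counts below are made up for illustration, not the actual run's values.

```python
# Illustrative confusion-matrix counts (NOT the actual run's values)
tp, fp, fn, tn = 17, 20, 15, 180

precision = tp / (tp + fp)  # fraction of predicted positives that are true
recall = tp / (tp + fn)     # fraction of true positives that are recovered

print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.46 recall=0.53
```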

Predictions are in gs://ot-team/irene/dynamic_l2g/25-10_manual_egl_yakov


Code to reproduce
```python
import pyspark.sql.functions as f
from sklearn.ensemble import GradientBoostingClassifier
from wandb import login as wandb_login

from gentropy.common.session import Session
from gentropy.common.utils import access_gcp_secret
from gentropy.dataset.l2g_feature_matrix import L2GFeatureMatrix
from gentropy.dataset.l2g_gold_standard import L2GGoldStandard
from gentropy.dataset.l2g_prediction import L2GPrediction
from gentropy.dataset.study_locus import StudyLocus
from gentropy.dataset.variant_index import VariantIndex
from gentropy.datasource.open_targets.l2g_gold_standard import (
    OpenTargetsL2GGoldStandard,
)
from gentropy.method.l2g.model import LocusToGeneModel
from gentropy.method.l2g.trainer import LocusToGeneTrainer

session = Session("yarn")

## Build gold standard list
variant_index = VariantIndex.from_parquet(
    session, "gs://ot_orchestration/releases/24.10_freeze4/variant_index"
)
credible_set = StudyLocus.from_parquet(
    session,
    "gs://ot_orchestration/releases/24.10_freeze4/credible_set",
    recursiveFileLookup=True,
)

features_list = [
    # max CLPP for each (study, locus, gene) aggregating over a specific qtl type
    "eQtlColocClppMaximum",
    "pQtlColocClppMaximum",
    "sQtlColocClppMaximum",
    # max H4 for each (study, locus, gene) aggregating over a specific qtl type
    "eQtlColocH4Maximum",
    "pQtlColocH4Maximum",
    "sQtlColocH4Maximum",
    # max CLPP for each (study, locus, gene) aggregating over a specific qtl type
    # and in relation with the mean in the vicinity
    "eQtlColocClppMaximumNeighbourhood",
    "pQtlColocClppMaximumNeighbourhood",
    "sQtlColocClppMaximumNeighbourhood",
    # max H4 for each (study, locus, gene) aggregating over a specific qtl type
    # and in relation with the mean in the vicinity
    "eQtlColocH4MaximumNeighbourhood",
    "pQtlColocH4MaximumNeighbourhood",
    "sQtlColocH4MaximumNeighbourhood",
    # distance to gene footprint
    "distanceSentinelFootprint",
    "distanceSentinelFootprintNeighbourhood",
    "distanceFootprintMean",
    "distanceFootprintMeanNeighbourhood",
    # distance to gene tss
    "distanceTssMean",
    "distanceTssMeanNeighbourhood",
    "distanceSentinelTss",
    "distanceSentinelTssNeighbourhood",
    # vep
    "vepMaximum",
    "vepMaximumNeighbourhood",
    "vepMean",
    "vepMeanNeighbourhood",
]

full_fm = L2GFeatureMatrix(
    _df=session.load_data(
        "gs://ot_orchestration/releases/24.10_freeze4/locus_to_gene_feature_matrix"
    ),
    features_list=features_list,
)

positives_df = (
    session.load_data(
        "gs://genetics-portal-dev-analysis/yt4/20241024_EGL_playground/GSP_cs_selected.parquet"
    )
    .join(
        credible_set.df.select("studyLocusId", "variantId"),
        "studyLocusId",
    )
    .select(
        "studyId",
        "studyLocusId",
        "variantId",
        "geneId",
        f.lit("positive").alias("goldStandardSet"),
    )
    .distinct()
)

positives_negatives_df = OpenTargetsL2GGoldStandard.expand_gold_standard_with_negatives(
    positives_df, variant_index
)
gs = L2GGoldStandard(_df=positives_negatives_df, _schema=L2GGoldStandard.get_schema())
fm = gs.build_feature_matrix(full_fm, credible_set).fill_na()

# Train
wandb_key = access_gcp_secret("wandb-key", "open-targets-genetics-dev")
l2g_model = LocusToGeneModel(
    model=GradientBoostingClassifier(random_state=42),
    hyperparameters={
        "n_estimators": 100,
        "max_depth": 5,
        "loss": "log_loss",
    },
)
wandb_login(key=wandb_key)
trained_model = LocusToGeneTrainer(
    model=l2g_model,
    feature_matrix=fm,
).train("25/10_manual_egl_yakov")

predictions = L2GPrediction.from_credible_set(
    session,
    credible_set,
    full_fm,
    features_list,
    model_path=None,
    hf_token=access_gcp_secret("hfhub-key", "open-targets-genetics-dev"),
    download_from_hub=True,
)
```
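The `*Neighbourhood` features above relate each gene's raw value to the mean across genes in the vicinity of the locus. One plausible formulation, as a self-contained sketch (illustrative only; gentropy's exact definition may differ, e.g. a ratio instead of a difference, or excluding the gene itself from the mean):

```python
def neighbourhood_feature(values: dict[str, float]) -> dict[str, float]:
    """Express each gene's feature value relative to the locus-wide mean.

    Hypothetical sketch of a "Neighbourhood" feature, not the actual
    gentropy implementation.
    """
    mean = sum(values.values()) / len(values)
    return {gene: value - mean for gene, value in values.items()}


print(neighbourhood_feature({"ENSG_A": 1.0, "ENSG_B": 3.0}))
# {'ENSG_A': -1.0, 'ENSG_B': 1.0}
```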
ireneisdoomed commented 1 week ago

05-11_manual_egl_old_l2g_hits

New L2G model trained using:

+---------------+------+                                                        
|goldStandardSet| count|
+---------------+------+
|       positive|  8536|
|       negative|197832|
+---------------+------+
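For reference, the counts above imply a heavily imbalanced training set:

```python
# Class balance implied by the goldStandardSet counts above
positives, negatives = 8536, 197832
total = positives + negatives

print(f"positive fraction: {positives / total:.1%}")           # 4.1%
print(f"negatives per positive: {negatives / positives:.1f}")  # 23.2
```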

W&B run: https://wandb.ai/open-targets/gentropy-locus-to-gene/runs/pvuqm9br?nw=nwuseropentargets

Metrics and the confusion matrix look really good. The feature importances are particularly interesting: they move away from distance.

The model is uploaded here: https://huggingface.co/opentargets/locus_to_gene_egl. The feature matrix and model are also saved in gs://ot-team/irene/dynamic_l2g/

cc @addramir Instead of using associations with a high L2G score for training, could we use them as the test set for the model trained on the manual EGL?
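The proposal above amounts to holding the old high-L2G associations out of training entirely and using them purely for evaluation. A minimal sketch, with illustrative `(studyLocusId, geneId)` tuples standing in for real data:

```python
# Hypothetical stand-ins for the manual EGL gold standard and the old
# high-L2G associations (not real identifiers)
manual_egl = [("sl1", "gA"), ("sl2", "gB"), ("sl3", "gC")]
old_l2g_hits = [("sl2", "gB"), ("sl4", "gD")]

# Keep the old hits strictly out of training; evaluate on them instead
test_set = set(old_l2g_hits)
train_set = [pair for pair in manual_egl if pair not in test_set]

print(train_set)  # [('sl1', 'gA'), ('sl3', 'gC')]
```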


Code to reproduce
```python
import pyspark.sql.functions as f
from sklearn.ensemble import GradientBoostingClassifier
from wandb import login as wandb_login

from gentropy.common.session import Session
from gentropy.common.utils import access_gcp_secret
from gentropy.dataset.l2g_feature_matrix import L2GFeatureMatrix
from gentropy.dataset.l2g_gold_standard import L2GGoldStandard
from gentropy.dataset.l2g_prediction import L2GPrediction
from gentropy.dataset.study_locus import StudyLocus
from gentropy.dataset.variant_index import VariantIndex
from gentropy.datasource.open_targets.l2g_gold_standard import (
    OpenTargetsL2GGoldStandard,
)
from gentropy.method.l2g.model import LocusToGeneModel
from gentropy.method.l2g.trainer import LocusToGeneTrainer

session = Session("yarn")

## Build gold standard list
variant_index = VariantIndex.from_parquet(
    session, "gs://ot_orchestration/releases/24.10_freeze6/variant_index"
)
credible_set = StudyLocus.from_parquet(
    session,
    "gs://ot_orchestration/releases/24.10_freeze6/credible_set",
    recursiveFileLookup=True,
)

features_list = [
    # max CLPP for each (study, locus, gene) aggregating over a specific qtl type
    "eQtlColocClppMaximum",
    "pQtlColocClppMaximum",
    "sQtlColocClppMaximum",
    # max H4 for each (study, locus, gene) aggregating over a specific qtl type
    "eQtlColocH4Maximum",
    "pQtlColocH4Maximum",
    "sQtlColocH4Maximum",
    # max CLPP for each (study, locus, gene) aggregating over a specific qtl type
    # and in relation with the mean in the vicinity
    "eQtlColocClppMaximumNeighbourhood",
    "pQtlColocClppMaximumNeighbourhood",
    "sQtlColocClppMaximumNeighbourhood",
    # max H4 for each (study, locus, gene) aggregating over a specific qtl type
    # and in relation with the mean in the vicinity
    "eQtlColocH4MaximumNeighbourhood",
    "pQtlColocH4MaximumNeighbourhood",
    "sQtlColocH4MaximumNeighbourhood",
    # distance to gene footprint
    "distanceSentinelFootprint",
    "distanceSentinelFootprintNeighbourhood",
    "distanceFootprintMean",
    "distanceFootprintMeanNeighbourhood",
    # distance to gene tss
    "distanceTssMean",
    "distanceTssMeanNeighbourhood",
    "distanceSentinelTss",
    "distanceSentinelTssNeighbourhood",
    # vep
    "vepMaximum",
    "vepMaximumNeighbourhood",
    "vepMean",
    "vepMeanNeighbourhood",
    # other
    "geneCount500kb",
    "proteinGeneCount500kb",
    "credibleSetConfidence",
    "isProteinCoding",
]

full_fm = L2GFeatureMatrix(
    _df=session.load_data("gs://ot-team/irene/l2g/05112024/feature_matrix_filtered")
)

positives_df = (
    session.load_data(
        "gs://genetics-portal-dev-analysis/yt4/20241024_EGL_playground/GSP_cs_selected_freez5_v4.parquet"
    )
    .join(
        credible_set.df.select("studyLocusId", "variantId"),
        "studyLocusId",
    )
    .select(
        "studyId",
        "studyLocusId",
        "variantId",
        "geneId",
        f.lit("positive").alias("goldStandardSet"),
    )
    .distinct()
)

positives_negatives_df = OpenTargetsL2GGoldStandard.expand_gold_standard_with_negatives(
    positives_df, variant_index
)
gs = L2GGoldStandard(_df=positives_negatives_df, _schema=L2GGoldStandard.get_schema())
fm = gs.build_feature_matrix(full_fm, credible_set).fill_na().persist()
fm._df.write.parquet(
    "gs://ot-team/irene/dynamic_l2g/05-11_manual_egl_old_l2g_hits_feature_matrix"
)

## Train
wandb_key = access_gcp_secret("wandb-key", "open-targets-genetics-dev")
fm = L2GFeatureMatrix(
    _df=session.load_data(
        "gs://ot-team/irene/dynamic_l2g/05-11_manual_egl_old_l2g_hits_feature_matrix"
    ),
    with_gold_standard=True,
)
l2g_model = LocusToGeneModel(
    model=GradientBoostingClassifier(random_state=42),
    hyperparameters={
        "n_estimators": 100,
        "max_depth": 5,
        "loss": "log_loss",
    },
)
wandb_login(key=wandb_key)
trained_model = LocusToGeneTrainer(
    model=l2g_model,
    feature_matrix=fm,
).train("05/11_manual_egl_old_l2g_hits")
trained_model.save("gs://ot-team/irene/dynamic_l2g/05-11_manual_egl_old_l2g_hits.skops")

hf_hub_token = access_gcp_secret("hfhub-key", "open-targets-genetics-dev")
trained_model.export_to_hugging_face_hub(
    # we upload the model in the filesystem
    "05-11_manual_egl_old_l2g_hits.skops",
    hf_hub_token,
    data=trained_model.training_data._df.drop("goldStandardSet", "geneId").toPandas(),
    commit_message="chore: update model (run 05-11_manual_egl_old_l2g_hits)",
    repo_id="opentargets/locus_to_gene_egl",
)
```