ireneisdoomed opened this issue 2 weeks ago:
I have prepared a dataset with the cases where more than one gene is assigned to a credible set, together with some quality flags based on feature availability, so that now I have a dataframe like this one:
+--------------------------------+---------------+------------------+----------------------------------------------+
|studyLocusId |geneId |score |qualityControls |
+--------------------------------+---------------+------------------+----------------------------------------------+
|696aa17876038f0b46ac65c5e27d3fa7|ENSG00000075213|0.982434345980689 |[] |
|696aa17876038f0b46ac65c5e27d3fa7|ENSG00000170381|0.7804732926950384|[VEP features missing, Coloc features missing]|
+--------------------------------+---------------+------------------+----------------------------------------------+
Looking at cases like this, it is clear why ENSG00000075213 is the top-ranked gene. It is also clear that, regardless of the iterations we do on the VEP or Colocalisation features to fine-tune them, we are not going to improve the selectivity, because we simply lack any functional genomics data for ENSG00000170381.
My aim was to systematically analyse how often we have this scenario.
I think the most informative table is this one, where I try to understand the source of selectivity issues:
+-----------------------------------------------------------+--------------+------------------------------------------+
|flags |granular_count|general_flag |
+-----------------------------------------------------------+--------------+------------------------------------------+
|[First and incomplete VEP, Second and incomplete VEP] |31002 |Both Incomplete - same group |
|[Second and incomplete VEP, First and complete features] |17318 |Only first complete |
|[Second and incomplete VEP, First and incomplete Coloc] |15491 |Both Incomplete - different groups |
|[Second and complete features, First and complete features]|14189 |Both Complete |
|[Second and incomplete Coloc, First and incomplete Coloc] |7475 |Both Incomplete - same group |
|[First and incomplete VEP, Second and complete features] |5308 |Only second complete |
|[Second and incomplete Coloc, First and complete features] |4895 |Only first complete |
|[Second and incomplete Coloc, First and incomplete VEP] |4597 |Both Incomplete - different groups |
|[Second and complete features, First and incomplete Coloc] |3248 |Only second complete |
|[First and incomplete VEP] |2764 |Genes have same score |
|[First and incomplete Coloc] |32 |Genes have same score |
|[First and complete features] |6 |Genes have same score |
|[First and incomplete Coloc, First and complete features] |1 |Genes have same score - different features|
+-----------------------------------------------------------+--------------+------------------------------------------+
The "Only first complete" category accounts for 22,213 credible sets (17,318 + 4,895 above), where changes in the feature modelling won't yield any benefit because there simply is no data for the runner-up gene. This is the case of the example highlighted above. It'd be good to reproduce this analysis in future iterations. Code is here: https://gist.github.com/ireneisdoomed/31086fe5ef8ddb76dd06d22c54950e2d
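For future reference, a minimal sketch of how these flags can be derived, assuming a `predictions` dataframe shaped like the one above; the flag names are simplified and the exact logic (including the VEP/Coloc split and the `general_flag` mapping) lives in the gist:

```python
from pyspark.sql import Window
from pyspark.sql import functions as f

# Rank genes within each credible set by their L2G score.
w = Window.partitionBy("studyLocusId").orderBy(f.desc("score"))

flag_counts = (
    predictions
    .withColumn("rank", f.row_number().over(w))
    # Only the two top-ranked genes matter for the selectivity analysis.
    .filter(f.col("rank") <= 2)
    .withColumn(
        "flag",
        f.concat(
            f.when(f.col("rank") == 1, f.lit("First")).otherwise(f.lit("Second")),
            # The real flags also split incompleteness into VEP vs Coloc
            # based on the contents of qualityControls.
            f.when(f.size("qualityControls") == 0, f.lit(" and complete features"))
            .otherwise(f.lit(" and incomplete")),
        ),
    )
    .groupBy("studyLocusId")
    .agg(f.collect_set("flag").alias("flags"))
    .groupBy("flags")
    .count()
    .orderBy(f.desc("count"))
)
flag_counts.show(truncate=False)
```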
These are the predictions with the feature matrix computed by taking the max in the neighbourhood (PR), using the curation for model training:
gs://ot-team/irene/l2g/08112024/locus_to_gene_predictions
Selectivity looks slightly better, with 99,541 credible sets where multiple genes are prioritised:
+----+------+
|hits| count|
+----+------+
| >2| 24192|
| 1|247555|
| 2| 75349|
+----+------+
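For reproducibility, the hits breakdown used throughout this thread can be computed roughly as follows; a sketch assuming a `predictions` dataframe with `studyLocusId`, `geneId` and `score` columns and the l2g>=0.5 threshold (credible sets with zero hits drop out of the table):

```python
from pyspark.sql import functions as f

hits = (
    predictions
    # A gene counts as a hit when its L2G score clears the threshold.
    .filter(f.col("score") >= 0.5)
    .groupBy("studyLocusId")
    .agg(f.count("geneId").alias("n_hits"))
    .withColumn(
        "hits",
        f.when(f.col("n_hits") == 1, f.lit("1"))
        .when(f.col("n_hits") == 2, f.lit("2"))
        .otherwise(f.lit(">2")),
    )
    .groupBy("hits")
    .count()
)
hits.show()
```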
From @addramir:
- 55.4% of loci had a single gene with l2g>0.5
- 1.4% had >=2 genes with l2g>0.5
- 43.2% had 0 genes with l2g>0.5
Another set of predictions, testing two things:
1. A less filtered and much bigger gold standard: based on old L2G hits, where credible sets are picked based on distance rather than on the availability of functional genomics data.
+---------------+------+
|goldStandardSet| count|
+---------------+------+
| positive| 15863|
| negative|332719|
+---------------+------+
Path: gs://genetics-portal-dev-analysis/yt4/20241024_EGL_playground/GSP_cs_selected_freez6_v5.parquet
2. A redefinition of the neighbourhood features to match the old L2G: for every credible set, instead of subtracting the max in the region, we divide by it (see the sketch below).
Path: gs://ot-team/irene/l2g/12112024/feature_matrix
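A sketch of the two neighbourhood definitions being compared, assuming a feature matrix with one row per credible-set/gene pair and a generic `feature` column (the real feature names are the *Neighbourhood columns shown later in this thread):

```python
from pyspark.sql import Window
from pyspark.sql import functions as f

region = Window.partitionBy("studyLocusId")
regional_max = f.max("feature").over(region)

fm = (
    feature_matrix
    # Previous definition: subtract the maximum across the region.
    .withColumn("feature_nbh_subtracted", f.col("feature") - regional_max)
    # New definition, matching the old L2G: divide by the regional maximum.
    .withColumn("feature_nbh_divided", f.col("feature") / regional_max)
)
```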
Predictions are really unselective; we might as well have flipped a coin.
gs://ot-team/irene/l2g/12112024/locus_to_gene_predictions/
+----+------+
|hits| count|
+----+------+
| >2|453793|
| 1| 7812|
| 2| 10113|
+----+------+
IMO we are observing huge overfitting here. I think it'd be a very good idea to plot a learning curve to decide on the training set size and to diagnose how well the model fits.
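A minimal sketch of that diagnostic with scikit-learn, assuming the gold standard feature matrix is available as arrays `X`, `y`; the hyperparameters mirror the training config further down:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import learning_curve

sizes, train_scores, val_scores = learning_curve(
    GradientBoostingClassifier(n_estimators=100, max_depth=5, loss="log_loss"),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 8),
    cv=5,
    scoring="neg_log_loss",
)

plt.plot(sizes, train_scores.mean(axis=1), label="train")
plt.plot(sizes, val_scores.mean(axis=1), label="validation")
plt.xlabel("training set size")
plt.ylabel("negative log loss")
plt.legend()
plt.show()
# A large, persistent gap between the two curves would support the
# overfitting hypothesis; converging curves would argue against it.
```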
This model was trained by @xyg123 based on a selected list of Effector gene pairs.
Model used is here opentargets/locus_to_gene_egl_stringent
Strikingly, the selectivity is exactly the same as the previous results based on a very different gold standard.
+----+------+
|hits| count|
+----+------+
| >2|453793|
| 1| 7812|
| 2| 10113|
+----+------+
The commonality between the two runs is the feature matrix; I'm going to train the model with a previous GS to see whether the new neighbourhood features are the culprit.
This is the config for these predictions:
step: locus_to_gene
step.run_mode: predict
step.predictions_path: gs://genetics-portal-dev-analysis/xg1/Chembl_l2g_goldstandards/Predictions/EGL_predictions_2410_freeze6_stringent
step.feature_matrix_path: gs://ot-team/irene/l2g/12112024/feature_matrix
step.credible_set_path: gs://ot_orchestration/releases/24.10_freeze6/credible_set
step.hf_hub_repo_id: opentargets/locus_to_gene_egl_stringent
step.session.spark_uri: yarn
It looks one-to-one the same. Very suspicious. Also, did you filter by protein-coding genes only?
Following up from my previous comment, and to check whether the problem was coming from the changes in the nbh features, I've rerun the model with the ETL curation and the 12/11 feature matrix. Results look good this time:
# This is all genes
+----+------+
|hits| count|
+----+------+
| >2| 23024|
| 1|250890|
| 2| 81803|
+----+------+
@addramir In fact, these features seem to perform slightly better than the previous approach, where nbh features were subtracted rather than divided.
Again, my hypothesis is that @xyg123's model must have a big overfitting problem.
# config
- id: l2g_train
# prerequisites:
# - l2g_feature_matrix
params:
step: locus_to_gene
step.run_mode: train
step.wandb_run_name: 12/11_curation_1211_fm
step.hf_hub_repo_id: opentargets/locus_to_gene
step.hf_model_commit_message: "chore: model trained with 12/11 fm and curation gs"
step.model_path: gs://ot-team/irene/l2g/12112024-etl/locus_to_gene_model/classifier.skops
step.credible_set_path: gs://ot_orchestration/releases/24.10_freeze6/credible_set
step.variant_index_path: gs://ot_orchestration/releases/24.10_freeze6/variant_index
step.feature_matrix_path: gs://ot-team/irene/l2g/12112024/feature_matrix
step.gold_standard_curation_path: gs://genetics_etl_python_playground/input/l2g/gold_standard/curation.json
step.gene_interactions_path: gs://genetics_etl_python_playground/static_assets/interaction # OTP 23.12 data
step.hyperparameters.n_estimators: 100
step.hyperparameters.max_depth: 5
step.hyperparameters.loss: log_loss
+step.session.extended_spark_conf: "{spark.kryoserializer.buffer.max:500m, spark.sql.autoBroadcastJoinThreshold:'-1'}"
- id: l2g_predict
prerequisites:
- l2g_train
params:
step: locus_to_gene
step.run_mode: predict
step.predictions_path: gs://ot-team/irene/l2g/12112024-etl/locus_to_gene_predictions
step.feature_matrix_path: gs://ot-team/irene/l2g/12112024/feature_matrix
step.credible_set_path: gs://ot_orchestration/releases/24.10_freeze6/credible_set
step.hf_hub_repo_id: opentargets/locus_to_gene
step.session.spark_uri: yarn
New models based on a new feature matrix that includes the changes explained in 12/11_curation_1211_fm AND that only considers protein-coding genes when calculating neighbourhood features (this is what we internally agreed to call feature matrix 0).
Better results in terms of selecting one single hit:
1. We have more credible sets with at least one hit.
2. The increase is not happening at the expense of being less selective.

There are 104,464 credible sets for which we prioritise 2 or more genes. This still represents ~22% of the total data, but we have improved the ratio of credible sets with more than 2 hits.
| hits | count_freeze6 | count_latest_run |
| --- | --- | --- |
| 1 | 236,372 | 283,231 |
| 2 | 80,396 | 86,768 |
| >2 | 25,930 | 17,696 |
(latest run refers to training/predicting based on protein coding genes)
# Another model I did without training/predicting based on protein-coding only also showed better results than yesterday's iteration
+----+------+
|hits| count|
+----+------+
| >2| 13029|
| 1|276153|
| 2| 73333|
+----+------+
Steps config:
- id: l2g_train
# prerequisites:
# - l2g_feature_matrix
params:
step: locus_to_gene
step.run_mode: train
step.wandb_run_name: 13/11_curation_1311_fm_0
step.hf_hub_repo_id: opentargets/locus_to_gene
step.hf_model_commit_message: "chore: model trained with 13/11 fm 0 and curation gs"
step.model_path: gs://ot-team/irene/l2g/13112024/locus_to_gene_model_0/classifier.skops
step.credible_set_path: gs://ot_orchestration/releases/24.10_freeze6/credible_set
step.variant_index_path: gs://ot_orchestration/releases/24.10_freeze6/variant_index
step.feature_matrix_path: gs://ot-team/irene/l2g/13112024/feature_matrix
step.gold_standard_curation_path: gs://genetics_etl_python_playground/input/l2g/gold_standard/curation.json
step.gene_interactions_path: gs://genetics_etl_python_playground/static_assets/interaction # OTP 23.12 data
step.hyperparameters.n_estimators: 100
step.hyperparameters.max_depth: 5
step.hyperparameters.loss: log_loss
+step.session.extended_spark_conf: "{spark.kryoserializer.buffer.max:500m, spark.sql.autoBroadcastJoinThreshold:'-1'}"
- id: l2g_predict
prerequisites:
- l2g_train
params:
step: locus_to_gene
step.run_mode: predict
step.predictions_path: gs://ot-team/irene/l2g/13112024/locus_to_gene_predictions_0
step.feature_matrix_path: gs://ot-team/irene/l2g/13112024/feature_matrix
step.credible_set_path: gs://ot_orchestration/releases/24.10_freeze6/credible_set
step.hf_hub_repo_id: opentargets/locus_to_gene
step.session.spark_uri: yarn
https://wandb.ai/open-targets/gentropy-locus-to-gene/runs/xe52ckzb?nw=nwuseropentargets
New model based on a new feature matrix (feature matrix 1). Here we are testing a different approach for neighbourhood features again: if there is no data for a gene, we set the feature to 1. This follows the same rationale of defining the neighbourhood feature in terms of how similar the feature for that credible set/gene is to the best result among the genes in the credible set: 1 meaning more similar, 0 more dissimilar. Given the sparsity of the data, this is quite a big change.
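One way to read that change, sketched with the same illustrative column names as the earlier neighbourhood snippet (the assumption here is that missing values are imputed after computing the ratio):

```python
from pyspark.sql import Window
from pyspark.sql import functions as f

region = Window.partitionBy("studyLocusId")

fm1 = feature_matrix.withColumn(
    "feature_nbh",
    f.coalesce(
        f.col("feature") / f.max("feature").over(region),
        # No data for the gene -> treat it as indistinguishable from the
        # best gene in the credible set, i.e. maximal similarity.
        f.lit(1.0),
    ),
)
```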
Results are worse than the previous iteration, therefore we are ditching this approach.
+----+------+
|hits| count|
+----+------+
| >2| 20958|
| 1|267721|
| 2| 89665|
+----+------+
This is a bit off-topic, but I've joined the feature matrix of the gold standard used in the production version of Genetics with our best feature matrix to date (gs://ot-team/irene/l2g/13112024/feature_matrix) to try to compare them.
I used the study, variant, and gene IDs to overlap both matrices. Of course I didn't get full coverage, but it's decent enough to check some examples:
+-------+---------------+-----+
|isMatch|goldStandardSet|count|
+-------+---------------+-----+
|   true|              1|  120| --> Positives that overlap
|   true|              0|  255| --> Negatives that overlap
|  false|              0| 4523| --> Negatives that don't overlap
|  false|              1|  232| --> Positives that don't overlap
+-------+---------------+-----+
# isMatch=True means that we have annotation from production + new FM
# goldStandardSet=1 is a positive in the production model, 0 is a negative
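A sketch of the overlap logic, assuming `prod_fm` (the production gold-standard feature matrix) and `new_fm` dataframes that share study, variant and gene identifiers; the full code is in the gist linked below:

```python
from pyspark.sql import functions as f

comparison = prod_fm.join(
    new_fm.withColumn("isMatch", f.lit(True)),
    on=["studyId", "variantId", "geneId"],
    how="left",
).fillna(False, subset=["isMatch"])

# isMatch=True means that we have annotation from production + new FM.
comparison.groupBy("isMatch", "goldStandardSet").count().show()
```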
I checked 2 examples from the isMatch=True/goldStandardSet=1 set, and in general the features look really different, except for VEP. So it's not trivial to compare numbers. One instance:
# The first number corresponds to production, the second to the new FM
studyId | GCST000697
variantId | 11_71456403_G_T
geneId | ENSG00000172893
goldStandardSet | 1
isMatch | true
distanceSentinelTssNeighbourhood | {-0.07058997682092816, 0.9992582}
pQtlColocClppMaximumNeighbourhood | {NULL, 0.0}
pQtlColocH4Maximum | {NULL, 0.0}
sQtlColocClppMaximumNeighbourhood | {NULL, 0.8942154}
distanceSentinelFootprintNeighbourhood | {-8.170751423757535, 0.9987693}
pQtlColocClppMaximum | {NULL, 0.0}
eQtlColocClppMaximum | {-4.412610208756248, 0.017186046}
score | {0.08474060148000717, 0.9262663000832196}
vepMaximumNeighbourhood | {0.15151515151515152, 0.5}
distanceFootprintMean | {9.048126517379467, 0.99753845}
vepMean | {0.00999822290936213, 1.00455785E-4}
sQtlColocH4MaximumNeighbourhood | {NULL, 0.0}
distanceSentinelFootprint | {8.170751423757535, 0.9987693}
sQtlColocH4Maximum | {NULL, 0.0}
distanceTssMeanNeighbourhood | {-0.02001154467391153, 0.99940073}
pQtlColocH4MaximumNeighbourhood | {NULL, 0.0}
vepMeanNeighbourhood | {0.11433660812052467, 0.07359571}
geneCount500kb | {13, 22.0}
eQtlColocClppMaximumNeighbourhood | {-0.5586822723679621, 0.8503973}
distanceSentinelTss | {8.170751423757535, 0.9987693}
sQtlColocClppMaximum | {NULL, 0.01753}
distanceFootprintMeanNeighbourhood | {-2.9054150635869664, 0.9983251}
eQtlColocH4MaximumNeighbourhood | {NULL, 0.0}
vepMaximum | {0.1, 0.33}
distanceTssMean | {9.097634560962991, 0.99748087}
eQtlColocH4Maximum | {NULL, 0.0}
It's interesting to compare scores:
| PRODUCTION | Positives | Negatives |
| --- | --- | --- |
| score < 0.5 | 37 | 232 |
| score >= 0.5 | 83 | 23 |
| total | 120 | 255 |

| NEW | Positives | Negatives |
| --- | --- | --- |
| score < 0.5 | 111 | 197 |
| score >= 0.5 | 9 | 58 |
| total | 120 | 255 |
This broadly matches our impression that our L2G scores are higher than production's. One likely cause is that our feature matrix is richer:
+-------+--------------------+--------------------+
|summary| old_score| new_score|
+-------+--------------------+--------------------+
| count| 375| 375|
| mean| 0.27418680242449045| 0.4754586410401808|
| stddev| 0.3048912971331486| 0.37092390686746796|
| min|0.008100325241684914|0.050166188832806664|
| max| 0.9093976616859436| 0.9940283410709758|
+-------+--------------------+--------------------+
👉 Dataset gs://ot-team/irene/l2g/13112024/feature_matrix_comparison_w_production
Gist: https://gist.github.com/ireneisdoomed/df08850b94cf4553d6c1c3df0a3c0afb
https://wandb.ai/open-targets/gentropy-locus-to-gene/runs/lc5323st?nw=nwuseropentargets
FM0 is used. Training is based on 452 GSPs.
The confusion matrix is rather bad. However, predictions are very selective and conservative.
Predictions:
gs://ot_orchestration/benchmarks/l2g/fm0/gs_v1/locus_to_gene_predictions
l2g>=0.5
+----+------+
|hits| count|
+----+------+
|  >2|  1908|
|   1|240906|
|   2| 17836|
+----+------+
https://wandb.ai/open-targets/gentropy-locus-to-gene/runs/d4gvgarn?nw=nwuseropentargets
FM0 is used. Training is based on 452 GSPs but without selection of CSs.
Confusion matrix is rather good. Predictions are very selective and conservative.
Predictions:
gs://ot_orchestration/benchmarks/l2g/fm0/gs_v4/locus_to_gene_predictions
l2g>=0.5
+----+------+
|hits| count|
+----+------+
|  >2|  1014|
|   1|240951|
|   2| 12794|
+----+------+
https://wandb.ai/open-targets/gentropy-locus-to-gene/runs/ayzj3vtq?nw=nwuseropentargets
FM0 is used. Training is based on whatever is integrated into ETL.
Following up on @ireneisdoomed's numbers. For some reason my numbers are a bit different using the provided predictions (and filtering by isProteinCoding).
Predictions:
gs://ot-team/irene/l2g/13112024/locus_to_gene_predictions_0
When using l2g>=0.5:
+----+------+
|hits| count|
+----+------+
|  >2| 17696|
|   1|283231|
|   2| 86768|
+----+------+
However, using l2g>=0.8 makes everything better:
+----+------+
|hits| count|
+----+------+
|  >2|   658|
|   1|259860|
|   2| 21650|
+----+------+
https://wandb.ai/open-targets/gentropy-locus-to-gene/runs/i3yvd0dj?nw=nwuseropentargets
FM0 is used. Training is based on ~560 GSPs.
The confusion matrix is rather good. Predictions are relatively selective but conservative. So far this is probably the best model, but on a few selected cases it shows relatively conservative results.
Predictions:
gs://ot_orchestration/benchmarks/l2g/fm0/gs_v5/locus_to_gene_predictions
l2g>=0.5
+----+------+
|hits| count|
+----+------+
|  >2|  2308|
|   1|287451|
|   2| 29301|
+----+------+
The table below contains the orchestration tags that can be used to reproduce the locus_to_gene step runs conducted with the various gold standards:
| Training run | orchestration tag |
| --- | --- |
| fm_v0_training_v1 | l2g-fm_v0_gs_v1 |
| fm_v0_training_v2 | l2g-fm_v0_gs_v2 |
| fm_v0_training_v3 | l2g-fm_v0_gs_v3 |
| fm_v0_training_v4 | l2g-fm_v0_gs_v4 |
| fm_v0_training_v5 | l2g-fm_v0_gs_v5 |
To further validate the predictions, here are some numbers comparing the old L2Gs with the new ones.
Assuming these are the last old L2Gs: `gs://genetics-portal-dev-staging/l2g/220712/predictions/l2g.full.220712.parquet`.
The new ones match the models above.
Both datasets were filtered for protein-coding genes only and for L2G>=0.05.
On average the intersection corresponds to ~130-140K credible sets, so we don't really expect more than 130K genes prioritised by both methods at any threshold.
Predictions: gs://ot-team/irene/l2g/13112024/locus_to_gene_predictions_0
Overlap with the old by studyId and geneId: 352,633 rows. Number of unique studyIds: 6,367. Correlation between L2Gs: 0.521.
Using L2G>=0.5 as threshold, the concordance between datasets:
+---------+---------+------+
|new_score|old_score|count |
+---------+---------+------+
|1 |0 |82195 |
|1 |1 |91527 |
|0 |0 |151760|
|0 |1 |27151 |
+---------+---------+------+
82K new genes; 91K in agreement; 27K lost.
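These concordance tables (here and in the comments below) can be reproduced with something along these lines; the dataframe and column names are assumptions:

```python
from pyspark.sql import functions as f

threshold = 0.5  # or 0.8 for the second table

concordance = (
    old_predictions.select("studyId", "geneId", f.col("score").alias("old_l2g"))
    .join(
        new_predictions.select("studyId", "geneId", f.col("score").alias("new_l2g")),
        on=["studyId", "geneId"],
        how="inner",
    )
    # Binarise both scores at the threshold before cross-tabulating.
    .withColumn("old_score", (f.col("old_l2g") >= threshold).cast("int"))
    .withColumn("new_score", (f.col("new_l2g") >= threshold).cast("int"))
    .groupBy("new_score", "old_score")
    .count()
)
concordance.show()
```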
Using L2G>=0.8 as threshold, the concordance between datasets:
+---------+---------+------+
|new_score|old_score|count |
+---------+---------+------+
|1 |0 |91885 |
|1 |1 |22517 |
|0 |0 |230092|
|0 |1 |8139 |
+---------+---------+------+
Even more new genes: 91K. 23K in agreement; 8K lost.
Predictions: gs://ot_orchestration/benchmarks/l2g/fm0/gs_v4/locus_to_gene_predictions
Overlap with the old by studyId and geneId: 305,543 rows. Number of unique studyIds: 6,346. Correlation between L2Gs: 0.477.
Using L2G>=0.5 as threshold, the concordance between datasets:
+---------+---------+------+
|new_score|old_score|count |
+---------+---------+------+
|1 |0 |35738 |
|1 |1 |61477 |
|0 |0 |155334|
|0 |1 |52994 |
+---------+---------+------+
Using L2G>=0.8 as threshold, the concordance between datasets:
+---------+---------+------+
|new_score|old_score|count |
+---------+---------+------+
|1 |0 |22487 |
|1 |1 |9123 |
|0 |0 |252780|
|0 |1 |21153 |
+---------+---------+------+
Predictions: gs://ot_orchestration/benchmarks/l2g/fm0/gs_v5/locus_to_gene_predictions
Overlap with the old by studyId and geneId: 298,412 rows. Number of unique studyIds: 6,360. Correlation between L2Gs: 0.516.
Using L2G>=0.5 as threshold, the concordance between datasets:
+---------+---------+------+
|new_score|old_score|count |
+---------+---------+------+
|1 |0 |48281 |
|1 |1 |79772 |
|0 |0 |136099|
|0 |1 |34260 |
+---------+---------+------+
Using L2G>=0.8 as threshold, the concordance between datasets:
+---------+---------+------+
|new_score|old_score|count |
+---------+---------+------+
|1 |0 |55173 |
|1 |1 |16767 |
|0 |0 |213264|
|0 |1 |13208 |
+---------+---------+------+
Predictions: gs://ot_orchestration/benchmarks/l2g/fm0/gs_v1/locus_to_gene_predictions
Overlap with the old by studyId and geneId: 272,541 rows. Number of unique studyIds: 6,251. Correlation between L2Gs: 0.457.
Using L2G>=0.5 as threshold, the concordance between datasets:
+---------+---------+------+
|new_score|old_score|count |
+---------+---------+------+
|1 |0 |35501 |
|1 |1 |63200 |
|0 |0 |125779|
|0 |1 |48061 |
+---------+---------+------+
Using L2G>=0.8 as threshold, the concordance between datasets:
+---------+---------+------+
|new_score|old_score|count |
+---------+---------+------+
|1 |0 |30012 |
|1 |1 |9836 |
|0 |0 |212540|
|0 |1 |20153 |
+---------+---------+------+
Conclusions:
For ~6K studies we don't really expect too many new genes to be prioritised, so if the difference between how many new genes we gain and how many genes we lose is big, it probably indicates a high FDR.
For all four models, there are too many cases where the model "changes its mind" about prioritisation (+new genes, -old genes). Currently investigating why.
@addramir what about the old L2G > 0.5 and new L2G > 0.8?
@project-defiant calculated this for fm_v0_training_v5. New l2g>0.8, old l2g>0.5:
+---------+---------+------+
|new_score|old_score|count |
+---------+---------+------+
|1 |0 |23429 |
|1 |1 |48511 |
|0 |0 |160951|
|0 |1 |65521 |
+---------+---------+------+
It seems that the score has shifted towards the GS positives.
Some benchmarking of the fm0_v5.1_best_cv model.
Overlap with old L2G: 6,372 studies. Correlation between new and old L2G: 0.587.
L2G>=0.5
+---------+---------+------+
|new_score|old_score|count |
+---------+---------+------+
|1 |0 |52686 |
|1 |1 |91679 |
|0 |0 |127666|
|0 |1 |22447 |
+---------+---------+------+
1 per cs: 345498
2 per cs: 18619
>2 per cs: 2227
As a developer I want to monitor the number of hits in every L2G iteration, because it is an important indicator of the model's selectivity.
Background
The number of genes that receive a score above 0.5 for the same credible set is an interesting metric of the model's selectivity. We observe a negative tendency in the last 3 iterations: the percentage of credible sets with a single prioritised gene has decreased, while the percentage with 2 or more has increased.
This metric is not something that we monitor during training.
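A hedged sketch of what closing that gap could look like: after each training run, compute the hits breakdown (as sketched earlier in the thread) on the predictions and log the proportions to the existing wandb run. The metric names are made up, not an existing gentropy API:

```python
import wandb

# Assumes an active wandb run (wandb.init) from the training step, and the
# `hits` dataframe from the aggregation sketched earlier in this thread.
hit_counts = {row["hits"]: row["count"] for row in hits.collect()}
total = sum(hit_counts.values())

wandb.log({
    "selectivity/pct_single_hit": hit_counts.get("1", 0) / total,
    "selectivity/pct_two_hits": hit_counts.get("2", 0) / total,
    "selectivity/pct_more_than_two_hits": hit_counts.get(">2", 0) / total,
})
```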
Tasks