ireneisdoomed opened this issue 2 weeks ago:
I have prepared a dataset with the cases where more than one gene is assigned to a credible set, together with some quality flags based on feature availability, so that now I have a dataframe like this one:
+--------------------------------+---------------+------------------+----------------------------------------------+
|studyLocusId |geneId |score |qualityControls |
+--------------------------------+---------------+------------------+----------------------------------------------+
|696aa17876038f0b46ac65c5e27d3fa7|ENSG00000075213|0.982434345980689 |[] |
|696aa17876038f0b46ac65c5e27d3fa7|ENSG00000170381|0.7804732926950384|[VEP features missing, Coloc features missing]|
+--------------------------------+---------------+------------------+----------------------------------------------+
Looking at cases like this, it is clear why ENSG00000075213 is the top-ranked gene. It is also clear that, regardless of the iterations we do on the VEP or Colocalisation features to fine-tune them, we are not going to improve the selectivity, because we simply lack any functional genomics data for ENSG00000170381.
My aim was to systematically analyse how often we have this scenario.
I think the most informative table is this one, where I try to understand the source of selectivity issues:
+-----------------------------------------------------------+--------------+------------------------------------------+
|flags |granular_count|general_flag |
+-----------------------------------------------------------+--------------+------------------------------------------+
|[First and incomplete VEP, Second and incomplete VEP] |31002 |Both Incomplete - same group |
|[Second and incomplete VEP, First and complete features] |17318 |Only first complete |
|[Second and incomplete VEP, First and incomplete Coloc] |15491 |Both Incomplete - different groups |
|[Second and complete features, First and complete features]|14189 |Both Complete |
|[Second and incomplete Coloc, First and incomplete Coloc] |7475 |Both Incomplete - same group |
|[First and incomplete VEP, Second and complete features] |5308 |Only second complete |
|[Second and incomplete Coloc, First and complete features] |4895 |Only first complete |
|[Second and incomplete Coloc, First and incomplete VEP] |4597 |Both Incomplete - different groups |
|[Second and complete features, First and incomplete Coloc] |3248 |Only second complete |
|[First and incomplete VEP] |2764 |Genes have same score |
|[First and incomplete Coloc] |32 |Genes have same score |
|[First and complete features] |6 |Genes have same score |
|[First and incomplete Coloc, First and complete features] |1 |Genes have same score - different features|
+-----------------------------------------------------------+--------------+------------------------------------------+
The "Only first complete" category accounts for 22,213 credible sets (17,318 + 4,895 above), where changes in the feature modelling won't yield any benefit because there simply is no data for the runner-up gene. This is the case of the example highlighted above. It'd be good to reproduce this analysis in future iterations. Code is here: https://gist.github.com/ireneisdoomed/31086fe5ef8ddb76dd06d22c54950e2d
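For future reference, a minimal sketch of how these flags can be derived, assuming a `predictions` dataframe shaped like the one above; the flag names are simplified and the exact logic (including the VEP/Coloc split and the `general_flag` mapping) lives in the gist:

```python
from pyspark.sql import Window
from pyspark.sql import functions as f

# Rank genes within each credible set by their L2G score.
w = Window.partitionBy("studyLocusId").orderBy(f.desc("score"))

flag_counts = (
    predictions
    .withColumn("rank", f.row_number().over(w))
    # Only the two top-ranked genes matter for the selectivity analysis.
    .filter(f.col("rank") <= 2)
    .withColumn(
        "flag",
        f.concat(
            f.when(f.col("rank") == 1, f.lit("First")).otherwise(f.lit("Second")),
            # The real flags also split incompleteness into VEP vs Coloc
            # based on the contents of qualityControls.
            f.when(f.size("qualityControls") == 0, f.lit(" and complete features"))
            .otherwise(f.lit(" and incomplete")),
        ),
    )
    .groupBy("studyLocusId")
    .agg(f.collect_set("flag").alias("flags"))
    .groupBy("flags")
    .count()
    .orderBy(f.desc("count"))
)
flag_counts.show(truncate=False)
```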
These are the predictions with the feature matrix computed by taking the max in the neighbourhood (PR), using the curation for model training:
gs://ot-team/irene/l2g/08112024/locus_to_gene_predictions
Selectivity looks slightly better, with 99,541 credible sets where multiple genes are prioritised:
+----+------+
|hits| count|
+----+------+
| >2| 24192|
| 1|247555|
| 2| 75349|
+----+------+
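For reproducibility, the hits breakdown used throughout this thread can be computed roughly as follows; a sketch assuming a `predictions` dataframe with `studyLocusId`, `geneId` and `score` columns and the l2g>=0.5 threshold (credible sets with zero hits drop out of the table):

```python
from pyspark.sql import functions as f

hits = (
    predictions
    # A gene counts as a hit when its L2G score clears the threshold.
    .filter(f.col("score") >= 0.5)
    .groupBy("studyLocusId")
    .agg(f.count("geneId").alias("n_hits"))
    .withColumn(
        "hits",
        f.when(f.col("n_hits") == 1, f.lit("1"))
        .when(f.col("n_hits") == 2, f.lit("2"))
        .otherwise(f.lit(">2")),
    )
    .groupBy("hits")
    .count()
)
hits.show()
```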
From @addramir:
- 55.4% of loci had a single gene with l2g>0.5
- 1.4% had >=2 genes with l2g>0.5
- 43.2% had 0 genes with l2g>0.5
Another set of predictions, testing two things:
1. A less filtered and much bigger gold standard: based on old L2G hits, where credible sets are picked based on distance rather than on the availability of functional genomics data.
+---------------+------+
|goldStandardSet| count|
+---------------+------+
| positive| 15863|
| negative|332719|
+---------------+------+
Path: gs://genetics-portal-dev-analysis/yt4/20241024_EGL_playground/GSP_cs_selected_freez6_v5.parquet
2. A redefinition of the neighbourhood features to match the old L2G: for every credible set, instead of subtracting the max in the region, we divide by it (see the sketch below).
Path: gs://ot-team/irene/l2g/12112024/feature_matrix
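A sketch of the two neighbourhood definitions being compared, assuming a feature matrix with one row per credible-set/gene pair and a generic `feature` column (the real feature names are the *Neighbourhood columns shown later in this thread):

```python
from pyspark.sql import Window
from pyspark.sql import functions as f

region = Window.partitionBy("studyLocusId")
regional_max = f.max("feature").over(region)

fm = (
    feature_matrix
    # Previous definition: subtract the maximum across the region.
    .withColumn("feature_nbh_subtracted", f.col("feature") - regional_max)
    # New definition, matching the old L2G: divide by the regional maximum.
    .withColumn("feature_nbh_divided", f.col("feature") / regional_max)
)
```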
Predictions are really unselective; we might as well have flipped a coin.
gs://ot-team/irene/l2g/12112024/locus_to_gene_predictions/
+----+------+
|hits| count|
+----+------+
| >2|453793|
| 1| 7812|
| 2| 10113|
+----+------+
IMO we are observing huge overfitting here. I think it'd be a very good idea to plot a learning curve to decide on the training set size and to diagnose how well the model fits.
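A minimal sketch of that diagnostic with scikit-learn, assuming the gold standard feature matrix is available as arrays `X`, `y`; the hyperparameters mirror the training config further down:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import learning_curve

sizes, train_scores, val_scores = learning_curve(
    GradientBoostingClassifier(n_estimators=100, max_depth=5, loss="log_loss"),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 8),
    cv=5,
    scoring="neg_log_loss",
)

plt.plot(sizes, train_scores.mean(axis=1), label="train")
plt.plot(sizes, val_scores.mean(axis=1), label="validation")
plt.xlabel("training set size")
plt.ylabel("negative log loss")
plt.legend()
plt.show()
# A large, persistent gap between the two curves would support the
# overfitting hypothesis; converging curves would argue against it.
```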
This model was trained by @xyg123 based on a selected list of Effector gene pairs.
Model used is here opentargets/locus_to_gene_egl_stringent
Strikingly, the selectivity is exactly the same as the previous results based on a very different gold standard.
+----+------+
|hits| count|
+----+------+
| >2|453793|
| 1| 7812|
| 2| 10113|
+----+------+
The commonality between the two runs is the feature matrix; I'm going to train the model with a previous GS to see whether the new neighbourhood features are the culprit.
This is the config for these predictions:
step: locus_to_gene
step.run_mode: predict
step.predictions_path: gs://genetics-portal-dev-analysis/xg1/Chembl_l2g_goldstandards/Predictions/EGL_predictions_2410_freeze6_stringent
step.feature_matrix_path: gs://ot-team/irene/l2g/12112024/feature_matrix
step.credible_set_path: gs://ot_orchestration/releases/24.10_freeze6/credible_set
step.hf_hub_repo_id: opentargets/locus_to_gene_egl_stringent
step.session.spark_uri: yarn
It looks one-to-one the same. Very suspicious. Also, did you filter by protein-coding genes only?
Following up from my previous comment, and to check whether the problem was coming from the changes in the nbh features, I've rerun the model with the ETL curation and the 12/11 feature matrix. Results look good this time:
# This is all genes
+----+------+
|hits| count|
+----+------+
| >2| 23024|
| 1|250890|
| 2| 81803|
+----+------+
@addramir In fact, these features seem to perform slightly better than the previous approach, where nbh features were subtracted rather than divided.
Again, my hypothesis is that @xyg123's model must have a big overfitting problem.
# config
- id: l2g_train
# prerequisites:
# - l2g_feature_matrix
params:
step: locus_to_gene
step.run_mode: train
step.wandb_run_name: 12/11_curation_1211_fm
step.hf_hub_repo_id: opentargets/locus_to_gene
step.hf_model_commit_message: "chore: model trained with 12/11 fm and curation gs"
step.model_path: gs://ot-team/irene/l2g/12112024-etl/locus_to_gene_model/classifier.skops
step.credible_set_path: gs://ot_orchestration/releases/24.10_freeze6/credible_set
step.variant_index_path: gs://ot_orchestration/releases/24.10_freeze6/variant_index
step.feature_matrix_path: gs://ot-team/irene/l2g/12112024/feature_matrix
step.gold_standard_curation_path: gs://genetics_etl_python_playground/input/l2g/gold_standard/curation.json
step.gene_interactions_path: gs://genetics_etl_python_playground/static_assets/interaction # OTP 23.12 data
step.hyperparameters.n_estimators: 100
step.hyperparameters.max_depth: 5
step.hyperparameters.loss: log_loss
+step.session.extended_spark_conf: "{spark.kryoserializer.buffer.max:500m, spark.sql.autoBroadcastJoinThreshold:'-1'}"
- id: l2g_predict
prerequisites:
- l2g_train
params:
step: locus_to_gene
step.run_mode: predict
step.predictions_path: gs://ot-team/irene/l2g/12112024-etl/locus_to_gene_predictions
step.feature_matrix_path: gs://ot-team/irene/l2g/12112024/feature_matrix
step.credible_set_path: gs://ot_orchestration/releases/24.10_freeze6/credible_set
step.hf_hub_repo_id: opentargets/locus_to_gene
step.session.spark_uri: yarn
New models based on a new feature matrix that includes the changes explained in 12/11_curation_1211_fm AND that only considers protein-coding genes when calculating neighbourhood features (this is what we internally agreed to call feature matrix 0).
Better results in terms of selecting one single hit:
1. We have more credible sets with at least one hit.
2. The increase is not happening at the expense of being less selective.

There are 104,464 credible sets for which we prioritise 2 or more genes. This still represents ~22% of the total data, but we have improved the ratio of credible sets with more than 2 hits.
| hits | count_freeze6 | count_latest_run |
| --- | --- | --- |
| 1 | 236,372 | 283,231 |
| 2 | 80,396 | 86,768 |
| >2 | 25,930 | 17,696 |
(latest run refers to training/predicting based on protein coding genes)
# Another model I did without training/predicting based on protein-coding only also showed better results than yesterday's iteration
+----+------+
|hits| count|
+----+------+
| >2| 13029|
| 1|276153|
| 2| 73333|
+----+------+
Steps config:
- id: l2g_train
# prerequisites:
# - l2g_feature_matrix
params:
step: locus_to_gene
step.run_mode: train
step.wandb_run_name: 13/11_curation_1311_fm_0
step.hf_hub_repo_id: opentargets/locus_to_gene
step.hf_model_commit_message: "chore: model trained with 13/11 fm 0 and curation gs"
step.model_path: gs://ot-team/irene/l2g/13112024/locus_to_gene_model_0/classifier.skops
step.credible_set_path: gs://ot_orchestration/releases/24.10_freeze6/credible_set
step.variant_index_path: gs://ot_orchestration/releases/24.10_freeze6/variant_index
step.feature_matrix_path: gs://ot-team/irene/l2g/13112024/feature_matrix
step.gold_standard_curation_path: gs://genetics_etl_python_playground/input/l2g/gold_standard/curation.json
step.gene_interactions_path: gs://genetics_etl_python_playground/static_assets/interaction # OTP 23.12 data
step.hyperparameters.n_estimators: 100
step.hyperparameters.max_depth: 5
step.hyperparameters.loss: log_loss
+step.session.extended_spark_conf: "{spark.kryoserializer.buffer.max:500m, spark.sql.autoBroadcastJoinThreshold:'-1'}"
- id: l2g_predict
prerequisites:
- l2g_train
params:
step: locus_to_gene
step.run_mode: predict
step.predictions_path: gs://ot-team/irene/l2g/13112024/locus_to_gene_predictions_0
step.feature_matrix_path: gs://ot-team/irene/l2g/13112024/feature_matrix
step.credible_set_path: gs://ot_orchestration/releases/24.10_freeze6/credible_set
step.hf_hub_repo_id: opentargets/locus_to_gene
step.session.spark_uri: yarn
https://wandb.ai/open-targets/gentropy-locus-to-gene/runs/xe52ckzb?nw=nwuseropentargets
New model based on a new feature matrix (feature matrix 1). Here we are testing a different approach for neighbourhood features again: if there is no data for a gene, we set the feature to 1. This follows the same rationale of defining the neighbourhood feature in terms of how similar the feature for that credible set/gene is to the best result among the genes in the credible set: 1 meaning more similar, 0 more dissimilar. Given the sparsity of the data, this is quite a big change.
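One way to read that change, sketched with the same illustrative column names as the earlier neighbourhood snippet (the assumption here is that missing values are imputed after computing the ratio):

```python
from pyspark.sql import Window
from pyspark.sql import functions as f

region = Window.partitionBy("studyLocusId")

fm1 = feature_matrix.withColumn(
    "feature_nbh",
    f.coalesce(
        f.col("feature") / f.max("feature").over(region),
        # No data for the gene -> treat it as indistinguishable from the
        # best gene in the credible set, i.e. maximal similarity.
        f.lit(1.0),
    ),
)
```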
Results are worse than the previous iteration, therefore we are ditching this approach.
+----+------+
|hits| count|
+----+------+
| >2| 20958|
| 1|267721|
| 2| 89665|
+----+------+
This is a bit off-topic, but I've joined the feature matrix of the gold standard used in the production version of Genetics with our best feature matrix to date (gs://ot-team/irene/l2g/13112024/feature_matrix) to try to compare them.
I used the study, variant, and gene IDs to overlap both matrices. Of course I didn't get full coverage, but it's decent enough to check some examples:
+-------+---------------+-----+
|isMatch|goldStandardSet|count|
+-------+---------------+-----+
|   true|              1|  120| --> Positives that overlap
|   true|              0|  255| --> Negatives that overlap
|  false|              0| 4523| --> Negatives that don't overlap
|  false|              1|  232| --> Positives that don't overlap
+-------+---------------+-----+
# isMatch=True means that we have annotation from production + new FM
# goldStandardSet=1 is a positive in the production model, 0 is a negative
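A sketch of the overlap logic, assuming `prod_fm` (the production gold-standard feature matrix) and `new_fm` dataframes that share study, variant and gene identifiers; the full code is in the gist linked below:

```python
from pyspark.sql import functions as f

comparison = prod_fm.join(
    new_fm.withColumn("isMatch", f.lit(True)),
    on=["studyId", "variantId", "geneId"],
    how="left",
).fillna(False, subset=["isMatch"])

# isMatch=True means that we have annotation from production + new FM.
comparison.groupBy("isMatch", "goldStandardSet").count().show()
```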
I checked 2 examples from the isMatch=True/goldStandardSet=1 set, and in general the features look really different, except for VEP. So it's not trivial to compare numbers. One instance:
# The first number corresponds to production, the second to the new FM
studyId | GCST000697
variantId | 11_71456403_G_T
geneId | ENSG00000172893
goldStandardSet | 1
isMatch | true
distanceSentinelTssNeighbourhood | {-0.07058997682092816, 0.9992582}
pQtlColocClppMaximumNeighbourhood | {NULL, 0.0}
pQtlColocH4Maximum | {NULL, 0.0}
sQtlColocClppMaximumNeighbourhood | {NULL, 0.8942154}
distanceSentinelFootprintNeighbourhood | {-8.170751423757535, 0.9987693}
pQtlColocClppMaximum | {NULL, 0.0}
eQtlColocClppMaximum | {-4.412610208756248, 0.017186046}
score | {0.08474060148000717, 0.9262663000832196}
vepMaximumNeighbourhood | {0.15151515151515152, 0.5}
distanceFootprintMean | {9.048126517379467, 0.99753845}
vepMean | {0.00999822290936213, 1.00455785E-4}
sQtlColocH4MaximumNeighbourhood | {NULL, 0.0}
distanceSentinelFootprint | {8.170751423757535, 0.9987693}
sQtlColocH4Maximum | {NULL, 0.0}
distanceTssMeanNeighbourhood | {-0.02001154467391153, 0.99940073}
pQtlColocH4MaximumNeighbourhood | {NULL, 0.0}
vepMeanNeighbourhood | {0.11433660812052467, 0.07359571}
geneCount500kb | {13, 22.0}
eQtlColocClppMaximumNeighbourhood | {-0.5586822723679621, 0.8503973}
distanceSentinelTss | {8.170751423757535, 0.9987693}
sQtlColocClppMaximum | {NULL, 0.01753}
distanceFootprintMeanNeighbourhood | {-2.9054150635869664, 0.9983251}
eQtlColocH4MaximumNeighbourhood | {NULL, 0.0}
vepMaximum | {0.1, 0.33}
distanceTssMean | {9.097634560962991, 0.99748087}
eQtlColocH4Maximum | {NULL, 0.0}
It's interesting to compare scores:
| PRODUCTION | Positives | Negatives |
| --- | --- | --- |
| score < 0.5 | 37 | 232 |
| score >= 0.5 | 83 | 23 |
| total | 120 | 255 |

| NEW | Positives | Negatives |
| --- | --- | --- |
| score < 0.5 | 111 | 197 |
| score >= 0.5 | 9 | 58 |
| total | 120 | 255 |
This broadly matches our impression that our L2G scores are higher than production's. One likely cause is that our feature matrix is richer:
+-------+--------------------+--------------------+
|summary| old_score| new_score|
+-------+--------------------+--------------------+
| count| 375| 375|
| mean| 0.27418680242449045| 0.4754586410401808|
| stddev| 0.3048912971331486| 0.37092390686746796|
| min|0.008100325241684914|0.050166188832806664|
| max| 0.9093976616859436| 0.9940283410709758|
+-------+--------------------+--------------------+
👉 Dataset gs://ot-team/irene/l2g/13112024/feature_matrix_comparison_w_production
Gist: https://gist.github.com/ireneisdoomed/df08850b94cf4553d6c1c3df0a3c0afb
https://wandb.ai/open-targets/gentropy-locus-to-gene/runs/lc5323st?nw=nwuseropentargets
FM0 is used. Training is based on 452 GSPs.
The confusion matrix is rather bad. However, predictions are very selective and conservative.
Predictions:
gs://ot_orchestration/benchmarks/l2g/fm0/gs_v1/locus_to_gene_predictions
l2g>=0.5
+----+------+
|hits| count|
+----+------+
|  >2|  1908|
|   1|240906|
|   2| 17836|
+----+------+
https://wandb.ai/open-targets/gentropy-locus-to-gene/runs/d4gvgarn?nw=nwuseropentargets
FM0 is used. Training is based on 452 GSPs but without selection of CSs.
Confusion matrix is rather good. Predictions are very selective and conservative.
Predictions:
gs://ot_orchestration/benchmarks/l2g/fm0/gs_v4/locus_to_gene_predictions
l2g>=0.5
+----+------+
|hits| count|
+----+------+
|  >2|  1014|
|   1|240951|
|   2| 12794|
+----+------+
https://wandb.ai/open-targets/gentropy-locus-to-gene/runs/ayzj3vtq?nw=nwuseropentargets
FM0 is used. Training is based on whatever is integrated into ETL.
Following up on @ireneisdoomed's numbers. For some reason my numbers are a bit different using the provided predictions (and filtering by isProteinCoding).
Predictions:
gs://ot-team/irene/l2g/13112024/locus_to_gene_predictions_0
When using l2g>=0.5:
+----+------+
|hits| count|
+----+------+
|  >2| 17696|
|   1|283231|
|   2| 86768|
+----+------+
However, using l2g>=0.8 makes everything better:
+----+------+
|hits| count|
+----+------+
|  >2|   658|
|   1|259860|
|   2| 21650|
+----+------+
https://wandb.ai/open-targets/gentropy-locus-to-gene/runs/i3yvd0dj?nw=nwuseropentargets
FM0 is used. Training is based on ~560 GSPs.
The confusion matrix is rather good. Predictions are relatively selective but conservative. So far this is probably the best model, but on a few selected cases it shows relatively conservative results.
Predictions:
gs://ot_orchestration/benchmarks/l2g/fm0/gs_v5/locus_to_gene_predictions
l2g>=0.5
+----+------+
|hits| count|
+----+------+
|  >2|  2308|
|   1|287451|
|   2| 29301|
+----+------+
The table below contains the orchestration tags that can be used to reproduce the locus_to_gene step runs conducted with the various gold standards:
| Training run | orchestration tag |
| --- | --- |
| fm_v0_training_v1 | l2g-fm_v0_gs_v1 |
| fm_v0_training_v2 | l2g-fm_v0_gs_v2 |
| fm_v0_training_v3 | l2g-fm_v0_gs_v3 |
| fm_v0_training_v4 | l2g-fm_v0_gs_v4 |
| fm_v0_training_v5 | l2g-fm_v0_gs_v5 |
To further validate the predictions, here are some numbers comparing the old L2Gs with the new ones.
Assuming these are the last old L2Gs: `gs://genetics-portal-dev-staging/l2g/220712/predictions/l2g.full.220712.parquet`.
The new ones match the models above.
Both datasets were filtered for protein-coding genes only and for L2G>=0.05.
On average the intersection corresponds to ~130-140K credible sets, so we don't really expect more than 130K genes prioritised by both methods at any threshold.
Predictions: gs://ot-team/irene/l2g/13112024/locus_to_gene_predictions_0
Overlap with the old by studyId and geneId: 352,633 rows. Number of unique studyIds: 6,367. Correlation between L2Gs: 0.521.
Using L2G>=0.5 as threshold, the concordance between datasets:
+---------+---------+------+
|new_score|old_score|count |
+---------+---------+------+
|1 |0 |82195 |
|1 |1 |91527 |
|0 |0 |151760|
|0 |1 |27151 |
+---------+---------+------+
82K new genes; 91K in agreement; 27K lost.
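These concordance tables (here and in the comments below) can be reproduced with something along these lines; the dataframe and column names are assumptions:

```python
from pyspark.sql import functions as f

threshold = 0.5  # or 0.8 for the second table

concordance = (
    old_predictions.select("studyId", "geneId", f.col("score").alias("old_l2g"))
    .join(
        new_predictions.select("studyId", "geneId", f.col("score").alias("new_l2g")),
        on=["studyId", "geneId"],
        how="inner",
    )
    # Binarise both scores at the threshold before cross-tabulating.
    .withColumn("old_score", (f.col("old_l2g") >= threshold).cast("int"))
    .withColumn("new_score", (f.col("new_l2g") >= threshold).cast("int"))
    .groupBy("new_score", "old_score")
    .count()
)
concordance.show()
```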
Using L2G>=0.8 as threshold, the concordance between datasets:
+---------+---------+------+
|new_score|old_score|count |
+---------+---------+------+
|1 |0 |91885 |
|1 |1 |22517 |
|0 |0 |230092|
|0 |1 |8139 |
+---------+---------+------+
Even more new genes: 91K. 23K in agreement; 8K lost.
Predictions: gs://ot_orchestration/benchmarks/l2g/fm0/gs_v4/locus_to_gene_predictions
Overlap with the old by studyId and geneId: 305,543 rows. Number of unique studyIds: 6,346. Correlation between L2Gs: 0.477.
Using L2G>=0.5 as threshold, the concordance between datasets:
+---------+---------+------+
|new_score|old_score|count |
+---------+---------+------+
|1 |0 |35738 |
|1 |1 |61477 |
|0 |0 |155334|
|0 |1 |52994 |
+---------+---------+------+
Using L2G>=0.8 as threshold, the concordance between datasets:
+---------+---------+------+
|new_score|old_score|count |
+---------+---------+------+
|1 |0 |22487 |
|1 |1 |9123 |
|0 |0 |252780|
|0 |1 |21153 |
+---------+---------+------+
Predictions: gs://ot_orchestration/benchmarks/l2g/fm0/gs_v5/locus_to_gene_predictions
Overlap with the old by studyId and geneId: 298,412 rows. Number of unique studyIds: 6,360. Correlation between L2Gs: 0.516.
Using L2G>=0.5 as threshold, the concordance between datasets:
+---------+---------+------+
|new_score|old_score|count |
+---------+---------+------+
|1 |0 |48281 |
|1 |1 |79772 |
|0 |0 |136099|
|0 |1 |34260 |
+---------+---------+------+
Using L2G>=0.8 as threshold, the concordance between datasets:
+---------+---------+------+
|new_score|old_score|count |
+---------+---------+------+
|1 |0 |55173 |
|1 |1 |16767 |
|0 |0 |213264|
|0 |1 |13208 |
+---------+---------+------+
Predictions: gs://ot_orchestration/benchmarks/l2g/fm0/gs_v1/locus_to_gene_predictions
Overlap with the old by studyId and geneId: 272,541 rows. Number of unique studyIds: 6,251. Correlation between L2Gs: 0.457.
Using L2G>=0.5 as threshold, the concordance between datasets:
+---------+---------+------+
|new_score|old_score|count |
+---------+---------+------+
|1 |0 |35501 |
|1 |1 |63200 |
|0 |0 |125779|
|0 |1 |48061 |
+---------+---------+------+
Using L2G>=0.8 as threshold, the concordance between datasets:
+---------+---------+------+
|new_score|old_score|count |
+---------+---------+------+
|1 |0 |30012 |
|1 |1 |9836 |
|0 |0 |212540|
|0 |1 |20153 |
+---------+---------+------+
Conclusions:
For ~6K studies we don't really expect too many new genes to be prioritised, so if the difference between how many new genes we gain and how many genes we lose is big, it probably indicates a high FDR.
For all four models, there are too many cases where the model "changes its mind" about prioritisation (+new genes, -old genes). Currently investigating why.
@addramir what about the old L2G > 0.5 and new L2G > 0.8?
@project-defiant calculated this for fm_v0_training_v5. New l2g>0.8, old l2g>0.5:
+---------+---------+------+
|new_score|old_score|count |
+---------+---------+------+
|1 |0 |23429 |
|1 |1 |48511 |
|0 |0 |160951|
|0 |1 |65521 |
+---------+---------+------+
It seems that the score has shifted towards the GS positives.
Some benchmarking of the fm0_v5.1_best_cv model.
Overlap with old L2G: 6,372 studies. Correlation between new and old L2G: 0.587.
L2G>=0.5
+---------+---------+------+
|new_score|old_score|count |
+---------+---------+------+
|1 |0 |52686 |
|1 |1 |91679 |
|0 |0 |127666|
|0 |1 |22447 |
+---------+---------+------+
1 per cs: 345498
2 per cs: 18619
>2 per cs: 2227
As a developer I want to monitor the number of hits in every L2G iteration, because it is an important indicator of the model's selectivity.
Background
The number of genes that receive a score above 0.5 for the same credible set is an interesting metric of the model's selectivity. We observe a negative tendency in the last 3 iterations: the percentage of credible sets with a single prioritised gene has decreased, while the percentage with 2 or more has increased.
This metric is not something that we monitor during training.
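A hedged sketch of what closing that gap could look like: after each training run, compute the hits breakdown (as sketched earlier in the thread) on the predictions and log the proportions to the existing wandb run. The metric names are made up, not an existing gentropy API:

```python
import wandb

# Assumes an active wandb run (wandb.init) from the training step, and the
# `hits` dataframe from the aggregation sketched earlier in this thread.
hit_counts = {row["hits"]: row["count"] for row in hits.collect()}
total = sum(hit_counts.values())

wandb.log({
    "selectivity/pct_single_hit": hit_counts.get("1", 0) / total,
    "selectivity/pct_two_hits": hit_counts.get("2", 0) / total,
    "selectivity/pct_more_than_two_hits": hit_counts.get(">2", 0) / total,
})
```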
Tasks