decide how to compute "g2v score" from new v2g table

edm1 commented 6 years ago

My proposed scoring system is here https://github.com/opentargets/v2g_data/tree/master/v2g_score_prototype

We need to look at implementing this in SQL.

The above will only calculate per variant assignments (v2g score). We also want an additional layer that, for a given index variant, aggregates over tag variants to give a locus-to-gene assignment (l2g score).

iandunham commented 6 years ago

Can you write some pseudocode of the scoring logic for me to look at?

edm1 commented 6 years ago

Our cis-regulatory data is structured as:

gs://genetics-portal-staging/v2g/{data type}/{experiment type}/{source}/{version}/{tissue or cell line}/{chroms}.processed.tsv.gz

E.g. here's one tissue from each of our data sources

gs://genetics-portal-staging/v2g/interval/dhscor/thurman2012/180703/unspecified/1-23.processed.tsv.gz
gs://genetics-portal-staging/v2g/interval/fantom5/andersson2014/180703/unspecified/1-23.processed.tsv.gz
gs://genetics-portal-staging/v2g/interval/pchic/javierre2016/180703/Endothelial_precursors/1-23.processed.tsv.gz
gs://genetics-portal-staging/v2g/qtl/eqtl/gtex_v7/180703/Adipose_Subcutaneous/1-23.cis_reg.processed.tsv.gz
gs://genetics-portal-staging/v2g/qtl/pqtl/sun2018/180703/Blood_plasma/1-22.pval2.5e-05.cis_reg.processed.tsv.gz

We would like to maintain our ability to quickly add new datasets without changing the scoring algorithm too much (except maybe adding new weights). We therefore want something generalisable and non-parametric, as scores between the different experiment types are not necessarily comparable.

# Pseudocode

<deleted>

See here for latest pseudocode: opentargets/issues#2148

As mentioned above, we may actually want to calculate a 2nd score (locus-to-gene) which will be used for the main platform (as well as places in the genetics portal). In this, for a given index variant we will aggregate over tag variants in the LD or finemapping set using another weighted mean.

elipapa commented 6 years ago

i have been working on translating this plan in sql. a few things worth pointing out:

the goal is to calculate the score locally and dynamically, not precompute it for all variant-2-gene combos
when we translate this into sql there is probably no need to take (1- pvalue), we just sort in the opposite way.
the quantile transform currently used by @edm1 may be complicated to translate, but we can calculate quantiles with clickhouse and bigquery approximations and assign each value in the buckets ( as integers)

iandunham commented 6 years ago

OK this looks good. The one thing we might want to consider is protein coding change consequences and whether these should have a preferential higher priority. This was the idea in the other OT score. Otherwise you will have occurrences where there is a protein coding chane in one gene being assigned to another gene perhaps fro technical reasons e.g. there is a nearby intronic enhancer, and the functional genomics data captures the exon because of a restriction enzyme cut, or similar.

edm1 commented 6 years ago

Would the solution just be a matter of re-jigging the consequence scores outlined here?

In VEP, the consequences are assigned at the transcript level, which we have then mapped to gene. We haven't filtered out on gene-type at all.

Can one consequence term be found in both a protein-coding and non-protein-coding gene, meaning we would have to take into account the gene-type as well as the consequence term?

iandunham commented 6 years ago

VEP has a concept of most_severe consequence (run vep --most_severe) which is what we use, and this outputs the most severe consequence per variant (i.e. across all transcripts). So thisis what you want to get - I thought it was being stored.

In terms of the process we should make a determination of which consequences are slam dunks as the most likely causal from the list. This is what the previous scoring tried to do, by setting a threshold on the consequence scores below which the functional genomics data came into play. This is also a good reason for picking the most severe consequence as this is most likely to be causal and implicate the gene it lies in.

Not sure what your other question means, do you mean can the same consequence occur across transcript types? Yes, in some cases. Obviously protein coding changes can't occur in non-coding transcripts, but the others all can. However many of the transcripts that are annotated are incomplete, so sometimes hard to say what the relevance of the consequence is.

elipapa commented 6 years ago

we are planning to include gene_type in the next iteration of the v2g table, which will allow us to determine if we have cases in which severe consequences land on non-coding genes.

To clarify, we store all VEP consequences (and the gene each maps to, through the transcript), not the output of --most_severe, because we grab the data from the dump and not the tool/api. However, from the VEP docs:

--most_severe | Output only the most severe consequence per variant. Transcript-specific columns will be left blank. Consequence ranks are given in this table.To include regulatory consequences, use the --regulatory option in combination with this flag.Not used by default

we can tell that --most_severe outputs the HIGH consequences from this table

Our options: 1) use the scoring you have been using in postgap and that @edm1 pointed to above. 2) score the consequences as 1 to 4 from HIGH to MODIFIER as in the consequence table from the original VEP 3) start with 2 and then further refine from HIGH to HIGH and SLAMDUNKS

iandunham commented 6 years ago

Let's compare the scoring that's in the postgap score with the VEP Impacts - I think these were developed in parallel. It looks to me like we can go with either just HIGH as being the ones where VEP dominates, or HIGH plus MODERATE

elipapa commented 6 years ago

thought: would the fact that a qtl is found in a single tissue make the connection stronger, rather than weaker? in that case we might want to invert the ranking based on tissue count

iandunham commented 6 years ago

If it's the "right" tissue. Ideally you'd want to know whether the tissue profile fits with the tissue profile for the disease. For instance if you were looking at height and there was just an eqtl in liver, say, you might not be so interested, but if it was NASH then you would.

Are we rewarding more eqtls across more tissues? Might be better to reward the min p value across all tissues

edm1 commented 6 years ago

We currently calculate two features for the QTLs: max score (which equates to min p-value across tissues), and 'number of tissues' found in. Personally, I don't think the 'number of tissues' is very useful as a predictor. It depends entirely on the architecture of each disease, and so is not generalisable.

iandunham commented 6 years ago

I agree - max score is the useful one. Later if we could identify the expectation for the disease in terms of tissues we could try to match that up with the eQTL tissues

edm1 commented 6 years ago

@mkarmona here is the updated pseudocode taking into account the above discussion

# Pseudocode v2

This is assuming we've fixed for a single variant. In practice, we'll calculate for all variants in a region.

## Preprocessing (applied across whole dataset)
- Create a score column
  * QTL datasets: 1 - pvalue
  * Interval datasets: interval score
  * VEP map to v2g_score column in gs://genetics-portal-data/lut/vep_consequences.tsv
- Group by (source, tissue), transform into quantiles (I suggest 10 bins [deciles] but this is flexible)

## Aggregate across tissues (per source)
- Group by (source, gene)
- Calculate max score across tissues
- This gives a score per gene for each source

## Aggregate across sources (per variant)
- Group by (gene)
- Caluclate weighted mean over sources using source weights (see below)
- This gives a score per gene for a given variant

source_weights = {
    'vep': 1,
    'javierre2016': 0.33,
    'andersson2014': 0.33,
    'thurman2012': 0.33,
    'gtex_v7': 0.66,
    'sun2018': 0.66
}

Python script here: https://github.com/opentargets/v2g_data/blob/master/v2g_score_prototype/scripts/score_v2g_v2.py

mkarmona commented 6 years ago

This is a temporal quick hack as I didnt properly format vep so I am not taking it into account at this moment

create materialized view ot.v2g_quantiles engine=Memory populate as select source_id, feature,quantilesIf(0.10, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)(1 - qtl_pval, qtl_pval> 0) as qtl_quantiles,quantilesIf(0.10, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)(interval_score, qtl_pval = 0) as interval_quantiles, minIf(qtl_pval, qtl_pval > 0) as min_qtl, minIf(interval_score, qtl_pval = 0) as min_interval, maxIf(qtl_pval, qtl_pval > 0) as max_qtl, maxIf(interval_score, qtl_pval = 0) as max_interval from ot.v2g where source_id <> 'vep' group by source_id, feature

To continue with next step we must fix a variant, so by example 16_53800954_T_C which is the FTO example

mkarmona commented 6 years ago

WIP

select *from (select source_id, gene_id, feature, qtl_score, interval_score, qtl_quantiles, interval_quantiles from ( select source_id, gene_id, feature, maxIf(1 - qtl_pval, qtl_pval > 0) as qtl_score , maxIf(interval_score, qtl_pval = 0) as interval_score from ot.v2g prewhere chr_id = '16' and position = 53800954 and variant_id = '16_53800954_T_C' and source_id <> 'vep' group by source_id, gene_id, feature) all inner join (select source_id, feature, qtl_quantiles, interval_quantiles from ot.v2g_quantiles) using source_id, feature) array join interval_quantiles, qtl_quantiles, arrayEnumerate(interval_quantiles) as interval_quantiles_pos

pretty printed

SELECT *                                                                                                                                                                                                                              
FROM                                                                                                                                                                                                                                  
(                                                                                                                                                                                                                                     
    SELECT                                                                                                                                                                                                                            
        source_id,                                                                                                                                                                                                                    
        gene_id,                                                                                                                                                                                                                      
        feature,                                                                                                                                                                                                                      
        qtl_score,                                                                                                                                                                                                                    
        interval_score,                                                                                                                                                                                                               
        qtl_quantiles,                                                                                                                                                                                                                
        interval_quantiles                                                                                                                                                                                                            
    FROM                                                                                                                                                                                                                              
    (                                                                                                                                                                                                                                 
        SELECT                                                                                                                                                                                                                        
            source_id,                                                                                                                                                                                                                
            gene_id,                                                                                                                                                                                                                  
            feature,                                                                                                                                                                                                                  
            maxIf(1 - qtl_pval, qtl_pval > 0) AS qtl_score,                                                                                                                                                                           
            maxIf(interval_score, qtl_pval = 0) AS interval_score                                                                                                                                                                     
        FROM ot.v2g                                                                                                                                                                                                                   
        PREWHERE (chr_id = '16') AND (position = 53800954) AND (variant_id = '16_53800954_T_C') AND (source_id != 'vep')                                                                                                              
        GROUP BY                                                                                                                                                                                                                      
            source_id,                                                                                                                                                                                                                
            gene_id,                                                                                                                                                                                                                  
            feature                                                                                                                                                                                                                   
    )                                                                                                                                                                                                                                 
    ALL INNER JOIN                                                                                                                                                                                                                    
    (                                                                                                                                                                                                                                 
        SELECT                                                                                                                                                                                                                        
            source_id,
            feature,
            qtl_quantiles,
            interval_quantiles
        FROM ot.v2g_quantiles
    ) USING (source_id, feature)
)
ARRAY JOIN 
    interval_quantiles,
    qtl_quantiles,
    arrayEnumerate(interval_quantiles) AS interval_quantiles_pos

mkarmona commented 6 years ago

a new drafted query with precomputed percentiles with a fixed variant 16_53800954_T_C

SELECT 
    gene_id,
    source_id,
    ifNull(max(interval_score_q) * 1., 0.) AS mi,
    ifNull(max(qtl_score_q) * 1., 0.) AS mq,
    ifNull(max(fpred_max_score) * 1., 0.) AS mf,
    ((mi + mq) + mf) / 3 AS score
FROM ot.v2g
PREWHERE (chr_id = '16') AND (position = 53800954) AND (ref_allele = 'T') AND (alt_allele = 'C')                                                              
GROUP BY 
    gene_id,
    source_id
ORDER BY score DESC

┌─gene_id─────────┬─source_id────┬──mi─┬──mq─┬─mf─┬───────────────score─┐
│ ENSG00000176842 │ javierre2016 │ 0.7 │   0 │  0 │  0.2333333333333333 │
│ ENSG00000140718 │ gtex_v7      │   0 │ 0.2 │  0 │ 0.06666666666666667 │
│ ENSG00000140718 │ vep          │   0 │   0 │  0 │                   0 │
└─────────────────┴──────────────┴─────┴─────┴────┴─────────────────────┘

3 rows in set. Elapsed: 0.009 sec. Processed 73.73 thousand rows, 2.59 MB (8.14 million rows/s., 286.10 MB/s.)

mkarmona commented 6 years ago

one of the problems associated with the defined way to score is the triple aggregation: across tissues, then across sources and everything across genes

mkarmona commented 6 years ago

Temporally solved as

create table if not exists ot.v2g_score_by_source
engine MergeTree partition by (source_id, chr_id) order by (variant_id, gene_id)
as select
  chr_id,
  variant_id,
  gene_id,
  source_id,
  max(ifNull(qtl_score_q, 0.)) AS max_qtl,
  max(ifNull(interval_score_q, 0.)) AS max_int,
  max(ifNull(fpred_max_score, 0.)) AS max_fpred,
  max_qtl + max_int + max_fpred as source_score
from ot.v2g
group by source_id, chr_id, variant_id, gene_id

across all sources

create table if not exists ot.v2g_score_by_overall
engine MergeTree partition by (chr_id) order by (variant_id, gene_id)
as select
  chr_id,
  variant_id,
  gene_id,
  avg(source_score) as overall_score
from ot.v2g_score_by_source
group by chr_id, variant_id, gene_id

And a query using the overall score

select
 index_variant_id,
 top_genes,
from
 (
   select
     index_variant_id
   from ot.v2d_by_stchr
   prewhere stid = 'NEALEUKB_50'
   group by index_variant_id
 )
   all inner join
 (
   select
     variant_id as index_variant_id,
     groupArray(tuple(gene_id,overall_score)) as top_genes,
   from ot.d2v2g_score_by_overall
   prewhere
     variant_id = index_variant_id and
     overall_score >= 0.9
   group by variant_id
 )
using index_variant_id
limit 10

returns

┌─index_variant_id─┬─top_genes─────────────────┐
│ 10_126290161_C_T │ [('ENSG00000189319',0.9)] │
└──────────────────┴───────────────────────────┘
┌─index_variant_id─┬─top_genes───────────────┐
│ 10_104285594_A_T │ [('ENSG00000138111',1)] │
│ 4_1341553_A_G    │ [('ENSG00000179979',1)] │
└──────────────────┴─────────────────────────┘
┌─index_variant_id─┬─top_genes─────────────────────────────────────────┐
│ 17_46123004_G_A  │ [('ENSG00000005243',0.9)]                         │
│ 11_75275479_T_G  │ [('ENSG00000149273',0.9)]                         │
│ 12_110016450_G_A │ [('ENSG00000135093',0.9),('ENSG00000256351',0.9)] │
│ 20_34025756_A_G  │ [('ENSG00000101019',1),('ENSG00000242372',0.9)]   │
│ 17_76790279_C_T  │ [('ENSG00000108669',1)]                           │
└──────────────────┴───────────────────────────────────────────────────┘
┌─index_variant_id─┬─top_genes───────────────────────────────────────────────────────────────────────────────────────┐
│ 17_7204078_A_T   │ [('ENSG00000004975',0.9),('ENSG00000161956',0.9),('ENSG00000132507',1),('ENSG00000161960',0.9)] │
│ 1_218520995_A_G  │ [('ENSG00000143353',1),('ENSG00000067533',1)]                                                   │
└──────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────┘

10 rows in set. Elapsed: 0.163 sec. Processed 3.29 million rows, 116.25 MB (20.18 million rows/s., 713.55 MB/s.)

edm1 commented 6 years ago

I have added a file containing source_id weights for the scoring algorithm to gs://genetics-portal-data/lut/v2g_scoring_source_weights.tsv.

These weights are currently completely arbitrary. The weighted average is:

( Σ x_i * w_i ) / C

where C = sum(weights in the file) is constant across variants. This should allow scores to be compared across variants, e.g. when calculating an index-to-gene score (in the future).

mkarmona commented 6 years ago

Done at clickhouse code level. @edm1 I already pushed my changes to the clickhouse loader scripts. It is running for the current data and it will be finished in a couple of hours

opentargets / issues

decide how to compute "g2v score" from new v2g table #2148