Ingest credible sets from eQTL Catalogue

The eQTL Catalogue publishes, via FTP, the credible sets obtained by fine-mapping its summary statistics with SuSiE. So far, we have only generated credible sets from GTEx summary statistics. This work aims to extend the coverage to all of their fine-mapping results.

Background

There are 3 main datasets we need to process:

Studies metadata. Each row represents a study, defined as the combination of the publication, the differentiation experiment and the quantification method. We will extend the study definition by exploding the ID by the measured trait, so that each study and trait pair becomes its own study.

#  study_id        | QTS000001        
#  dataset_id      | QTD000001        
#  study_label     | Alasoo_2018      
#  sample_group    | macrophage_naive 
#  tissue_id       | CL_0000235       
#  tissue_label    | macrophage       
#  condition_label | naive            
#  sample_size     | 84               
#  quant_method    | ge
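
One way to picture the exploded study definition is sketched below. It is only an illustration: the `explode_study` helper, the `traitId` field and the `{study_label}_{sample_group}_{trait}` ID pattern are assumptions made for this example, not the schema implemented in the pipeline.

```python
def explode_study(metadata: dict, trait_ids: list) -> list:
    """Return one study record per measured trait.

    The ID pattern used here is hypothetical and only meant to show
    how a single metadata row fans out into several studies.
    """
    prefix = f"{metadata['study_label']}_{metadata['sample_group']}"
    return [
        # Copy the shared metadata and add the trait-specific fields.
        {**metadata, "traitId": trait_id, "studyId": f"{prefix}_{trait_id}"}
        for trait_id in trait_ids
    ]


# Metadata row taken from the example above.
row = {
    "study_id": "QTS000001",
    "dataset_id": "QTD000001",
    "study_label": "Alasoo_2018",
    "sample_group": "macrophage_naive",
    "tissue_id": "CL_0000235",
    "sample_size": 84,
    "quant_method": "ge",
}

# Two gene IDs borrowed from the other sample rows, purely for illustration.
studies = explode_study(row, ["ENSG00000233359", "ENSG00000272279"])
for study in studies:
    print(study["studyId"], study["tissue_id"])
```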

Dataset with credible sets. Each row represents a variant in a credible set and its statistics.

#  molecular_trait_id | ENSG00000233359          
#  gene_id            | ENSG00000233359          
#  cs_id              | ENSG00000233359_L1       
#  variant            | chr1_102291687_A_T       
#  rsid               | rs12044188               
#  cs_size            | 76                       
#  pip                | 0.0133246034625587       
#  pvalue             | 2.36065e-12              
#  beta               | -0.845301                
#  se                 | 0.0917599                
#  z                  | -10.1635388928995        
#  cs_min_r2          | 0.847176700121545        
#  region             | chr1:101389630-103389630 
#  credibleSetIndex   | 1                        
#  dataset_id         | QTD000046

Dataset with Bayes Factors (log10). Each row represents a variant present in any study and its Bayes Factors per credible set (lbf_variable1 refers to the LBF of the variant for the credible set number 1, for example).

#  molecular_trait_id | ENSG00000272279     
#  region             | chr6:528911-2528911 
#  variant            | chr6_529104_C_T     
#  chromosome         | 6                   
#  position           | 529104              
#  lbf_variable1      | -0.787007730605098  
#  lbf_variable2      | -0.245875563068143  
#  lbf_variable3      | -0.243956583413853  
#  lbf_variable4      | -0.246931930361598  
#  lbf_variable5      | -0.253445111259816  
#  lbf_variable6      | -0.261234926719349  
#  lbf_variable7      | -0.267987449936489  
#  lbf_variable8      | -0.271955807958227  
#  lbf_variable9      | -0.272196182176536  
#  lbf_variable10     | -0.268490278993762
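
The link between the two files can be sketched as follows: the `_L<n>` suffix of `cs_id` (equivalently, `credibleSetIndex`) selects the matching `lbf_variable<n>` column. The helper name `lbf_for_credible_set` is made up for this example, and the suffix convention is inferred from the sample rows above rather than taken from the eQTL Catalogue documentation.

```python
import re


def lbf_for_credible_set(lbf_row: dict, cs_id: str) -> float:
    """Pick the log Bayes factor that belongs to a given credible set."""
    match = re.search(r"_L(\d+)$", cs_id)
    if match is None:
        raise ValueError(f"cannot read a credible set index from {cs_id!r}")
    return float(lbf_row[f"lbf_variable{match.group(1)}"])


# Row abridged from the LBF example above (values are read as strings from the TSV).
lbf_row = {
    "molecular_trait_id": "ENSG00000272279",
    "variant": "chr6_529104_C_T",
    "lbf_variable1": "-0.787007730605098",
    "lbf_variable2": "-0.245875563068143",
}

# "ENSG00000272279_L2" is a hypothetical cs_id used only to show the lookup.
print(lbf_for_credible_set(lbf_row, "ENSG00000272279_L2"))
```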

Tasks

[x] Change the study index schema to incorporate tissue information.
[x] Generate an eQTL Catalogue study index dataset based on SuSiE results.
[x] Generate an eQTL Catalogue credible set dataset based on SuSiE results.
[x] QC the reported posterior probability to ensure it can be derived from LBFs

Validation of posterior probabilities in credible sets

The purpose of this analysis is to sanity-check the PIPs we obtain from the eQTL Catalogue. This matters because downstream COLOC and L2G both rely on these PIPs.

For the QC, we derived posterior probabilities from Bayes Factors (BFs) for each variant within the credible sets following this:

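import numpy as np
from scipy.special import logsumexp

# lbf_variable: NumPy array with the log BFs of all variants in the credible set
# (np.log / np.exp below treat these values as natural logs)
priors = np.log(1e-4)  # same prior for every variant, on the log scale
credible_set_lbf = logsumexp(lbf_variable + priors)
calculated_pips = np.exp(lbf_variable + priors - credible_set_lbf)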

I evaluated the PIPs for a credible set from Sun et al., and saw a high correlation coefficient of 0.999 with our calculated PIPs. This result indicates that PIPs as reported by the eQTL Catalogue are accurate and reliable based on the examined set.
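
The check described above can be reproduced on toy numbers with the standard library alone. This is a minimal sketch, not the code used for the QC: `pips_from_lbf` and `pearson` are helper names introduced here, the log Bayes factors are assumed to be natural logs, and both input lists are invented for illustration.

```python
import math


def pips_from_lbf(lbfs, prior=1e-4):
    """Posterior probabilities within one credible set under a uniform prior."""
    log_prior = math.log(prior)
    weighted = [lbf + log_prior for lbf in lbfs]
    # log-sum-exp, subtracting the maximum for numerical stability
    top = max(weighted)
    log_total = top + math.log(sum(math.exp(w - top) for w in weighted))
    return [math.exp(w - log_total) for w in weighted]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values: four variants in one credible set.
toy_lbfs = [12.3, 11.8, 9.1, 4.0]
toy_reported_pips = [0.60, 0.36, 0.03, 0.01]

calculated = pips_from_lbf(toy_lbfs)
r = pearson(calculated, toy_reported_pips)
print([round(p, 4) for p in calculated], round(r, 4))
```

Because the prior is identical for every variant, it cancels in the normalisation, so the calculated values sum to 1 within the credible set regardless of the prior chosen.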

Gist with code

```python
import numpy as np
import pyspark.sql.functions as f
from scipy.special import logsumexp

from gentropy.common.session import Session
from gentropy.datasource.eqtl_catalogue.finemapping import EqtlCatalogueFinemapping
from gentropy.datasource.eqtl_catalogue.study_index import EqtlCatalogueStudyIndex

session = Session("yarn")

eqtl_catalogue_paths_imported = "gs://eqtl_catalog_data/susie_decompressed_tmp"
studies_to_ingest = ["QTD000584"]

## DATA PREPARATION
studies_metadata = EqtlCatalogueStudyIndex.read_studies_from_source(
    session, mqtl_quantification_methods_blacklist=[]
)
credible_sets_df = EqtlCatalogueFinemapping.read_credible_set_from_source(
    session,
    credible_set_path=[
        f"{eqtl_catalogue_paths_imported}/{qtd_id}.credible_sets.tsv"
        for qtd_id in studies_to_ingest
    ],
)
lbf_df = EqtlCatalogueFinemapping.read_lbf_from_source(
    session,
    lbf_path=[
        f"{eqtl_catalogue_paths_imported}/{qtd_id}.lbf_variable.txt"
        for qtd_id in studies_to_ingest
    ],
)
processed_susie_df = EqtlCatalogueFinemapping.parse_susie_results(
    credible_sets_df, lbf_df, studies_metadata
)
sample = (
    processed_susie_df.filter(f.col("region") == "chr12:55362857-57362857")
    .filter(f.col("credibleSetIndex") == 1)
    .filter(f.col("studyId") == "Sun_2018_plasma_APOF.12370.30.3..1")
    # credible set of size 134
)
credible_sets = EqtlCatalogueFinemapping.from_susie_results(processed_susie_df)

## QC FUNCTIONS
sample_pdf = sample.drop(
    "chromosome",
    "position",
    "geneId",
    "molecular_trait_id",
    "c",
    "nSamples",
    "beta",
    "standardError",
    "finemappingMethod",
    "traitFromSourceMappedIds",
    "dataset_id",
    "projectId",
    "studyType",
    "summarystatsLocation",
    "hasSumstats",
).toPandas()

# np.log(np.repeat(1 / p, p))
# lbf_cs = np.apply_along_axis(
#     lambda x: logsumexp(x + priors), axis=0, arr=lbf_variable
# )
# pip=np.exp(lbf_variable+priors-lbf_cs)
lbf_variable = sample_pdf["logBF"].to_numpy()
priors = np.log(1e-4)
lbf_cs = logsumexp(lbf_variable + priors)
sample_pdf["calculated_pip"] = sample_pdf.apply(
    lambda row: np.exp(row["logBF"] + priors - lbf_cs), axis=1
)
```

Ingestion of gene expression QTLs

The eQTL Catalogue reports results for several methods of quantifying the raw RNA-seq data. We agreed that we are mostly interested in gene expression (ge) results, because the extra granularity of the other methods is unlikely to have a large impact on the gene prioritisation task. Restricting to ge also reduces the amount of data significantly, and therefore the computation required.

Data is available here:

Credible sets: gs://eqtl_catalog_data/credible_set_datasets/susie_0103
Study index: gs://eqtl_catalog_data/study_index_0103

Main metrics

317,911 studies
385,100 credible sets

Most of the studies (262,162 of 317,911, about 82%) have a single credible set:

+----------------+------+                                                       
|nCredSetPerStudy| count|
+----------------+------+
|              10|     2|
|               9|     3|
|               8|     6|
|               7|    25|
|               6|    71|
|               5|   313|
|               4|  1232|
|               3|  7555|
|               2| 46542|
|               1|262162|
+----------------+------+
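
As a quick consistency check, the two headline metrics can be recomputed from this distribution. The snippet below is only a worked piece of arithmetic on the counts in the table; the variable names are arbitrary.

```python
# Number of studies (value) having a given number of credible sets (key),
# copied from the table above.
cs_per_study = {
    10: 2, 9: 3, 8: 6, 7: 25, 6: 71,
    5: 313, 4: 1232, 3: 7555, 2: 46542, 1: 262162,
}

n_studies = sum(cs_per_study.values())
n_credible_sets = sum(n_cs * n for n_cs, n in cs_per_study.items())
single_share = cs_per_study[1] / n_studies

print(n_studies, n_credible_sets, f"{single_share:.1%}")
# prints: 317911 385100 82.5%
```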

Credible sets contain about 32 variants on average, but the distribution is heavily skewed: the median is 11 and the largest set has 4,090 variants.

+-------+------------------+                                                    
|summary|       credSetSize|
+-------+------------------+
|  count|            385100|
|   mean|31.648021293170604|
| stddev|  99.3423119691853|
|    min|                 1|
|    25%|                 3|
|    50%|                11|
|    75%|                32|
|    max|              4090|
+-------+------------------+

Comparison with PICS credible sets extracted from summary statistics

We have credible sets for GTEx studies with summary statistics that were fine-mapped with PICS. We want to compare the harmonised SuSiE results with these, as we assume that the same credible sets should be captured in both datasets.

After comparing, we see that credible sets are heavily underrepresented in the SuSiE results:

About 83% of the PICS credible sets (194,608 of 235,114 study and lead variant pairs) are not in the SuSiE credible sets. This number will be influenced by the variant that we define as the lead in the region.

pics.select("studyId", "variantId").distinct().count()
235114
pics.join(susie, on=["studyId", "variantId"], how="left_anti").select("studyId", "variantId").distinct().count()
194608
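
For readers less familiar with Spark's `left_anti` join, the comparison above amounts to a set difference on `(studyId, variantId)` pairs. The sketch below uses plain Python sets; the `study_a` and `study_b` rows are invented, and only the two counts at the end come from the output above.

```python
# PICS leads: one real pair from this issue plus two hypothetical ones.
pics_pairs = {
    ("gtex_artery_tibial_ensg00000237651", "2_61145163_C_G"),
    ("study_a", "1_1000_A_T"),  # hypothetical
    ("study_b", "3_2000_G_C"),  # hypothetical
}
# SuSiE leads: hypothetical, shares a single pair with PICS.
susie_pairs = {("study_a", "1_1000_A_T")}

# Equivalent of pics.join(susie, on=[...], how="left_anti"):
# keep the PICS pairs that have no exact match in SuSiE.
missing = pics_pairs - susie_pairs
print(sorted(missing))

# Share of PICS pairs without a SuSiE match, using the reported counts.
reported_share = 194608 / 235114
print(f"{reported_share:.1%}")
# prints: 82.8%
```

Because the match is exact on `variantId`, a credible set whose lead variant differs between the two methods counts as missing even when both sets tag the same signal.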

However, we observe that 45% of the studies in the PICS credible sets are not in the SuSiE credible sets. This is a more alarming metric, because it means that there are studies for which we are not getting any coverage.
```
susie.select("studyId").distinct().count()
>>> 317911
missing_studies.select("studyId").distinct().count()
>>> 144443
```
(These numbers are not meant to be an exact measure of the difference between the datasets, but we think they are a strong enough indicator to warrant a closer look.)

I have taken the 2_61145163_C_G locus as an example. This locus shows a highly significant association (p = 1.251e-252) with ENSG00000237651 expression, which we represent in our PICS results.

 studyId                          | gtex_artery_tibial_ensg00000237651
 studyLocusId                     | -1609360246168945094
 variantId                        | 2_61145163_C_G
 chromosome                       | 2
 position                         | 61145163
 beta                             | 1.07003
 oddsRatio                        | null
 oddsRatioConfidenceIntervalLower | null
 oddsRatioConfidenceIntervalUpper | null
 betaConfidenceIntervalLower      | 1.0537764
 betaConfidenceIntervalUpper      | 1.0862836
 pValueMantissa                   | 1.251
 pValueExponent                   | -252
 effectAlleleFrequencyFromSource  | 0.44863
 standardError                    | 0.0162536
 subStudyDescription              | null
 qualityControls                  | []
 finemappingMethod                | null

However, neither the study nor the association is part of the SuSiE credible sets (dataset QTD000141). I can only see the variant if I look at the exon expression (dataset QTD000142) or transcript usage (dataset QTD000143) results.

-RECORD 0-------------------------------------------------------
 molecular_trait_id | ENSG00000115464.15_2_61189454_61189697
 gene_id            | ENSG00000115464
 cs_id              | ENSG00000115464.15_2_61189454_61189697_L3
 variant            | chr2_61145163_C_G
 rsid               | rs3213944
 cs_size            | 32
 pip                | 0.00483149909469316
 pvalue             | 0.827647
 beta               | -0.00882123
 se                 | 0.0404976
 z                  | -0.219423623791845
 cs_min_r2          | 0.645227986566961
 region             | chr2:60189575-62189575
 dataset_id         | QTD000142                    <--- exon expression
-RECORD 1-------------------------------------------------------
 molecular_trait_id | ENST00000498268
 gene_id            | ENSG00000115464
 cs_id              | ENST00000498268_L1
 variant            | chr2_61145163_C_G
 rsid               | rs3213944
 cs_size            | 68
 pip                | 0.0182487218166061
 pvalue             | 4.98004e-09
 beta               | 0.271302
 se                 | 0.0456777
 z                  | 5.98042337599837
 cs_min_r2          | 0.710026045233212
 region             | chr2:60471087-62471087
 dataset_id         | QTD000143                      <--- tx

The fact that we see this association in PICS means that it is part of the summary statistics, therefore it should be represented downstream. I'm tagging @kauralasoo to look at this example and perhaps shed some light.

Link to code gist: https://gist.github.com/ireneisdoomed/35c9a1e8f266442bdfbf2eba95c25eca

As a next step, we have decided to ingest the results from all quantification methods, so that we extend the coverage and counteract the impact of this issue.

Hi Irene,

For this particular example there is no credible set, because this particular variant is not a significant eQTL in our re-analysis of GTEx. A useful troubleshooting strategy can be to look up nominal p-values from the full summary statistics files (only available for ge quantification method):

tabix ftp://ftp.ebi.ac.uk/pub/databases/spot/eQTL/sumstats/QTS000015/QTD000141/QTD000141.all.tsv.gz 2:61145162-61145163 | grep ENSG00000237651

The p-value in our re-analysis is 0.70.

We do know that there are some differences in the eQTL results between our re-analysis and the original GTEx analysis, but this is probably one of the most extreme examples. We do not know all of the causes of these discrepancies, but one source of variation could be how alternative haplotypes/contigs are handled during read alignment and what is done with multi-mapping reads.
