Closed ireneisdoomed closed 1 month ago
The main purpose of this is to perform a sanity check on the PIPs we obtain from the eQTL Catalogue. This analysis is crucial for ensuring the reliability of these PIPs for downstream COLOC and L2G.
For the QC, we derived posterior probabilities from Bayes Factors (BFs) for each variant within the credible sets following this:
lbf_variable = # array of log10BFs for all variants in the credible set
priors = np.log(1e-4)
credible_set_lbf = logsumexp(lbf_variable + priors)
calculated_pips = np.exp(lbf_variable + priors - credible_set_lbf)
I evaluated the PIPs for a credible set from Sun et al., and saw a high correlation coefficient of 0.999 with our calculated PIPs. This result indicates that PIPs as reported by the eQTL Catalogue are accurate and reliable based on the examined set.
eQTL Catalogue reports results for different methods of quantifying thje raw RNA-seq data. We agreed that we are mostly interested in gene expression (ge
) results because the extra granularity of the other methods wouldn't potentially have a huge impact for the gene prioritisation task. It also reduces the amount of data significantly, and therefore the computation task.
Data is available here:
gs://eqtl_catalog_data/credible_set_datasets/susie_0103
gs://eqtl_catalog_data/study_index_0103
+----------------+------+
|nCredSetPerStudy| count|
+----------------+------+
| 10| 2|
| 9| 3|
| 8| 6|
| 7| 25|
| 6| 71|
| 5| 313|
| 4| 1232|
| 3| 7555|
| 2| 46542|
| 1|262162|
+----------------+------+
+-------+------------------+
|summary| credSetSize|
+-------+------------------+
| count| 385100|
| mean|31.648021293170604|
| stddev| 99.3423119691853|
| min| 1|
| 25%| 3|
| 50%| 11|
| 75%| 32|
| max| 4090|
+-------+------------------+
We have credible sets for GTEx studies with summary statistics that were fine mapped with PICS. We want to compare the harmonised SuSIE results with these, as we assume that the same credible sets should be captured in both datasets.
After comparing, we see that there is a big under representation of credible sets in the SuSIE results:
pics.select("studyId", "variantId").distinct().count()
235114
pics.join(susie, on=["studyId", "variantId"], how="left_anti").select("studyId", "variantId").distinct().count()
194608
susie.select("studyId").distinct().count()
>>> 317911
missing_studies.select("studyId").distinct().count()
>>> 144443
(The numbers here don't intend to represent a faithful difference between datasets, however we think it is a strong enough indicator to look at it.)
I have taken the 2_61145163_C_G
locus as an example. This locus has shown a very statistically significant (1e-252) association with ENSG00000237651 expression that we represent in our PICS results.
studyId | gtex_artery_tibial_ensg00000237651
studyLocusId | -1609360246168945094
variantId | 2_61145163_C_G
chromosome | 2
position | 61145163
beta | 1.07003
oddsRatio | null
oddsRatioConfidenceIntervalLower | null
oddsRatioConfidenceIntervalUpper | null
betaConfidenceIntervalLower | 1.0537764
betaConfidenceIntervalUpper | 1.0862836
pValueMantissa | 1.251
pValueExponent | -252
effectAlleleFrequencyFromSource | 0.44863
standardError | 0.0162536
subStudyDescription | null
qualityControls | []
finemappingMethod | null
However, neither the study or the association is part of of the SUSIE credible sets (dataset QTD000141). I can only see the information if I look at the exon expression (dataset QTD000142) or transcript usage (dataset QTD000142) results.
-RECORD 0-------------------------------------------------------
molecular_trait_id | ENSG00000115464.15_2_61189454_61189697
gene_id | ENSG00000115464
cs_id | ENSG00000115464.15_2_61189454_61189697_L3
variant | chr2_61145163_C_G
rsid | rs3213944
cs_size | 32
pip | 0.00483149909469316
pvalue | 0.827647
beta | -0.00882123
se | 0.0404976
z | -0.219423623791845
cs_min_r2 | 0.645227986566961
region | chr2:60189575-62189575
dataset_id | QTD000142 <--- exon expression
-RECORD 1-------------------------------------------------------
molecular_trait_id | ENST00000498268
gene_id | ENSG00000115464
cs_id | ENST00000498268_L1
variant | chr2_61145163_C_G
rsid | rs3213944
cs_size | 68
pip | 0.0182487218166061
pvalue | 4.98004e-09
beta | 0.271302
se | 0.0456777
z | 5.98042337599837
cs_min_r2 | 0.710026045233212
region | chr2:60471087-62471087
dataset_id | QTD000143 <--- tx
The fact that we see this association in PICS means that it is part of the summary statistics, therefore it should be represented downstream. I'm tagging @kauralasoo to look at this example and perhaps shed some light.
Link to code gist: https://gist.github.com/ireneisdoomed/35c9a1e8f266442bdfbf2eba95c25eca
As per next steps, we have decided to ingest the results from all quantification methods so that we extend the coverage and counteract the impact of this issue.
Hi Irene,
For this particular example there is no credible set, because this particular variant is not a significant eQTL in our re-analysis of GTEx. A useful troubleshooting strategy can be to look up nominal p-values from the full summary statistics files (only available for ge quantification method):
tabix ftp://ftp.ebi.ac.uk/pub/databases/spot/eQTL/sumstats/QTS000015/QTD000141/QTD000141.all.tsv.gz 2:61145162-61145163 | grep ENSG00000237651
The p-value in our re-analysis is 0.70.
We do know that there are some differences in the eQTL results between our re-analysis and the original GTEx analysis, but this is probably one of the most extreme example. We do not know what all of the causes for these discrepancies are, but one source of variation could be how alternative haplotypes/contigs are handled during the read alignment and what is done to multi-mapping reads.
The eQTL Catalogue makes available through FTP their credible sets extracted after fine mapping summary statistics with SuSIE. So far, we have generated credible sets from GTEx summary statistics. This work pretends to extend the coverage to all their fine mapping results.
Background
There are 3 main datasets we need to process:
Tasks