Open kwcurrin opened 5 years ago
Hi Kevin,
As you know, RASQUAL estimates the QTL effect (Pi value) jointly from the total read counts (between individuals) and allele-specific (AS) counts (within each individual). In an ideal situation (with no bias in AS counts), the Pi estimates from the two data sources should be identical. In reality, however, we know that many biases affect the Pi estimate from AS counts. We share the Pi value between the two data sources to stabilise Pi estimation.
Reference allele bias is the main confounder of the Pi estimate, since we tend to map more reads carrying the reference allele than the alternative allele. If the Pi estimate from AS counts is skewed toward the reference allele, there will be a discrepancy between the QTL signal estimated only from total read counts and the one estimated only from AS counts. The Phi value in RASQUAL simply captures the balance between these signals. Intuitively, it is a residual (fake) QTL signal coming only from AS counts. Note that Phi is shared across all SNPs rather than estimated for a single SNP (as long as you use multiple feature SNPs).
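To make the mapping-bias intuition concrete, here is a minimal toy sketch. It is not RASQUAL's parameterisation: it simply assumes that reference reads always map while a fraction `alt_mapping_loss` of alternative reads is lost, and both the function name and the parameter are hypothetical.

```python
def observed_ref_fraction(true_ref_fraction: float, alt_mapping_loss: float) -> float:
    """Expected reference-read fraction at a heterozygous SNP after mapping.

    Toy assumption: reference reads always map, while a fraction
    `alt_mapping_loss` of alternative reads is lost during mapping.
    """
    ref_reads = true_ref_fraction
    alt_reads = (1.0 - true_ref_fraction) * (1.0 - alt_mapping_loss)
    return ref_reads / (ref_reads + alt_reads)


# A truly balanced site looks reference-skewed once some alt reads are lost.
balanced = observed_ref_fraction(0.5, 0.0)
biased = observed_ref_fraction(0.5, 0.1)
print(f"no loss: {balanced:.3f}, 10% alt loss: {biased:.3f}")
```

Under this toy assumption, a site with no real allelic imbalance still shows a reference fraction above 0.5, which is the kind of skew that would otherwise be read as a QTL signal.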
Best regards, Natsuhiko
Hi Natsuhiko,
That helps me understand the model a lot better. That is very clever!
I had two follow-up questions to make sure I understand things correctly:
Thanks,
Kevin
Sorry, there was a typo in question 1; I meant phi when I wrote pi:
Hi Kevin,
In reality, there are multiple fSNPs within a feature region, and they are not in perfect LD with the rSNP, so we can still estimate both Pi and Phi from AS counts alone. This works because Phi captures allelic imbalance specifically toward reference alleles, whereas Pi captures the allelic imbalance seen only in individuals who are heterozygous at the rSNP. Intuitively, if the AS counts at all heterozygous fSNPs are skewed toward the reference allele, it is likely to be reference bias. By contrast, it is a true QTL signal if only the AS counts at heterozygous fSNPs linked to a heterozygous rSNP are skewed, toward either the reference or the alternative allele. Of course, I would say it is much harder to estimate Pi with AS counts alone.
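The distinction can be sketched as a simple comparison of pooled reference fractions. This is only an illustration with hypothetical counts and hypothetical function names, not how RASQUAL fits Pi and Phi. It assumes every record comes from an individual heterozygous at the fSNP and, for simplicity, that the fSNP reference allele sits on the same haplotype as the rSNP reference allele.

```python
def pooled_ref_fraction(pairs):
    """Pooled reference fraction over (ref_count, alt_count) pairs."""
    ref = sum(r for r, _ in pairs)
    alt = sum(a for _, a in pairs)
    total = ref + alt
    return ref / total if total else float("nan")


def as_skew_by_rsnp_genotype(records):
    """Split fSNP-heterozygous observations by rSNP zygosity.

    records: iterable of (ref_count, alt_count, rsnp_is_het).
    """
    records = list(records)
    het = [(r, a) for r, a, is_het in records if is_het]
    hom = [(r, a) for r, a, is_het in records if not is_het]
    return {
        "rsnp_het": pooled_ref_fraction(het),
        "rsnp_hom": pooled_ref_fraction(hom),
    }


# Hypothetical counts: a skew in both groups looks like reference bias ...
bias_like = as_skew_by_rsnp_genotype(
    [(24, 16, True), (24, 16, False), (12, 8, False)]
)
# ... while a skew only in rSNP heterozygotes looks like a QTL effect.
qtl_like = as_skew_by_rsnp_genotype(
    [(30, 10, True), (28, 12, True), (20, 20, False), (21, 19, False)]
)
print(bias_like)
print(qtl_like)
```

With real data the phase between fSNP and rSNP varies across individuals, so the skew in rSNP heterozygotes can point toward either allele, which is part of why the AS-only estimate is harder.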
Best regards, Natsuhiko
Hi Natsuhiko,
Thanks! That answers both of my questions.
However, I am nervous about reference allele bias estimates with --as-only for small feature regions. Our ATAC-seq peaks aren't very wide (median width ~500bp, upper quartile ~800bp), so I don't think many of the peaks will contain enough SNPs that aren't in high LD to give good estimates. For example, of my 1,011 significant SNP-peak pairs at FDR<5%, 212 of the peaks have 3 or fewer fSNPs and 522 (over half) have 5 or fewer fSNPs. This doesn't even account for the fact that many of these SNPs may be highly correlated.
I realize that this wouldn't be as much of an issue for larger features, like genes or broad histone marks, but it makes me nervous for narrow open chromatin peaks.
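A tally like the one above can be sketched as follows. The coordinates are made up for illustration, the function names are hypothetical, and the code assumes 1-based inclusive peak coordinates on a single chromosome. It does not account for LD between SNPs.

```python
from bisect import bisect_left, bisect_right


def count_fsnps_per_peak(peaks, snp_positions):
    """Count SNPs inside each peak on one chromosome.

    peaks: list of (start, end), assumed 1-based and inclusive.
    snp_positions: SNP coordinates on the same chromosome.
    """
    positions = sorted(snp_positions)
    return [
        bisect_right(positions, end) - bisect_left(positions, start)
        for start, end in peaks
    ]


def n_peaks_with_at_most(counts, max_fsnps):
    """Number of peaks with max_fsnps or fewer fSNPs."""
    return sum(1 for c in counts if c <= max_fsnps)


# Hypothetical peaks and SNP positions, for illustration only.
peaks = [(100, 600), (1000, 1800), (5000, 5400)]
snps = [150, 300, 600, 1200, 5100, 5150, 5200, 5300]
counts = count_fsnps_per_peak(peaks, snps)
print(counts, n_peaks_with_at_most(counts, 3))
```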
Thanks,
Kevin
Hi Kevin,
I think you can check the distribution of Pi for QTLs with a small number of fSNPs and the rSNP in the target feature. If the distribution is skewed toward 0 (Pi<0.5, indicating potential reference bias), you might need to apply stringent filtering to those QTLs.
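One rough way to run this check is sketched below. The input format, the function name, and the `max_fsnps` threshold are hypothetical, and the one-sided sign test is an added assumption: it treats Pi above and below 0.5 as equally likely when there is no reference bias.

```python
from math import comb


def pi_skew_summary(qtls, max_fsnps=3):
    """Summarise Pi for QTLs with few fSNPs.

    qtls: iterable of (pi, n_fsnps); Pi values of exactly 0.5 are ignored.
    Returns (n_below, n_total, one-sided sign-test p-value).
    """
    pis = [pi for pi, n in qtls if n <= max_fsnps and pi != 0.5]
    n_total = len(pis)
    n_below = sum(1 for pi in pis if pi < 0.5)
    # P(X >= n_below) for X ~ Binomial(n_total, 0.5): excess of Pi < 0.5.
    tail = sum(comb(n_total, k) for k in range(n_below, n_total + 1))
    p_value = tail / 2 ** n_total
    return n_below, n_total, p_value


# Hypothetical (Pi, number of fSNPs) pairs, for illustration only.
qtls = [
    (0.30, 1), (0.35, 2), (0.40, 3), (0.42, 2), (0.45, 1),
    (0.38, 3), (0.44, 2), (0.33, 1), (0.60, 2), (0.65, 3),
    (0.70, 8), (0.55, 6),
]
n_below, n_total, p_value = pi_skew_summary(qtls, max_fsnps=3)
print(f"{n_below}/{n_total} QTLs with Pi < 0.5, sign-test p = {p_value:.4f}")
```

A clear excess of Pi<0.5 among the few-fSNP QTLs, compared with QTLs that have many fSNPs, would point to the subset that needs the stricter filtering.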
Best regards, Natsuhiko
Hi Natsuhiko,
Could you explain how reference allele bias can be calculated per SNP? I understand how reference allele bias can be calculated genome-wide per individual by taking the average reference allele ratio across all SNPs per individual. However, I don't understand how this can be done individually per SNP. We don't know if the reference allele bias is real or technical for a given SNP unless we remove technical bias with a program like WASP. Estimating the bias between individuals for a given SNP also seems problematic because all individuals should have allelic imbalance if it is real.
Thanks,
Kevin