natsuhiko / rasqual

Robust Allele Specific Quantification and quality controL
37 stars 20 forks source link

QQ plot from rasqual P-values #43

Closed plbngl closed 4 years ago

plbngl commented 4 years ago

Dear Natsuhiko,

I have a question about the results from my ATAC caQTL analysis: If I plot the p-values to make a qq plot, I see something like this download-1 where the P-values lay under the diagonal. I use covariates and size factor files generated by rasqualtools , plus I add sex and 1K genomes first 4 PCs in the covariate file. It seems that overall there is less deflation when I do not add the additional covariates. Does it seem correct to you? DO you recommend any strategy to improve it? Thank you very much,

Paola

plbngl commented 4 years ago

SOrry the qq-plot is this download

natsuhiko commented 4 years ago

Hi Paola,

First of all, did you plot log10 P-values or log10 Q-values? I would recommend to plot Q-values (on column 10 of the output file). In addition, it is unlikely that those Q-values follow a uniform distribution. I would also recommend to plot the Q-values against the permuted Q-values which can be obtained by '-r' option.

Best regards, Natsuhiko

plbngl commented 4 years ago

Hi Natsuhiko- Thank you ! I plotted the -log10(p-values) from the lead variant results (option -t), calculated from column 11 as follows: pchisq(result_lead[,11], 1, lower=F), against a null uniform distribution.

Now I plot colum 10 against permuted one as you suggest and have this: download-3 Is this how I should do it ?

plbngl commented 4 years ago

I wondered if it is expected that they are from the beginning ?- I also plotted the q values from all snps (not just the lead variants) and produces similar results. Thank you very much! Paola

natsuhiko commented 4 years ago

I think the above plot looks reasonable to me (the lead q-values against the lead permuted q-values). The plot suggests you can perform much better FDR calibration with the lead permuted Q-values (because points near the origin are exactly on the diagonal line). I recommend to perform genome-wide FDR correction with the lead q-values, but not with all q-values (including non-lead variants).

Natsuhiko

plbngl commented 4 years ago

Thank you for your answer! I have one more question about how lead SNPs are chosen. I understand that they are chosen randomly among those with the same, lowest, q-value for a given feature. Is it correct and why is random and not for example based on position or lowest p-value? My only problem with having it picked random is when I compare different dataset or cell type for example having the same qtl, the reported lead variant is different.... Thank you again!! Best, Paola

natsuhiko commented 4 years ago

If the lead variants have the same q-value, they also have the same p-value. If you want to get the mapping result for all variants, you just remove '-t' option to rerun RASQUAL (it may take time though).

Natsuhiko

plbngl commented 4 years ago

Thank you , yes, I've been doing that, but I get identical q-values for (slightly) different p-values or Chi-squares....like in this example - I can reproduce the same q-values if I do p.adjust of the p-values....., which I think is what you do to get the q-value? Untitled

natsuhiko commented 4 years ago

Sorry, I don't really understand your problem here. As you mentioned I use the Benjamini-Hochberg method (equivalent to p.adjust with method="BH") to compute the q-values.

Best regards, Natsuhiko

plbngl commented 4 years ago

My question was about picking the lead variant for a given Feature, when multiple variants have the same q-values (ties) but different p-values. Why is it random and not the variant with the lowest p-value (or largest Chisquare), as in my example. But it's ok I can always go back to the results for all variants and chose a different lead. Thank you very much! Best Paola

natsuhiko commented 4 years ago

The lead variant was defined by the maximum chi-square value (i.e., minimum P-value) but not q-value. You should get the lead variant with '-t' which gives you the minimum P-value. If not, please let me know.

Best regards, Natsuhiko

plbngl commented 4 years ago

That's true, I did not realize that. I thought they were chosen random if having the same qval, probably misled by the description of column 21 of the results "Random location of ties (tie lead SNP; only useful with -t option)". Thank you very much!

plbngl commented 4 years ago

Sorry just one more unrelated quick question: what --min-coverage-depth filter for fSNP did you use in your paper? In the figures I see you show fSNPs with >20 reads but I am not sure if you apply the filter -min-coverage-depth in the calculation. Thanks again!!! Best

natsuhiko commented 4 years ago

The option is always applied to your data to filter out fSNPs with low coverage depth (default is 5% of mean coverage depth at coding regions). Those SNPs are still used as tested SNPs, but the AS counts are no longer used. This option is useful when there are too many fSNPs with shallow coverage depth due to bad quality of RNA. Usually you don't need to tweak this option.

Best regards, Natsuhiko

plbngl commented 4 years ago

Thank you ! and in case of ATAC seq that would be 5% of mean coverage at accessible regions? (peaks?)

natsuhiko commented 4 years ago

For the ATAC-seq, the average coverage depth in the peak region that you define by -s and -e.

Best regards, Natsuhiko

plbngl commented 4 years ago

ok thank you very much!!