vibansal / crisp

Code for multi-sample variant calling from sequence data of pooled or unpooled DNA samples
MIT License
19 stars 8 forks source link

Variant Quality and Variant Pools #11

Closed abofinal closed 4 years ago

abofinal commented 5 years ago

Hello again! This tool is doing wonders for analyzing the pools I'm working with, so I would thank you if you could solve another doubt I have interpreting the output file.

About the quality field in the output, my samples vary greatly between qualities with a minimum value of 20 (most variants are in the 20-200 range approximately) and some reach maximum values several orders of magnitude greater (~800,000 for example). Something similar happens in other examples from a few papers, but I couldn't get much information from there.

We can assume all of the variants from the output passed CRISP's tests. Is there a great difference between low quality variants and huge, out-the-charts quality variants?

Also, I'm very interested in keeping low-frequency variants in my pools. Is the quality value dependant on allele frequency somehow? As in, the lower the frequency, the lower the quality?

Again, very great work with CRISP, I hope you can answer my questions!

Greetings, ~Ramón

vibansal commented 5 years ago

The variant quality compares the likelihood of the reads under the null model (no variant) to the alternate (variant is present). Therefore, it depends on the number of pools that have the variant allele and the allele frequency. For rare variants, the variant quality will be on the lower end since only 1-2 pools have the variant.

abofinal commented 5 years ago

Thank you, that was very helpful again, as I was pondering the idea of filtering the variants below a certain quality threshold. That would bias the data against rare variants however, which are the ones I'm most interested in keeping.

One other thing I wanted to ask. In this case, for example: 2L 8202 . A G 141 LowDepth NP=5;DP=197,108,0;VT=SNV;CT=-0.5;VP=2;VF=EMpass;AC=21;AF=0.02939;EMstats=14.20:-8.14;HWEstats=-0.0;MQS=0,0,0,321;FLANKSEQ=gtgacgcacc:A:cagccgttaa AC:GQ:DP:ADf:ADr:ADb .:1:51:29,1:15,0:0,0 .:0:48:25,0:17,0:0,0 .:1:75:46,1:25,2:0,0 .:0:69:44,2:20,2:0,0 .:0:78:48,1:27,0:0,0

I've been trying to understand how to know which of the pools is considered variant by the program. I thought the basic (default) requirement was for reads with the alternate allele to appear 4 times in a pool. In this example, we have NP=5 pools, and VP=2 variant pools.

However, only one of the pools reaches a number of 4 alternate alleles (fourth pool- 69:44,2:20,2:0,0 > 2+2=4). I assume the other variant pool is the third pool, but it doesn't reach the 4 threshold (75:46,1:25,2:0,0 > 1+2=3), so I could be wrong and another of the pools could be being seen as the variant one.

Is there any way that I can unequivocally differentiate between variant and non-variant pools from the output of CRISP?

Thank you for answering all my questions, I'm very interested in understanding the most out of this program's output!

vibansal commented 5 years ago

The 'AC' field contains the most likely number of variant alleles per pool. If that number is > 0 and the corresponding GQ is greater than a threshold, one can be confident that the pool contains the variant. In the example above, the AC field is set to '.' (missing) due to the low coverage