interpreting vcf file from CRISP

vibansal / crisp

Code for multi-sample variant calling from sequence data of pooled or unpooled DNA samples

MIT License

interpreting vcf file from CRISP #6

Closed: mbxao3 closed this issue 5 years ago

mbxao3 commented 6 years ago

Hi, I am having difficulty understanding the meaning of some of the fields, as either they are not described or their values do not correspond with their description. For instance, I don't understand why DP has three values and why the FILTER field has a numerical value of zero (instead of pass or fail) at some sites. Could you please explain the meaning of the following fields and their values, listed as field (example value): CT (-3.1); DP (7432, 7321, 1056); VF (Emfail); FILTER (0)? Also, why does CRISP not produce any GT information for any of the pools? The absence of a GT field has made my output VCF unreadable by several other scripts. I would really appreciate your assistance with these as I am running out of time for my dissertation. Thank you.

vibansal commented 6 years ago

The three values for DP correspond to the read depth on the forward strand, the reverse strand, and bidirectional reads.
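For readers who want to pull those three depths out programmatically, here is a minimal sketch. It assumes DP appears as a comma-separated `DP=` entry in the INFO column, in the order given above (forward, reverse, bidirectional); the helper names `parse_info` and `strand_depths` and the example INFO string are made up for illustration and are not part of CRISP.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; keys without '=' map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def strand_depths(info):
    """Return the three DP values, assuming forward, reverse, bidirectional order."""
    dp = parse_info(info)["DP"]
    forward, reverse, bidirectional = (int(x) for x in dp.split(","))
    return {"forward": forward, "reverse": reverse, "bidirectional": bidirectional}


# Hypothetical INFO string built from the example values quoted in the question.
example_info = "DP=7432,7321,1056;CT=-3.1;VF=Emfail"
depths = strand_depths(example_info)
print(depths)
```

If the INFO layout in your file differs (for instance, a different number of DP values), the tuple unpacking will raise a `ValueError`, which is a useful early signal that the assumption does not hold.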

VF = Emfail implies that the variant did not pass the EM-based genotyping.
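As a rough illustration of how one might set aside such sites before downstream analysis, the sketch below drops records whose VF value is Emfail. It assumes VF is stored as a `VF=` entry in the INFO column (the eighth tab-separated column of a VCF record) and compares the value case-insensitively. The function names, the two example records, and the `EMpass` value are hypothetical and serve only to exercise the code.

```python
def em_status(info):
    """Return the VF value from an INFO string, or None if VF is absent."""
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        if key == "VF" and sep:
            return value
    return None


def failed_em(vcf_line):
    """True if the record's VF value is 'Emfail' (case-insensitive)."""
    columns = vcf_line.rstrip("\n").split("\t")
    status = em_status(columns[7])  # INFO is the 8th VCF column
    return status is not None and status.lower() == "emfail"


# Made-up records for illustration only; real CRISP output has more columns.
records = [
    "chr1\t100\t.\tA\tG\t50\t0\tDP=7432,7321,1056;VF=Emfail",
    "chr1\t200\t.\tC\tT\t80\t0\tDP=500,480,60;VF=EMpass",
]
kept = [r for r in records if not failed_em(r)]
print(len(kept), "of", len(records), "records kept")
```

Whether to discard or merely flag these sites is a judgment call for your own analysis; this sketch only shows the mechanics of reading the field.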

To convert the CRISP VCF output to a genotype-based VCF, please use the python script available in the scripts directory.

mbxao3 commented 6 years ago

Many thanks for your response. It has been really helpful. Pardon me for asking basic questions, but I don't fully understand the impact of failing the EM-based genotyping on variant authenticity. Does failing the EM-based genotyping mean that, at that locus, the fluorescence intensity produced is not sufficient to accurately determine the incorporated base/allele, based on the Expectation-Maximisation algorithm? I have taken a look at papers on EM-based genotyping, but didn't really find one related to pooled deep sequencing. I am quite new to this and would really appreciate your assistance. A quick insight will be sufficient to guide me in the right direction to get more information. Thank you.

vibansal commented 6 years ago

The EM-based genotyping is used to estimate the pooled genotypes. In our datasets, the 'Emfail' filter can identify many variants that are artifacts of sequencing error, especially strand-specific sequencing errors.