vibansal / crisp

Code for multi-sample variant calling from sequence data of pooled or unpooled DNA samples
MIT License
19 stars 8 forks source link

Vcf output #4

Closed PopGenHamburg closed 6 years ago

PopGenHamburg commented 6 years ago

Hi! CRISP seemed like a nice tool so I used it for a very small dataset. It gives good results, but the trouble begins when I try using the vcf file in sorting/filtering tools such as bcftools. I cannot index it, the error message reads "truncated". "bcftools index Refsvariantcalls.VCF index: the input is probably truncated, use -f to index anyway: Refsvariantcalls.VCF"

Upon inspection, it seems the header and the INFO and FORMAT fields don't agree much. Some variables are in the header but not used anymore, which would be okay, but this also occurs the other way around. Could you please have a look?

In the header AND used in variants list: INFO: NP, DP, CT FORMAT: GT, AF, ADf, ADr

In the header but not used: INFO: MQ, HP FORMAT: MLAC, ADb

Used in main variants list BUT not in the header (I think this is what caused the error message) INFO: VT, MQS, FLANKSEQ FORMAT: DP (although this one is defined in the INFO, but I guess it should be defined again)

Another issue is the main part of the file, the format coulmn reads GT:AF:DP:ADf:ADr, which would mean there should be 5 fields, but here are 6 different fields in the result for each sample, like this: GT:AF:DP:ADf:ADr 0/0:0.002:5110:5120:3747,6:1355,2 The last two values are ADf and ADr, was is the one before?

I am not a programmer and/or not skilled enough to change it myself in the source code, would it be possible to implement it soon(ish)? Thanks! Mathilde

vibansal commented 6 years ago

Thank you for the feedback. I have updated the VCF header and output to fix the missing information.

I tested the VCF output on a small dataset and it can be indexed using bcftools or tabix:

  1. bgzip output.vcf
  2. bcftools (v 1.1) index output.vcf.gz

The truncated output from bcftools may suggest that the VCF file is incomplete for some reason.

Please let me know if there any further issues.

PopGenHamburg commented 6 years ago

Dear Vikas, thank you for looking at it so quickly. I'll test it right away!

PopGenHamburg commented 6 years ago

It almost worked, now all the INFO/FORMAT variables are there. However, there is still an issue with the FORMAT column in the main vcf body: 5 fields, six values given for each pool... I replaced it manually this time because it is easy with a small file:

GT:AF:DP:ADf:ADr becomes GT:AF:DP:AC:ADf:ADr

Afterwards the following steps worked just fine:

bcftools view Refsvariantcalls_corrected.vcf -Oz -o Refsvariantscalls_corrected.vcf.gz
tabix Refsvariantscalls_corrected.vcf.gz
bcftools filter -O v -T targets.txt Refsvariantscalls_corrected.vcf.gz -o Refsvariants_filtered.vcf

Best, Mathilde

vibansal commented 6 years ago

I have updated the code to output 5 fields in the genotype column. Thanks.