Closed RJBeng closed 4 years ago
Should have mentioned that the same regions were masked in the consensus sequences produced by both tools.
This looks like a bcftools query. I'll transfer it to the bcftools issue tracker.
bcftools consensus
merely applies variants from the VCF onto the fasta sequence. I don't know how sensitive their algorithm is to noise in the input data, that's a question for Gubbins authors.
On the side of bcftools, make sure that variant calling and filtering are done properly. This small howto gives some hints, but note that it avoids being too specific on purpose because every callset tends to be different, you'll have to play with the data and experiment http://samtools.github.io/bcftools/howtos/variant-calling.html.
Creating the consensus sequence is just the last, technical, step and should be easy http://samtools.github.io/bcftools/howtos/consensus-sequence.html
After looking further into this issue it seems that the problem is not due to bcftools consensus.
I used a old script with samtools mpileup, which is deprecated now, but is able to generate a "normal" looking tree topology shown in the second figure above.
samtools mpileup -d 1000 -guB -t DP,DV,DP4,SP -f $REF $BAM > ${sample}.tmp
bcftools call -c --ploidy 1 ${sample}.tmp -O u -o ${sample}.bcf
bcftools view ${sample}.bcf | vcfutils.pl vcf2fa -d 4 > ${sample}.pseudogenome.fa
Trying to update this pipeline using bcftools mpileup, whilst keeping all other downstream parameters the same produces the weird artificial staggered topology shown in the first figure above.
bcftools mpileup -B -d 1000 -a DP,DV,DP4,SP -f $REF $BAM > ${sample}.tmp
Using the steps outlined in http://samtools.github.io/bcftools/howtos/variant-calling.html also produces the staggered topology.
If I use the old script with samtools mpileup and then multi-thread bcftools calls, the final tree also staggers. What's weird is that this only happens to few parts of the tree. Perhaps the weird staggered topology is just the true nature of the tree, but it looks artificial to me. What I can't understand is why multi-threading bcftools calls would make a difference.
The version of bcftools im using is 1.9-80. I have tried the same scripts using the newest release of bcftools 1.11 also didn't make a difference.
I am afraid you'll have to dive in and investigate what the variants called in the old and the new version look like. Do they look trustworthy, are there any obvious biases?
Okay will have a look. Many thanks for your help :)
Hello,
I have encountered a very strange problem when calling consensus sequence from vcf using bcftools consensus
bcftools consensus -m $MASK snps.vcf.gz > ${sample}_consensus_masked.fa
This produces a phylogenetic tree with a weird staggered looking topology which I don't get if I use
vcfutils.pl vcf2fq
with the same vcf file. I am baffled as to why this is happening.RAxML tree produced by Gubbins using bcftools consensus alignments
RAxML tree produced by Gubbins using vcf2fq consensus alignments
Many thanks Rebecca