samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
678 stars 239 forks source link

bcftools consensus producing staggered phylogenetic tree topology #1323

Closed RJBeng closed 4 years ago

RJBeng commented 4 years ago

Hello,

I have encountered a very strange problem when calling consensus sequence from vcf using bcftools consensus

bcftools consensus -m $MASK snps.vcf.gz > ${sample}_consensus_masked.fa

This produces a phylogenetic tree with a weird staggered looking topology which I don't get if I use vcfutils.pl vcf2fq with the same vcf file. I am baffled as to why this is happening.

RAxML tree produced by Gubbins using bcftools consensus alignments

Screenshot 2020-10-07 at 15 52 36

RAxML tree produced by Gubbins using vcf2fq consensus alignments

Screenshot 2020-10-07 at 15 50 56

Many thanks Rebecca

RJBeng commented 4 years ago

Should have mentioned that the same regions were masked in the consensus sequences produced by both tools.

daviesrob commented 4 years ago

This looks like a bcftools query. I'll transfer it to the bcftools issue tracker.

pd3 commented 4 years ago

bcftools consensus merely applies variants from the VCF onto the fasta sequence. I don't know how sensitive their algorithm is to noise in the input data, that's a question for Gubbins authors.

On the side of bcftools, make sure that variant calling and filtering are done properly. This small howto gives some hints, but note that it avoids being too specific on purpose because every callset tends to be different, you'll have to play with the data and experiment http://samtools.github.io/bcftools/howtos/variant-calling.html.

Creating the consensus sequence is just the last, technical, step and should be easy http://samtools.github.io/bcftools/howtos/consensus-sequence.html

RJBeng commented 4 years ago

After looking further into this issue it seems that the problem is not due to bcftools consensus.

I used a old script with samtools mpileup, which is deprecated now, but is able to generate a "normal" looking tree topology shown in the second figure above.

samtools mpileup -d 1000 -guB -t DP,DV,DP4,SP -f $REF $BAM > ${sample}.tmp bcftools call -c --ploidy 1 ${sample}.tmp -O u -o ${sample}.bcf bcftools view ${sample}.bcf | vcfutils.pl vcf2fa -d 4 > ${sample}.pseudogenome.fa

Trying to update this pipeline using bcftools mpileup, whilst keeping all other downstream parameters the same produces the weird artificial staggered topology shown in the first figure above.

bcftools mpileup -B -d 1000 -a DP,DV,DP4,SP -f $REF $BAM > ${sample}.tmp

Using the steps outlined in http://samtools.github.io/bcftools/howtos/variant-calling.html also produces the staggered topology.

If I use the old script with samtools mpileup and then multi-thread bcftools calls, the final tree also staggers. What's weird is that this only happens to few parts of the tree. Perhaps the weird staggered topology is just the true nature of the tree, but it looks artificial to me. What I can't understand is why multi-threading bcftools calls would make a difference.

The version of bcftools im using is 1.9-80. I have tried the same scripts using the newest release of bcftools 1.11 also didn't make a difference.

pd3 commented 4 years ago

I am afraid you'll have to dive in and investigate what the variants called in the old and the new version look like. Do they look trustworthy, are there any obvious biases?

RJBeng commented 4 years ago

Okay will have a look. Many thanks for your help :)