Closed tkn132 closed 6 months ago
It would be interesting to see the full line of text in the vcf.gz file rather than re-reading it and printing it out agvain with query
.
Something like zgrep chr1.180951 sample.vcf.gz
would give you the full line. (Doesn't need to run to completion - it could take ages, but this is a line early on in the file so once it's printed up you'll know)
That would tell us whether the original mpileup
/ call
commands worked or not.
Edit: also reporting bcftools --verson
is always useful when reporting problems so we know if this is an up to date binary.
Also, note DP isn't floating point. So it's not a case of decimal places missing. It feels more like some parsing/printing string stutter or BCF byte-packing bug to me. It's notable that all of these wrong values are repeating the same string.
DP is defined as Type=Integer, not as Type=Float, therefore there will be no decimal points. This information can be found in the VCF header.
Actually, we can cut down all the way to the initial mpileup
too:
bcftools mpileup --skip-indels -f ~/db/hg38.fa.gz -r chr1:180951 ${sample}.bam | grep 180951
That is purely reading BAM and outputting VCF. So there is no re-reading of the VCF stream into BCF-internals and no potential for incorrect in-memory BCF processing (I think).
Next would be mpileup | call, which has an additional VCF->BCF->VCF step:
bcftools mpileup --skip-indels -f ~/db/hg38.fa.gz -r chr1:180951 ${sample}.bam | bcftools call -vm - | grep 180951
That would output VCF from mpileup and then call the variant using call
(which reads VCF into BCF-in-memory, processes it, and then reformats back out as VCF again).
Finally that same command can be done without mpileup outtping as VCF, using uBCF directly (as you did):
bcftools mpileup -Ou --skip-indels -f ~/db/hg38.fa.gz -r chr1:180951 ${sample}.bam | bcftools call -vm - | grep 180951
I think those commands are correct anyway! Petr could verify. Basically it's an attempt to try and minimise the problem and work out at what stage "81" ended up looking like "818181".
@tkn132 Sorry I did not read the entire thread thoroughly. Please ignore everything what was said above, both mine and @jkbonfield responses.
There are two problems:
1) as stated above, DP is integer, not float
2) however, the main problem is that your query
expression does not put a delimiter between FORMAT values. Consider these examples
# concatenates values as they are
$ bcftools query -f '[ %DP]\n'
8181
# puts space in between samples
$ bcftools query -f '[%DP ]\n'
81 81
# puts comma in between samples
$ bcftools query -f '[%DP,]\n'
81,81,
In the original text I did not notice it was a multi-sample calling and the formatting expression did not have a delimiter.
I believe this should solve the problem
Thanks a lot for all the replies.
The problem was indeed in the bcftools query
command. Removing the square brackets works. Note that this is one-sample VCF.
bcftools query -f '%CHROM\t%POS\t%QUAL\t%DP\n' ${sample}.vcf.gz | head
chr1 10231 15.0929 147
chr1 180951 14.812 81
chr1 629906 32.3157 196
chr1 631210 59 72
chr1 632199 38.8419 228
chr1 634008 495 195
chr1 791624 4.13736 2
chr1 791626 4.13736 2
chr1 791638 4.13736 2
chr1 817186 8.64019 1
Still not sure why DP was repeated 3 times.
Removing the brackets makes the difference between using INFO/DP and FORMAT/DP.
Please check the documentation http://samtools.github.io/bcftools/bcftools.html#query http://samtools.github.io/bcftools/howtos/query.html
Hi,
After calling variants from a bam file using the following command:
bcftools mpileup -Ou --skip-indels -f ~/db/hg38.fa.gz ${sample}.bam | bcftools call -vmO z -o ${sample}.vcf.gz && tabix -p vcf ${sample}.vcf.gz
, I found that decimal points of DP values are all missing.I am sure the weirdly big DP values are because of missing decimal points because:
So, 147147147 should be 147.147147 and 818181 should be 81.8181.
Could you please help troubleshooting the problem? Thanks!