samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
640 stars 241 forks source link

Missing decimal points in DP #2075

Closed tkn132 closed 6 months ago

tkn132 commented 6 months ago

Hi,

After calling variants from a bam file using the following command: bcftools mpileup -Ou --skip-indels -f ~/db/hg38.fa.gz ${sample}.bam | bcftools call -vmO z -o ${sample}.vcf.gz && tabix -p vcf ${sample}.vcf.gz, I found that decimal points of DP values are all missing.

bcftools query -f '%CHROM\t%POS\t%QUAL\t[%DP]\n' PNG1208F.vcf.gz | head
chr1    10231   15.0929 147147147
chr1    180951  14.812  818181
chr1    629906  32.3157 196196196
chr1    631210  59      727272
chr1    632199  38.8419 228228228
chr1    634008  495     195195195
chr1    791624  4.13736 222
chr1    791626  4.13736 222
chr1    791638  4.13736 222
chr1    817186  8.64019 111

I am sure the weirdly big DP values are because of missing decimal points because:

bcftools view -i 'DP<100' ${sample}.vcf.gz | bcftools query -f '%CHROM\t%POS\t%QUAL\t[%DP]\n'  | head
chr1    180951  14.812  818181
chr1    631210  59      727272
chr1    791624  4.13736 222
chr1    791626  4.13736 222
chr1    791638  4.13736 222
chr1    817186  8.64019 111
chr1    839543  5.45831 111
chr1    876061  6.8953  333
chr1    876081  8.64019 111
chr1    876355  8.64019 111

So, 147147147 should be 147.147147 and 818181 should be 81.8181.

Could you please help troubleshooting the problem? Thanks!

jkbonfield commented 6 months ago

It would be interesting to see the full line of text in the vcf.gz file rather than re-reading it and printing it out agvain with query.

Something like zgrep chr1.180951 sample.vcf.gz would give you the full line. (Doesn't need to run to completion - it could take ages, but this is a line early on in the file so once it's printed up you'll know)

That would tell us whether the original mpileup / call commands worked or not.

Edit: also reporting bcftools --verson is always useful when reporting problems so we know if this is an up to date binary.

Also, note DP isn't floating point. So it's not a case of decimal places missing. It feels more like some parsing/printing string stutter or BCF byte-packing bug to me. It's notable that all of these wrong values are repeating the same string.

pd3 commented 6 months ago

DP is defined as Type=Integer, not as Type=Float, therefore there will be no decimal points. This information can be found in the VCF header.

jkbonfield commented 6 months ago

Actually, we can cut down all the way to the initial mpileup too:

bcftools mpileup --skip-indels -f ~/db/hg38.fa.gz -r  chr1:180951  ${sample}.bam | grep 180951

That is purely reading BAM and outputting VCF. So there is no re-reading of the VCF stream into BCF-internals and no potential for incorrect in-memory BCF processing (I think).

Next would be mpileup | call, which has an additional VCF->BCF->VCF step:

bcftools mpileup --skip-indels -f ~/db/hg38.fa.gz -r  chr1:180951  ${sample}.bam  | bcftools call -vm - | grep 180951

That would output VCF from mpileup and then call the variant using call (which reads VCF into BCF-in-memory, processes it, and then reformats back out as VCF again).

Finally that same command can be done without mpileup outtping as VCF, using uBCF directly (as you did):

bcftools mpileup -Ou --skip-indels -f ~/db/hg38.fa.gz -r  chr1:180951  ${sample}.bam  | bcftools call -vm - | grep 180951

I think those commands are correct anyway! Petr could verify. Basically it's an attempt to try and minimise the problem and work out at what stage "81" ended up looking like "818181".

pd3 commented 6 months ago

@tkn132 Sorry I did not read the entire thread thoroughly. Please ignore everything what was said above, both mine and @jkbonfield responses.

There are two problems:

1) as stated above, DP is integer, not float 2) however, the main problem is that your query expression does not put a delimiter between FORMAT values. Consider these examples

# concatenates values as they are
$ bcftools query -f '[ %DP]\n'
8181

# puts space in between samples
$ bcftools query -f '[%DP ]\n'
81 81

# puts comma in between samples
$ bcftools query -f '[%DP,]\n'
81,81,

In the original text I did not notice it was a multi-sample calling and the formatting expression did not have a delimiter.

I believe this should solve the problem

tkn132 commented 6 months ago

Thanks a lot for all the replies.

The problem was indeed in the bcftools query command. Removing the square brackets works. Note that this is one-sample VCF.

bcftools query -f '%CHROM\t%POS\t%QUAL\t%DP\n' ${sample}.vcf.gz | head
chr1    10231   15.0929 147
chr1    180951  14.812  81
chr1    629906  32.3157 196
chr1    631210  59      72
chr1    632199  38.8419 228
chr1    634008  495     195
chr1    791624  4.13736 2
chr1    791626  4.13736 2
chr1    791638  4.13736 2
chr1    817186  8.64019 1

Still not sure why DP was repeated 3 times.

pd3 commented 6 months ago

Removing the brackets makes the difference between using INFO/DP and FORMAT/DP.

Please check the documentation http://samtools.github.io/bcftools/bcftools.html#query http://samtools.github.io/bcftools/howtos/query.html