ksamuk / pixy

Software for painlessly estimating average nucleotide diversity within and between populations
https://pixy.readthedocs.io/
MIT License
115 stars 14 forks source link

'DP' FORMAT header not found #112

Closed pileei closed 1 month ago

pileei commented 1 month ago

Dear Kieran,

I generated AllSites VCFs using BCFtools as described in the Generating invariant sites VCFs section and when I run pixy I encounter this error:

io/vcf_read.py:1248: UserWarning: 'DP' FORMAT header not found
  warnings.warn('%r FORMAT header not found' % name)

[pixy] WARNING: pixy failed to find any valid gentoype data to calculate the following summary statistics: fst. No output file will be created for these statistics.

The header looks like this:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##bcftoolsVersion=1.19+htslib-1.19
##bcftoolsCommand=mpileup -f /data/GCA_013435755.1/GCA_013435755.1_ASM1343575v1_genomic.fna -b bamlist.txt -r CM024214.1
##reference=file:/data/GCA_013435755.1/GCA_013435755.1_ASM1343575v1_genomic.fna
##contig=<ID=CM024192.1,length=43087560>
##contig=<ID=CM024193.1,length=37655351>
##contig=<ID=CM024194.1,length=62474119>
##contig=<ID=CM024195.1,length=33436375>
##contig=<ID=CM024196.1,length=39570627>
##contig=<ID=CM024197.1,length=43153764>
##contig=<ID=CM024198.1,length=29956463>
##contig=<ID=CM024199.1,length=32221801>
##contig=<ID=CM024200.1,length=35827127>
##contig=<ID=CM024201.1,length=40343243>
##contig=<ID=CM024202.1,length=38942605>
##contig=<ID=CM024203.1,length=33684897>
##contig=<ID=CM024204.1,length=42035881>
##contig=<ID=CM024205.1,length=33808761>
##contig=<ID=CM024206.1,length=37174816>
##contig=<ID=CM024207.1,length=25990339>
##contig=<ID=CM024208.1,length=37624702>
##contig=<ID=CM024209.1,length=30829707>
##contig=<ID=CM024210.1,length=37146573>
##contig=<ID=CM024211.1,length=34332003>
##contig=<ID=CM024212.1,length=32234835>
##contig=<ID=CM024214.1,length=16521>
##ALT=<ID=*,Description="Represents allele(s) other than observed.">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of raw reads supporting an indel">
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of raw reads supporting an indel">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version="3">
##INFO=<ID=RPBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Read Position Bias (closer to 0 is better)">
##INFO=<ID=MQBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Mapping Quality Bias (closer to 0 is better)">
##INFO=<ID=BQBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Base Quality Bias (closer to 0 is better)">
##INFO=<ID=MQSBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Mapping Quality vs Strand Bias (closer to 0 is better)">
##INFO=<ID=SCBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Soft-Clip Length Bias (closer to 0 is better)">
##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric, http://samtools.github.io/bcftools/rd-SegBias.pdf">
##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Phred-scaled Genotype Quality">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Average mapping quality">
##bcftools_callVersion=1.19+htslib-1.19
##bcftools_callCommand=call -m -Oz -f GQ -o 89.all-MT.vcf.gz; Date=Wed Jul  3 03:06:00 2024

The vcf file first two lines are:

CM024192.1  1   .   A   .   58.9855 .   DP=1;MQ0F=0;AN=2;DP4=1,0,0,0;MQ=60  GT  ./. ./. ./. ./. ./. ./. ./. ././.   ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. 0/0 ./. ./. ./. ./. ./. ./. ./. ./. ././.   ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ././.   ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ././.   ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./.
CM024192.1  2   .   C   .   58.9855 .   DP=1;MQ0F=0;AN=2;DP4=1,0,0,0;MQ=60  GT  ./. ./. ./. ./. ./. ./. ./. ././.   ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. 0/0 ./. ./. ./. ./. ./. ./. ./. ./. ././.   ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ././.   ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ././.   ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./.

Have you encounter this error before?

Thanks in advance!

ksamuk commented 1 month ago

Hi there, looks like your VCF is not valid because it is missing the header entry for the DP FORMAT field. A quick fix would just be to add it in manually.

pileei commented 1 month ago

Thanks for your answer. I'm new in this field and I don't know how to fix the vcf the way you mentioned.

I added the line ##FORMAT= to the vcf header and pixy was able to run. When it ended, all avg_pi and avg_dxy values were NA values.

I think that I will need to add the DP at the column of the FORMAT in the vcf but I do not know how to do it.

Again, many thanks for your help.