samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
671 stars 240 forks source link

Warning Sequence not in vcf.gz, leading to no variant applied #1797

Closed wk1267 closed 2 years ago

wk1267 commented 2 years ago

Hi, I am trying to write a pipeline that does lofreq variant calling on a bam file sample, then get the consensus sequence from the vcf file generated. Lofreq results show that substitution tests are performed, indicating that there should be variants called.

Number of substitution tests performed: 88908
Number of indel tests performed: 0

But, when I do consensus calling, I encounter the following warning message. (This ends up giving me the original ref.fa file).

Note: the --sample option not given, applying all records regardless of the genotype
Warning: Sequence "ref|NC_045512.2|" not in ERR7931092.vcf.gz
Applied 0 variants

Please advise. Thank you!

For reference, this is the code I've used:

#index reference genome
samtools faidx ref.fa
#align with bwa mem
bwa mem ref.fa ERR7931092.fastq | samtools sort -o ERR7931092.bam; 
#call lofreq
samtools index ERR7931092.bam; lofreq call-parallel --pp-threads 4 -f ref.fa -o ERR7931092.vcf ERR7931092.bam; 
#index vcf files
bgzip -c ERR7931092.vcf > ERR7931092.vcf.gz; tabix -p vcf ERR7931092.vcf.gz; 
#get consensus sequence
bcftools consensus -f ref.fa ERR7931092.vcf.gz > ERR7931092.fa

Running bcftools query -f'%CHROM:%POS %REF %ALT\n' ERR6671299.vcf.gz gives me:

ref|NC_045512.2|:1-29903:64  C A
ref|NC_045512.2|:1-29903:71  C T
ref|NC_045512.2|:1-29903:75  C T

Meanwhile, my ref.fa begins with: >ref|NC_045512.2|:1-29903 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome

These are the versions used: bcftools 1.15.1 Using htslib 1.15.1

pd3 commented 2 years ago

This is caused by the unfortunate naming of your contigs.

bcftools consensus accepts fragments of fasta files on input which is handy when one needs consensus of, say, a list of genes. In such cases, the contig names in the fast file would be named >chr:beg-end. Now you see the problem. In your case the :beg-end is part of the contig name and thus the contig name is interpreted as ref|NC_045512.2|.

The easiest solution is to drop the :1-29903 part from the name. Not sure how to solve this best on bcftools side. Maybe by adding an option forbidding streaming of contig fragments.

Another possibility would be to add a check to confirm if ref|NC_045512.2|:1-29903 or ref|NC_045512.2| is the contig name. However, this would fail if the VCF contained both ref|NC_045512.2|:1-29903 and ref|NC_045512.2|.

jmarshall commented 2 years ago

The file formats group spent a while thinking about these issues, and there is useful advice in Appendix A of the SAM spec. I think some of the HTSlib API functions implement the ideas there too — namely the “confirm which one is a contig name” check @pd3 mentions and the curly bracket syntax.

pd3 commented 2 years ago

The appendix suggests what I proposed, the ambiguous case highlighted above would remain unresolved. The curly bracket syntax is good for use with command line interface, but not in the fasta file headers. Your best bet @wk1267 is to rename the contigs, for example replacing the colon with an underscore.

wk1267 commented 2 years ago

Thank you for your responses! I have renamed the contigs using bcftools annotate. This was able to solve the problem.