samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
640 stars 241 forks source link

Suggested function: bcftools consensus region #2080

Closed kfpbennett closed 6 months ago

kfpbennett commented 6 months ago

Hello. I'd like to suggest adding a function to bcftools consensus to only output a sequence bounded by the start and end of the input VCF file.

I am writing a script that extends the use of another tool (dfoil: https://github.com/jbpease/dfoil) which relies on a multi-sequence alignment by allowing it to run on genomic windows. To do that, I am pulling out sequences from a very large VCF based on windowed .bed files, aligning them, and feeding them into dfoil. However, when I generate the fasta from the subset VCF using bcftools consensus, I can only generate a consensus sequence for the whole genome (or the whole chromosome if I make an additional fasta). I can use the -a option to add a dummy nucleotide that I can then remove with sed, but when repeated thousands of times across all windows, it adds significantly to the total run time. Could you add something similar to an -R option to consensus so it only pulls out the region bounded by the beginning and end of the VCF?

Thank you!

pd3 commented 6 months ago

This can be achieved by simply streaming the desired region into the program, for example like:

samtools faidx ref.fa chr1:1-100000 | bcftools consensus in.vcf.gz -o out.1-100000.fa
samtools faidx ref.fa chr1:100001-200000 | bcftools consensus in.vcf.gz -o out.100001-200000.fa
...

For now there is no plan to have consensus drop regions before the first and after the last variant. In fact, the existing functionality seems better suited for the use case you've described.

kfpbennett commented 6 months ago

Of course! This works great. Thanks for the quick response.