Closed kfpbennett closed 6 months ago
This can be achieved by simply streaming the desired region into the program, for example like:
samtools faidx ref.fa chr1:1-100000 | bcftools consensus in.vcf.gz -o out.1-100000.fa
samtools faidx ref.fa chr1:100001-200000 | bcftools consensus in.vcf.gz -o out.100001-200000.fa
...
For now there is no plan to have consensus
drop regions before the first and after the last variant. In fact, the existing functionality seems better suited for the use case you've described.
Of course! This works great. Thanks for the quick response.
Hello. I'd like to suggest adding a function to bcftools consensus to only output a sequence bounded by the start and end of the input VCF file.
I am writing a script that extends the use of another tool (dfoil: https://github.com/jbpease/dfoil) which relies on a multi-sequence alignment by allowing it to run on genomic windows. To do that, I am pulling out sequences from a very large VCF based on windowed .bed files, aligning them, and feeding them into dfoil. However, when I generate the fasta from the subset VCF using bcftools consensus, I can only generate a consensus sequence for the whole genome (or the whole chromosome if I make an additional fasta). I can use the -a option to add a dummy nucleotide that I can then remove with sed, but when repeated thousands of times across all windows, it adds significantly to the total run time. Could you add something similar to an -R option to consensus so it only pulls out the region bounded by the beginning and end of the VCF?
Thank you!