samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
634 stars 241 forks source link

bcftools consensus --mark-ins #2124

Open loieeyet opened 3 months ago

loieeyet commented 3 months ago

Hello,

I'm a researcher who has been using bcftools effectively. I was very grateful for the update you provided last time that allowed inserting characters into mark-del. This time, I'd like to discuss the functionality of mark-ins.

Here's an example of a VCF for a single sample:

chr1 786694 - A ATT . PASS AC=- GT 0|1

When running the following command:

samtools faidx Homo_sapiens_assembly38.fasta chr1:786694-786700 | 
bcftools consensus \
--mark-ins lc \ 
--mark-del d \
-H 1 \  # or 2
-s  sample1 \
test.vcf.gz

With H set to 1, the output is ACGCAT, and with H set to 2, the output becomes AttCGCAT.

In this scenario, I would like the output to be A--CGCAT when H is set to 1. In situations of heterozygosity where one allele has an insertion, I'd like to fill in specific characters to ensure both sequences are of equal length.

Do you have plans to support this feature? If not, I would be very grateful if you could direct me to which part to refer to or modify so that I can implement this feature myself.

pd3 commented 3 months ago

How would it work if the genotype was, say, triploid or tetraploid, and there were multiple indels of different length? I am struggling to define the behavior for a generalized case.

loieeyet commented 3 months ago

Considering I was mainly focused on human haploid situations, I realize it's quite a task for you to design a tool that covers generalized cases.

Thinking it over, it seems challenging to add new functionality to the --mark-ins option. What about introducing an option called --pad-insertion CHAR? This would pad sequences at insertion points with a specific CHAR to make lengths consistent, except for the alt allele. It would specifically apply to phased data and require a certain --haplotype N option to be set. For triploid or tetraploid scenarios, this option could take into account the various genotypes, producing a consensus sequence that's appropriately padded. And when dealing with multiple indels of differing lengths, maybe padding according to the longest indel might be a good strategy.