lindenb / jvarkit

Java utilities for Bioinformatics
https://jvarkit.readthedocs.io/
Other
476 stars 131 forks source link

indel calls with msa2vcf? #219

Closed chazgoo closed 1 year ago

chazgoo commented 1 year ago

Verify

verified, program runs fine (jdk11 assigned correctly to $PATH)

Subject of the issue

INDEL calls not included in output

Your environment

Steps to reproduce

Ran software on a whole genome alignment of 50 x 140kb samples as follows

cat msa.fasta | java -jar ~/software/jvarkit/dist/msa2vcf.jar -R reference -m > test.vcf 

Also attempted -a, and using reference as -c arg

Expected behaviour

To include INDELs in the output VCF

Actual behaviour

SNPs seem to show up fine, but larger INDELs are absent. From the documentation, there doesn't seem to be a setting I can modify. Is there anything I can do about this?

lindenb commented 1 year ago

Hi, It's hard to answer without seeing the content of msa.fa . Anyway, this tool has been superseded by snp_sites , you should try it: https://github.com/sanger-pathogens/snp_sites

chazgoo commented 1 year ago

Thank you for the reply!

I need to offer a mea culpa - this was an issue of reference-based positions vs. alignment-based positions. Your program is making the expected calls with respect to the alignment position. I happen to be working with a highly gapped sample set, and my answer was a few thousand bases from my expected position.

I'm going to close this issue, and as a parting thought, if you ever wanted to add a feature that allows positions to be called based on the un-gapped position of one of the included sequences, you'd have my undying gratitude. As it is, I've written a python script that reassigns positions to handle that - if you're interested!

Thank you again for your attention here!