lindenb / jvarkit

Java utilities for Bioinformatics
https://jvarkit.readthedocs.io/
Other
481 stars 133 forks source link

Bug in msa2vcf ? #232

Open wwood opened 1 year ago

wwood commented 1 year ago

Subject of the issue

Output seems incorrect.

Your environment

On Linux, see output for version info

Steps to reproduce

With this input:

$ head ~/t/eg.aln ~/t/eg_ref.fna
==> /home/woodcrob/t/eg.aln <==
>ref
AGGATTACTCGCTTTTTGTTAGTCATCGGTATCTGGACCCTGATAGCTGTTGCTTCCCAC
>mut
AGCATTACT-GCTTTTTGTTAGTCATCGGTATCTGGACCCCGATAGCTGATGCTTCCCAC

==> /home/woodcrob/t/eg_ref.fna <==
>ref
AGGATTACTCGCTTTTTGTTAGTCATCGGTATCTGGACCCTGATAGCTGTTGCTTCCCAC

and running

$ java -jar /mnt/hpccs01/home/woodcrob/bioinfo/jvarkit/dist/jvarkit.jar msa2vcf -R ~/t/eg_ref.fna ~/t/eg.aln 
[INFO][MsaToVcf]format : Fasta
##fileformat=VCFv4.2
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##contig=<ID=/home/woodcrob/t/eg_ref.fna,length=60>
##msa2vcf.meta=compilation:20230331082601 githash:8e5f425de htsjdk:3.0.4 date:20230331125743 cmd:-R /home/woodcrob/t/eg_ref.fna /home/woodcrob/t/eg.aln
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  mut     ref
/home/woodcrob/t/eg_ref.fna     3       .       C       G       .       .       DP=2    GT:DP   0/0:1   1/1:1
/home/woodcrob/t/eg_ref.fna     9       .       TC      T       .       .       DP=2    GT:DP   1/1:1   0/0:1
/home/woodcrob/t/eg_ref.fna     41      .       C       T       .       .       DP=2    GT:DP   0/0:1   1/1:1
/home/woodcrob/t/eg_ref.fna     50      .       A       T       .       .       DP=2    GT:DP   0/0:1   1/1:1

The first 2 are right, but the 2nd two aren't - they are T->C not C->T, and T->A not A->T. Did I get that right? Here's the blast output to help visualise

REF| Query  1   AGGATTACTCGCTTTTTGTTAGTCATCGGTATCTGGACCCTGATAGCTGTTGCTTCCCAC  60
                || |||||| |||||||||||||||||||||||||||||| |||||||| ||||||||||
ALT| Sbjct  1   AGCATTACT-GCTTTTTGTTAGTCATCGGTATCTGGACCCCGATAGCTGATGCTTCCCAC  59

Otherwise this tools seems like exactly what we need, so any help much appreciated. Thanks, ben

wwood commented 1 year ago

FWIW, it seems like that switch happens after an indel, as observed with a second example not shown.

lindenb commented 1 year ago

Hi, I think there is a misunderstanding here:

when using a fasta as input, the REFERENCE among the fasta records must be specified using option -R. Otherwise the most frequent base is used as ref.

then option -f is used to save the consensus:

    -f, --fasta
      save computed fasta sequence in this file.
lindenb commented 1 year ago

furthermore, you should also have a look at : https://github.com/sanger-pathogens/snp_sites

wwood commented 1 year ago

Hi,

Thanks for the quick response. You're right, I hadn't quite appreciated about the consensus. From the help it says

    -R, --REF
      reference name used for the CHROM column. Optional
      Default: chrUn

So as I understand, that only changes what appears in the first column, and doesn't impact what the sequence of the reference is. Are you saying it also changes the consensus?

Anyway, I tried to workaround by adding the same ref sequence twice, slightly changing the name of one to maintain uniqueness. Then finally the non-ref sequence. So the consensus should always be the reference.

That worked for the above example, but it seemed to do something unexpected when the reference has gaps. Specifically, including gaps, the input sequences were 5002 bp, and 5000 bp not including them. I was expecting the consensus to be 5000 bp, but it comes out as 5002 bp. Is that expected behaviour? If you don't quite understand I can cook up a reduced example as above?

snp_sites is what I tried first, but that doesn't seem to have any option to report INDELs, only SNPs. Maybe there's something I missed?

Anyway, happy to help further here if helpful for you, but I just decided to write this code myself, since I only have 2 sequences for my use-case. Let me know.