lindenb / jvarkit

Java utilities for Bioinformatics
https://jvarkit.readthedocs.io/
Other
487 stars 133 forks source link

msa2vcf can't use given Ref #252

Closed NailouZhang closed 3 months ago

NailouZhang commented 3 months ago

Hi, When I try MsaToVcf to get vcf file from msa. I set " msa2vcf.py --REF Seq1 NN.fsa -o msa2vcf.vcf ", in msa2vcf.vcf, the #CHROM is Seq1, But the Ref not the sequence from Seq1. In example of msa2vcf (https://jvarkit.readthedocs.io/en/latest/MsaToVcf/), the Ref seems to be the consensus fasta or others but not Seq1.

How Can I make the Ref with sequence form Seq1/Ref ?

lindenb commented 3 months ago

Hi,

msa2vcf.py

I didn't write such program, there is no such option --REF

NailouZhang commented 3 months ago

I test https://anaconda.org/bioconda/jvarkit-msa2vcf

lindenb commented 3 months ago

I see, I didn't write that conda package.

Please use the latest java archive, and provide a reproducible example please.

NailouZhang commented 3 months ago

Hi, I get the same results with "jvarkit msa2vcf -R Contig112 -o aaa.vcf aaa_alignment.aln [INFO][MsaToVcf]format : Clustal " Can you help me with this?


jvarkit --help JVARKIT

Author : Pierre Lindenbaum Phd. Institut du Thorax. Nantes. France. Version : undefined Compilation : 20240801122845 Github : https://github.com/lindenb/jvarkit Issues : https://github.com/lindenb/jvarkit/issues

Usage

  java -jar jvarkit.jar [options]

or

  java -jar jvarkit.jar <command name> (other arguments)

Options

aaa_alignment.txt

lindenb commented 3 months ago

I get the expected output where the param -R is used as the name of the chromosome in the 'CHROM' column. .

java -jar jvarkit.jar msa2vcf -R Contig112 input.txt  | grep -v "##" | cut -f1-20 | head
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Contig112   HQ834507.1  JF421281.1  JF421282.1  JQ665905.1  KC844226.1  KC844227.1  KY283955.1KY283956.1    MN478383.1  MN478385.1
Contig112   0   .   TAGTAGTAGACTCCCTAAAGAGCTACTATAACAACG    ACAACG,TAGTAGTAGACTCCCTAAAGAGCTACGATAACAACA,TAGTAGTAGGCTCCCTAAAGAGCTACTATAACAACG,TAGTAGTAGACTCCCTAAAGAGCTACTATAACAACA,TAGTAGTATACTCCCTAAAAAGCTACTGTAACAACG,GACCCCTTTCGTGATAGTGAATACGGTCGTAAGAGCTACTATAANACG,TGTGTGTGAGTGATGTATCGATTAAAGAGCTACTATANCAACG .   .   DP=18   GT:DP   1/1:1   3/3:1   0/0:1   0/0:1   5/5:1   0/0:1   0/0:1   6/6:1   7/7:1   0/0:1   0/0:1
Contig112   54  .   GCAA    GCAT,GGCA   .   .   DP=23   GT:DP   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   1/1:1   0/0:1
Contig112   76  .   G   A   .   .   DP=23   GT:DP   1/1:1   0/0:1   0/0:1   0/0:1   1/1:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1
Contig112   95  .   G   A   .   .   DP=23   GT:DP   0/0:1   1/1:1   1/1:1   1/1:1   0/0:1   1/1:1   1/1:1   1/1:1   1/1:1   0/0:1   0/0:1
Contig112   98  .   C   T   .   .   DP=23   GT:DP   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   1/1:1   1/1:1
Contig112   102 .   C   T   .   .   DP=23   GT:DP   1/1:1   0/0:1   0/0:1   0/0:1   1/1:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1
Contig112   104 .   G   A   .   .   DP=23   GT:DP   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1
Contig112   107 .   G   A   .   .   DP=23   GT:DP   1/1:1   0/0:1   0/0:1   0/0:1   1/1:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1
Contig112   125 .   G   A   .   .   DP=23   GT:DP   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1
NailouZhang commented 3 months ago

I expect results that in column REF: the sequence should be from Contig112, then it should be ACAACG, here it is TAGTAGTAGACTCCCTAAAGAGCTACTATAACAACG

so the sequence in the column REF is not from Contig112.

lindenb commented 3 months ago

not sure I understand (I wrote this program years ago and I never use it...) , may be you want option -c "use sequence xxxx as consensus":


$ java -jar jvarkit.jar msa2vcf -c Contig112 input.txt  | grep -v "##" | cut -f1-20 | head

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Contig112   HQ834507.1  JF421281.1  JF421282.1  JQ665905.1  KC844226.1  KC844227.1  KY283955.1KY283956.1    MN478383.1  MN478385.1
chrUn   0   .   ACAACG  TAGTAGTAGACTCCCTAAAGAGCTACTATAACAACG,TAGTAGTAGACTCCCTAAAGAGCTACGATAACAACA,TAGTAGTAGGCTCCCTAAAGAGCTACTATAACAACG,TAGTAGTAGACTCCCTAAAGAGCTACTATAACAACA,TAGTAGTATACTCCCTAAAAAGCTACTGTAACAACG,GACCCCTTTCGTGATAGTGAATACGGTCGTAAGAGCTACTATAANACG,TGTGTGTGAGTGATGTATCGATTAAAGAGCTACTATANCAACG   .   .   DP=18   GT:DP   0/0:1   3/3:1   1/1:1   1/1:1   5/5:1   1/1:1   1/1:1   6/6:1   7/7:1   1/1:1   1/1:1
chrUn   54  .   GCAA    GCAT,GGCA   .   .   DP=23   GT:DP   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   1/1:1   0/0:1
chrUn   76  .   A   G   .   .   DP=23   GT:DP   0/0:1   1/1:1   1/1:1   1/1:1   0/0:1   1/1:1   1/1:1   1/1:1   1/1:1   1/1:1   1/1:1
chrUn   95  .   G   A   .   .   DP=23   GT:DP   0/0:1   1/1:1   1/1:1   1/1:1   0/0:1   1/1:1   1/1:1   1/1:1   1/1:1   0/0:1   0/0:1
chrUn   98  .   C   T   .   .   DP=23   GT:DP   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   1/1:1   1/1:1
chrUn   102 .   T   C   .   .   DP=23   GT:DP   0/0:1   1/1:1   1/1:1   1/1:1   0/0:1   1/1:1   1/1:1   1/1:1   1/1:1   1/1:1   1/1:1
chrUn   104 .   G   A   .   .   DP=23   GT:DP   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1
chrUn   107 .   A   G   .   .   DP=23   GT:DP   0/0:1   1/1:1   1/1:1   1/1:1   0/0:1   1/1:1   1/1:1   1/1:1   1/1:1   1/1:1   1/1:1
chrUn   125 .   G   A   .   .   DP=23   GT:DP   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1   0/0:1
NailouZhang commented 3 months ago

yes, The results are what I wanted. Thank you very much.