Open Yatish0833 opened 2 years ago
Could you please be more specific? I'm not quite sure what you mean by that.
I want to preserve the identity of each sequence aligned to the reference. For example, I am currently using mafft to align the sequences to the reference and then generating a VCF file from that alignment with snp-sites. The resulting VCF file looks like the following:
1 1 . G ,A,T . . AC=218,335,5,0,0,0,0,0;AN=10520 GT 0 0 0 1
1 2 . G ,N,A . . AC=216,330,1,0,0,1,0,0;AN=10520 GT 0 0 0 1
1 3 . C ,T,Y . . AC=216,332,1,0,0,0,0;AN=10520 GT 0 0 0 1
1 4 . T ,N,A,C,K . . AC=216,329,0,0,0,0,0,0;AN=10520 GT 0 0 0 1
1 5 . G ,T,N, . . AC=216,3,327,0,0,0,0,0;AN=10520 GT 0 0 0 1
1 6 . C ,N,T,Y . . AC=216,325,138,1,0,0,0,0;AN=10520 GT 0 0 0 1
In comparison, GSAlign produced the following VCF file after aligning the same dataset:
reference_sequence 16 . C CC 100 TYPE=INSERT
reference_sequence 16 . C CC 100 TYPE=INSERT
reference_sequence 19 . C T 100 TYPE=SUBSTITUTE
reference_sequence 19 . C T 100 TYPE=SUBSTITUTE
reference_sequence 19 . C T 100 TYPE=SUBSTITUTE
reference_sequence 19 . C T 100 TYPE=SUBSTITUTE
reference_sequence 19 . C T 100 * TYPE=SUBSTITUTE
This VCF file also has all the variant information, but in the GSAlign format above I can't tell which sample (sample 1, 2, 3, 4, etc.) each record comes from. I hope that clarifies the issue a little.
Hi there, has this issue been resolved? GSAlign's VCF output doesn't have the FORMAT and genotype information.
I have a bunch (~30k) of FASTA sequences that I am aligning to a single reference sequence. The generated VCF file loses the identity of these ~30k sequences. Is there an option to preserve the individual samples in one-to-many alignments?
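In case it helps anyone with the same problem, here is a minimal sketch of one possible workaround. It is not a GSAlign feature. It assumes the variant records have already been collected per sample (for example by aligning each query separately and parsing each output; that step is an assumption and is not shown). The function name `merge_calls` and the sample names are hypothetical. It merges the per-sample records into one table with a FORMAT column and one GT column per sample, similar in spirit to the snp-sites output.

```python
from collections import defaultdict

def merge_calls(per_sample):
    """Merge per-sample variant records into multi-sample VCF-style lines.

    per_sample: dict mapping sample name -> list of (chrom, pos, ref, alt).
    Assumption: a sample with no record at a site is written as 0
    (reference allele), although it may simply be uncovered there.
    If a sample has several records at one site, the last one wins.
    """
    samples = list(per_sample)
    alts = defaultdict(list)   # site -> ALT alleles in order first seen
    calls = {}                 # (site, sample) -> ALT allele
    for sample, records in per_sample.items():
        for chrom, pos, ref, alt in records:
            site = (chrom, pos, ref)
            if alt not in alts[site]:
                alts[site].append(alt)
            calls[(site, sample)] = alt

    header = ["#CHROM", "POS", "ID", "REF", "ALT",
              "QUAL", "FILTER", "INFO", "FORMAT"] + samples
    lines = ["\t".join(header)]
    for site in sorted(alts, key=lambda s: (s[0], s[1])):
        chrom, pos, ref = site
        genotypes = []
        for sample in samples:
            alt = calls.get((site, sample))
            # GT is the 1-based index into the ALT list, 0 for reference
            genotypes.append("0" if alt is None
                             else str(alts[site].index(alt) + 1))
        fields = [chrom, str(pos), ".", ref, ",".join(alts[site]),
                  ".", ".", ".", "GT"] + genotypes
        lines.append("\t".join(fields))
    return lines

# Hypothetical per-sample records in the style of the GSAlign output above
per_sample = {
    "sample1": [("reference_sequence", 16, "C", "CC"),
                ("reference_sequence", 19, "C", "T")],
    "sample2": [("reference_sequence", 19, "C", "T")],
    "sample3": [],
}
for line in merge_calls(per_sample):
    print(line)
```

With ~30k samples the table gets wide, but each column keeps the name of the sequence it came from, which is the information missing from the single-column output.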