hsinnan75 / GSAlign

GSAlign: an ultra-fast sequence alignment algorithm for intra-species genome comparison
MIT License

VCF output samples from multi-fasta file #17

Open Yatish0833 opened 2 years ago

Yatish0833 commented 2 years ago

I have a large set (~30k) of FASTA sequences that I am aligning to a single reference sequence. The generated VCF file loses the identity of these ~30k sequences. Is there an option to preserve the individual samples in one-to-many alignments?

hsinnan75 commented 2 years ago

Could you please be more specific? I'm not quite sure what you mean by that.

Yatish0833 commented 2 years ago

I want to preserve the identity of the sequences aligned to the reference. For example, I currently use mafft to align the sequences to the reference and then generate a VCF file from that alignment with snp-sites. The resulting VCF file looks like the following:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample1 sample2 sample3 sample4.....
1 1 . G ,A,T . . AC=218,335,5,0,0,0,0,0;AN=10520 GT 0 0 0 1
1 2 . G ,N,A . . AC=216,330,1,0,0,1,0,0;AN=10520 GT 0 0 0 1
1 3 . C ,T,Y . . AC=216,332,1,0,0,0,0;AN=10520 GT 0 0 0 1
1 4 . T ,N,A,C,K . . AC=216,329,0,0,0,0,0,0;AN=10520 GT 0 0 0 1
1 5 . G ,T,N, . . AC=216,3,327,0,0,0,0,0;AN=10520 GT 0 0 0 1
1 6 . C ,N,T,Y . . AC=216,325,138,1,0,0,0,0;AN=10520 GT 0 0 0 1

By comparison, GSAlign generated the following VCF file after aligning the same dataset:

CHROM POS ID REF ALT QUAL FILTER INFO
reference_sequence 16 . C CC 100 TYPE=INSERT
reference_sequence 16 . C CC 100 TYPE=INSERT
reference_sequence 19 . C T 100 TYPE=SUBSTITUTE
reference_sequence 19 . C T 100 TYPE=SUBSTITUTE
reference_sequence 19 . C T 100 TYPE=SUBSTITUTE
reference_sequence 19 . C T 100 TYPE=SUBSTITUTE
reference_sequence 19 . C T 100 * TYPE=SUBSTITUTE

This VCF file contains all the variant information, but in the GSAlign format above there is no way to tell which of sample1, sample2, sample3, sample4, etc. each record belongs to. I hope that clarifies the issue a little.
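To make the request concrete, here is a rough Python sketch of the kind of output I mean. It is not a GSAlign feature. It assumes each query sequence has been aligned separately and its variant records have already been parsed into (chrom, pos, ref, alt) tuples keyed by sample name. The function name merge_per_sample_records and the input layout are hypothetical. The sketch also assumes that a sample without a record at a site carries the reference allele, which is not true if the sample simply does not cover that site.

```python
def merge_per_sample_records(per_sample):
    """Merge per-sample variant records into multi-sample VCF lines.

    per_sample: dict mapping sample name -> list of (chrom, pos, ref, alt).
    Assumption: a sample with no record at a site is reported as 0 (reference).
    Limitation: if one sample has several records at the same site,
    only the last one is kept in its GT column.
    """
    samples = list(per_sample)
    sites = {}  # (chrom, pos, ref) -> {"alts": [...], "gt": {sample: allele index}}
    for sample, records in per_sample.items():
        for chrom, pos, ref, alt in records:
            site = sites.setdefault((chrom, pos, ref), {"alts": [], "gt": {}})
            if alt not in site["alts"]:
                site["alts"].append(alt)
            # Allele index 0 is REF, so ALT alleles are numbered from 1.
            site["gt"][sample] = site["alts"].index(alt) + 1

    lines = [
        "##fileformat=VCFv4.2",
        '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t" + "\t".join(samples),
    ]
    for (chrom, pos, ref), site in sorted(sites.items()):
        genotypes = [str(site["gt"].get(s, 0)) for s in samples]
        fields = [chrom, str(pos), ".", ref, ",".join(site["alts"]), ".", ".", ".", "GT"]
        lines.append("\t".join(fields + genotypes))
    return lines


# Illustrative input only; the sample names and record assignment are made up.
example = {
    "sample1": [("reference_sequence", 16, "C", "CC"), ("reference_sequence", 19, "C", "T")],
    "sample2": [("reference_sequence", 19, "C", "T")],
    "sample3": [],
}
for line in merge_per_sample_records(example):
    print(line)
```

With one GT column per query sequence, the sample identity survives in the same way it does in the snp-sites output.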

teketelah commented 2 years ago

Hi there, has this issue been resolved? The GSAlign VCF output doesn't have the FORMAT and genotype information.
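For anyone checking their own files, a minimal Python sketch like the following reports which sample columns a VCF column header line declares. The helper name sample_columns is hypothetical, and it only inspects the header line, not the records.

```python
def sample_columns(header_line):
    """Return the sample names that follow the FORMAT column, or [] if there is none."""
    fields = header_line.lstrip("#").split()
    if "FORMAT" not in fields:
        return []
    return fields[fields.index("FORMAT") + 1:]


# Header lines as quoted in this thread (sample names shortened for illustration).
print(sample_columns("CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample1 sample2"))
print(sample_columns("CHROM POS ID REF ALT QUAL FILTER INFO"))
```

An empty list means the file has no per-sample genotype columns, which is the situation described above.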