hsinnan75 / GSAlign

GSAlign: an ultra-fast sequence alignment algorithm for intra-species genome comparison
MIT License
51 stars 16 forks

Conflicting Ref Alleles #23

Open genomeboy opened 6 months ago

genomeboy commented 6 months ago

I think this issue is related to something that has been mentioned previously here, but I'm not sure that it is resolved.

I am using GSAlign to align and call variants from a series of assembled genomes from cultivars (individuals) within a species. I have used exactly the same reference genome file for each alignment. At some positions I am getting different REFERENCE bases in the VCF files (example below):

(ncbi_datasets) xxxx@server:~/pangenome/test$ grep 5535808 genome1.vcf
chr1 5535808 . T c 100 * TYPE=SUBSTITUTE

(ncbi_datasets) xxxx@server:~/pangenome/test$ grep 5535808 genome2.vcf
chr1 5535808 . A g 100 * TYPE=SUBSTITUTE

The correct ref allele is "A" in this case.
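For anyone who wants to see how widespread this is, here is a rough Python sketch that lists every site where the REF column disagrees between VCF files. It is only an illustration: the function names are made up for this example, and it assumes one record per site and whitespace-separated columns with no spaces inside fields. The two records above are inlined as sample input.

```python
from collections import defaultdict


def parse_ref_alleles(vcf_lines):
    """Map (chrom, pos) -> upper-cased REF allele for each data line.

    Assumes whitespace-separated columns (tabs or spaces) and at most
    one record per site; header lines starting with '#' are skipped.
    """
    refs = {}
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        chrom, pos, ref = fields[0], int(fields[1]), fields[3]
        refs[(chrom, pos)] = ref.upper()
    return refs


def conflicting_refs(named_vcfs):
    """Return {site: {name: REF}} for sites whose REF differs between inputs."""
    seen = defaultdict(dict)
    for name, lines in named_vcfs.items():
        for site, ref in parse_ref_alleles(lines).items():
            seen[site][name] = ref
    return {
        site: refs
        for site, refs in seen.items()
        if len(set(refs.values())) > 1
    }


# The two records from this report, used as sample input.
genome1 = ["chr1 5535808 . T c 100 * TYPE=SUBSTITUTE"]
genome2 = ["chr1 5535808 . A g 100 * TYPE=SUBSTITUTE"]

conflicts = conflicting_refs({"genome1.vcf": genome1, "genome2.vcf": genome2})
print(conflicts)
```

For real files, the lists can be replaced with open file handles, since the functions only iterate over lines.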

I am using GSAlign v1.0.22. Any help would be most gratefully accepted, as I need to run a fair number of these alignments on large genomes and GSAlign seems to be by far the fastest tool available for this.

Oh yes, one other thing: could you possibly fix the exported VCF header so that the "*" in the FILTER field is recognized by bcftools (for the eventual VCF merge operation)?
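Until the header is fixed, one possible workaround is to rewrite the FILTER column before merging. The sketch below is an assumption-laden example, not part of GSAlign: it replaces a FILTER value of "*" with ".", which is the VCF missing-value marker, on the assumption that the "*" carries no real filter information. The function name is hypothetical, and the input is expected to be tab-separated as in a standard VCF file.

```python
def normalize_filter(lines):
    """Replace a '*' FILTER value (7th column) with '.' on data lines.

    Header lines (starting with '#') and blank lines pass through
    unchanged. Lines are expected without a trailing newline and with
    tab-separated columns, as in a standard VCF file.
    """
    out = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            out.append(line)
            continue
        fields = line.split("\t")
        if len(fields) > 6 and fields[6] == "*":
            fields[6] = "."  # VCF missing-value marker
        out.append("\t".join(fields))
    return out


# Sample input: the record from this report, tab-separated.
sample = [
    "##fileformat=VCFv4.3",
    "chr1\t5535808\t.\tA\tg\t100\t*\tTYPE=SUBSTITUTE",
]
fixed = normalize_filter(sample)
print("\n".join(fixed))
```

Only the seventh column is touched, so a "*" appearing anywhere else in a record (for example as an ALT allele) is left alone.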

many thanks

rderelle commented 3 months ago

I'm seeing the same issue with some bacterial genomes. It looks like a bug. My guess is that GSAlign gets confused when contigs align to the reference as their reverse complements.
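The example in the original report fits this guess: T>c in genome1.vcf is the complement of A>g in genome2.vcf. A small Python sketch of how one might test for that pattern is below. The helper names are invented for this example, the true reference base is passed in by hand rather than read from the FASTA, and this is a diagnostic idea rather than a description of what GSAlign does internally.

```python
# Translation table for complementing DNA bases, preserving case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string, preserving case."""
    return seq.translate(COMPLEMENT)[::-1]


def classify_ref(ref, true_base):
    """Compare a VCF REF allele with the base actually in the reference.

    Returns 'match', 'reverse-complement', or 'mismatch'.
    """
    ref, true_base = ref.upper(), true_base.upper()
    if ref == true_base:
        return "match"
    if reverse_complement(ref) == true_base:
        return "reverse-complement"
    return "mismatch"


def to_forward(ref, alt):
    """Reverse-complement both alleles of a record reported on the minus strand."""
    return reverse_complement(ref), reverse_complement(alt)


# Record from genome1.vcf (REF=T, ALT=c); the true reference base is A.
status = classify_ref("T", "A")
forward = to_forward("T", "c")
print(status)
print(forward)
```

If most conflicting sites come back as "reverse-complement" rather than "mismatch", that would support the strand explanation.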