hsinnan75 / GSAlign

GSAlign: an ultra-fast sequence alignment algorithm for intra-species genome comparison
MIT License

VCF output and ref. /qry. #3

Closed kingralph80 closed 4 years ago

kingralph80 commented 4 years ago

Hi,

Thanks for updating GSAlign so quickly. It's really an amazingly fast genome alignment tool!

When we checked the VCF output from GSAlign, many entries were duplicated. In addition, we found many additional SNPs in alignments that did not contain any SNPs. Do these come from multiple alignments?

In the past you added ref_ and qry_ to allow easy handling of identical chromosome names. This was a great addition; however, most tools such as mafToPsl or mafToAxt only support . instead of _. Could you change ref_ and qry_ to ref. and qry.?

hsinnan75 commented 4 years ago

Hi, I added a new option (-unique) to output only unique alignments (version 1.0.21). It should remove all redundant alignments. I've also changed ref_ and qry_ to ref. and qry.

hsinnan75 commented 4 years ago

I fixed a bug in determining the uniqueness of alignments. Please update GSAlign to version 1.0.21. Thank you! Non-unique alignments in the MAF file will be annotated with "a score=1". You can skip those alignments in variant calling.
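For readers who want to skip the flagged blocks before variant calling, here is a minimal Python sketch. It is not part of GSAlign. It assumes the standard MAF layout, where each block starts with an `a` line carrying `key=value` fields, and it assumes a non-unique block carries exactly `score=1`, as described above. The function names `block_score` and `drop_non_unique` are hypothetical.

```python
def block_score(a_line):
    """Return the score field of a MAF 'a' line as a string, or None if absent."""
    for token in a_line.split()[1:]:
        key, _, value = token.partition("=")
        if key == "score":
            return value
    return None


def drop_non_unique(lines):
    """Return MAF lines with every block annotated 'a score=1' removed.

    Assumption: score=1 marks a non-unique alignment, per the comment above.
    Header lines that precede the first block are kept unchanged.
    """
    kept = []
    keep = True  # lines before the first 'a' line (headers) are kept
    for line in lines:
        stripped = line.rstrip("\n")
        if stripped == "a" or stripped.startswith("a "):
            # A new block starts here; decide whether to keep the whole block.
            keep = block_score(stripped) != "1"
        if keep:
            kept.append(stripped)
    return kept
```

To filter a file, one could pass an open file handle, for example `drop_non_unique(open("output.maf"))`, where `output.maf` is a placeholder path, and write the returned lines back out joined by newlines.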

kingralph80 commented 4 years ago

Hi. Thanks. How does -unique work? Does it output only the best alignment? What if two alignments get the same score?

hsinnan75 commented 4 years ago

It simply checks whether a query fragment has multiple alignments against the reference sequence. Just as in short-read alignment, some reads may have multiple hits if they come from a repetitive sequence region. So given two alignments, say Aln1(qry_pos1, qry_pos2, ref_pos1, ref_pos2) and Aln2(qry_pos3, qry_pos4, ref_pos3, ref_pos4), if qry_pos1 = qry_pos3 and qry_pos2 = qry_pos4, then the two alignments are considered not unique. In such cases, Aln2 will be removed and the alignment score of Aln1 will be set to 1 if -unique is set. In addition, if two alignments are highly overlapped, that is, they share a large portion of the query block or the reference block, they are also considered not unique. In such cases, the shorter one will be removed.

If two alignments get the same score, that does not necessarily mean they are not unique. Uniqueness is based on the alignment regions rather than the alignment scores.
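The rule described above can be illustrated with a short Python sketch. This is only an illustration of the description, not GSAlign's actual code. The `Aln` class, the function names, the half-open coordinates, and the `min_overlap=0.8` threshold are all assumptions, since the thread does not say how large an overlap counts as "highly overlapped".

```python
from dataclasses import dataclass


@dataclass
class Aln:
    qry_start: int
    qry_end: int   # half-open end coordinate (assumption)
    ref_start: int
    ref_end: int
    score: int


def overlap_fraction(s1, e1, s2, e2):
    """Overlap length divided by the length of the shorter interval."""
    overlap = min(e1, e2) - max(s1, s2)
    shorter = min(e1 - s1, e2 - s2)
    return max(overlap, 0) / shorter if shorter > 0 else 0.0


def mark_non_unique(alns, min_overlap=0.8):
    """Apply the two uniqueness rules; min_overlap is a hypothetical threshold."""
    kept = []
    for aln in alns:
        dropped = False
        for i, other in enumerate(kept):
            # Rule 1: identical query interval. Drop the later alignment
            # and set the score of the earlier one to 1.
            if (aln.qry_start, aln.qry_end) == (other.qry_start, other.qry_end):
                other.score = 1
                dropped = True
                break
            # Rule 2: large overlap on the query or reference block.
            # Keep the longer alignment, drop the shorter one.
            q = overlap_fraction(aln.qry_start, aln.qry_end,
                                 other.qry_start, other.qry_end)
            r = overlap_fraction(aln.ref_start, aln.ref_end,
                                 other.ref_start, other.ref_end)
            if max(q, r) >= min_overlap:
                if aln.qry_end - aln.qry_start > other.qry_end - other.qry_start:
                    kept[i] = aln
                dropped = True
                break
        if not dropped:
            kept.append(aln)
    return kept
```

Note that the scores are never compared: two alignments with equal scores both survive unless their query or reference regions coincide or overlap heavily.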

kingralph80 commented 4 years ago

Thank you!