hsinnan75 / GSAlign

GSAlign: an ultra-fast sequence alignment algorithm for intra-species genome comparison
MIT License

Alignment coordinates out of range #2

Closed kingralph80 closed 4 years ago

kingralph80 commented 4 years ago

Hi,

We continued using GSAlign, but when converting the MAF output to axt or psl we got the following error: Coordinates out of range line 3034523 of B73v5.CML322.50.100.full.maf

The alignment in question was:

a score=111
s ref.scaf_395 59559 111 + 59635 ttttcataaaaaatgggggttgtgtggccatttatcatcgactagaggctcataaacctcaccccacatatgtttccacattcttggatttctggtggagaccatttcttg
s qry.scaf_63 310118 111 + 315365 ttttcataaaaaatgggggttgtgtggccatttatcatcgactagaggctcataaacctcaccccacatatgtttccacattcttggatttctggtggagaccatttcttg

The line ( s ref.scaf_395 59559 111 + 59635 ) does indeed seem out of range, as 59559 + 111 = 59670 is larger than 59635. Is it possible that this is a bug in choosing which coordinates to print? We used the latest commit, GenAlign v1.0.16.

Cheers.
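The range check described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, assuming the standard MAF field order for an "s" line (source name, zero-based start, aligned size, strand, source size, text); the function name `check_maf_s_line` is hypothetical and is not part of GSAlign or of any converter.

```python
def check_maf_s_line(line):
    """Return (name, end, src_size, in_range) for one MAF 's' line.

    Assumes the standard MAF layout: s <src> <start> <size> <strand> <srcSize> <text>.
    """
    fields = line.split()
    if len(fields) < 6 or fields[0] != "s":
        raise ValueError("not a MAF 's' line")
    name = fields[1]
    start, size = int(fields[2]), int(fields[3])
    src_size = int(fields[5])
    end = start + size  # zero-based start, so this is the half-open end
    # A valid record must end at or before the end of the source sequence.
    return name, end, src_size, end <= src_size


# The reference line from the report (sequence text shortened for display).
name, end, src_size, ok = check_maf_s_line(
    "s ref.scaf_395 59559 111 + 59635 ttttcataaaaaatggg"
)
print(name, end, src_size, ok)
```

For the reported line, the computed end (59670) exceeds the source size (59635), so the record fails the check, which matches the converter's complaint.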

kingralph80 commented 4 years ago

In case it helps, here is a second alignment that caused the same error:

a score=274
s ref.scaf_62 133360 274 + 133633 tattattgaaaatggtcgctcatggctattttcaaggtcgctcatggctattttcataaaaaatgggggttgtgtggccatttatcatcgactagaggctcataaacctcaccccacatatgtttccttgccatagattacattcttggatttctggtggaaaccatttcttggttaaaaactcgtacgtgttagccttcggtattattgaaaatggtcattcatggctattttcggcaaaatgggggttgtgtggccattgatcgtcgaccaa
s qry.scaf_462 1559 274 + 67495 tattattgaaaatggtcgctcatggctattttcaaggtcgctcatggctattttcataaaaaatgggggttgtgtggccatttatcatcgactagaggctcataaacctcaccccacatatgtttccttgccatagattacattcttggatttctggtggaaaccatttcttggttaaaaactcgtacgtgttagccttcggtattattgaaaatggtcattcatggctattttcggcaaaatgggggttgtgtggccattgatcgtcgaccaa

Here it seems to be out of range by 1 bp: 133360 + 274 = 133634, against a sequence length of 133633. (GenAlign v1.0.16)

hsinnan75 commented 4 years ago

Thank you for reporting the problematic cases. Could you show me where I can download the two genome sequences? It'd be better if I could have the two genomes for debugging. Thank you!

kingralph80 commented 4 years ago

I sent them to you via email. Let me know if you did not get them or if the download from Google Drive gives an error.

kingralph80 commented 4 years ago

I sent you the links; please let me know if there are access-rights issues with the download.



hsinnan75 commented 4 years ago

Thank you for the test data. I've found and fixed a bug. Please update GSAlign to 1.0.18. The bug occurred because matching strings could mistakenly span multiple reference sequences. Thank you for letting me know about this bug.
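To illustrate how a match spanning multiple reference sequences can produce out-of-range coordinates, here is a minimal sketch. It assumes, purely for illustration, that the reference sequences are concatenated into one coordinate space before matching; the thread does not describe GSAlign's internals, and the names `build_offsets` and `split_match` and the toy scaffolds are hypothetical. This is not the actual fix in 1.0.18.

```python
from bisect import bisect_right


def build_offsets(lengths):
    """Start offset of each sequence in the (assumed) concatenated coordinate space."""
    offsets, total = [], 0
    for n in lengths:
        offsets.append(total)
        total += n
    return offsets


def split_match(offsets, lengths, names, gstart, glen):
    """Split a match given in concatenated coordinates into per-sequence pieces."""
    gend = gstart + glen
    if gstart < 0 or gend > offsets[-1] + lengths[-1]:
        raise ValueError("match lies outside the concatenated sequence")
    pieces = []
    i = bisect_right(offsets, gstart) - 1  # sequence containing gstart
    while gstart < gend:
        seq_end = offsets[i] + lengths[i]
        piece_end = min(gend, seq_end)  # never cross a sequence boundary
        if piece_end > gstart:
            pieces.append((names[i], gstart - offsets[i], piece_end - gstart))
        gstart = piece_end
        i += 1
    return pieces


# Toy example: two scaffolds of length 100 and 50, and a 20 bp match
# starting 10 bp before the end of the first scaffold.
names, lengths = ["scaf_A", "scaf_B"], [100, 50]
offsets = build_offsets(lengths)
print(split_match(offsets, lengths, names, 90, 20))
```

Reporting the unsplit match as (scaf_A, start 90, size 20) would give 90 + 20 = 110 on a 100 bp sequence, the same kind of out-of-range record seen in the MAF output above. Splitting at the boundary keeps every piece inside its own sequence.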

hsinnan75 commented 4 years ago

Hi, GSAlign was originally designed to perform one-on-one alignment; that is, it aligned each query sequence only to its most similar reference sequence. In the updated version (1.0.18), I removed this strategy and let GSAlign align a query sequence to all locally similar sequences. However, this takes much longer if the two genomes have many duplicons (repetitive sequences). Thus, in the latest version (1.0.19), I added an option (-one) that lets the user decide which alignment mode GSAlign performs. If "-one" is set, GSAlign performs one-on-one alignment; otherwise, it performs all-against-all alignment.

kingralph80 commented 4 years ago

Thanks a lot! The new output from 1.0.19 did not cause any errors when converting it to chain. The addition of the new alignment mode is also much appreciated. The alignment step did not take much longer, maybe a few extra minutes, but the mapped length increased by ~25-30%.

When I now compare the liftover of variants between GSAlign and Progressive Cactus, both lifted almost the same number of variants! This is great, as Cactus is very good but took us almost 2 weeks of run time.

There are still some bugs with the VCF output, but I will open a new thread for those, as this problem has been fixed and this issue can be closed.