thegenemyers / DAMAPPER

Long read to reference genome mapping tool

duplicate LAs in chains #7

Open egoltsman opened 7 years ago

egoltsman commented 7 years ago

Hello, I'm trying out DAMAPPER for mapping a set of unitigs to PacBio reads. When parsing the results, I notice that duplicate alignments to different parts of the same read are sometimes reported as chains, which does not seem to be the intended behavior. Perhaps this is because I'm aligning short sequences to long reads, and not vice versa? Here's an example from the LAdump -lc output:

P 6661 109492 n >
L 83 12383
C 0 83 11853 11937
P 6661 109492 n -
L 83 12383
C 0 83 11948 12032

Here, the query contig '83' aligned fully to two locations on read '109492', which should probably be reported as two separate LAs, and not as a chain.

Eugene G
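A check like the one described above can be sketched in Python. This is a minimal sketch, not part of DAMAPPER: it assumes the LAdump record layout as documented for DALIGNER (a `P` line gives the A read, B read, orientation and chain flag, and a `C` line gives the A and B alignment coordinates), and the names `parse_ladump`, `group_chains` and `duplicate_links` are hypothetical. Treating overlapping A intervals in consecutive chain links as duplicates is a heuristic chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class LA:
    a: int          # A read index
    b: int          # B read index
    orient: str     # 'n' or 'c'
    flag: str       # chain flag taken from the P line
    ab: int = -1    # A interval start (from the C line)
    ae: int = -1    # A interval end
    bb: int = -1    # B interval start
    be: int = -1    # B interval end


def parse_ladump(text):
    """Parse P and C records; other record types (e.g. L) are ignored."""
    las = []
    for line in text.splitlines():
        f = line.split()
        if not f:
            continue
        if f[0] == "P":
            las.append(LA(int(f[1]), int(f[2]), f[3], f[4]))
        elif f[0] == "C" and las:
            # Attach coordinates to the most recent P record.
            las[-1].ab, las[-1].ae, las[-1].bb, las[-1].be = map(int, f[1:5])
    return las


def group_chains(las):
    """Group records into chains: a '-' flag continues the previous chain."""
    chains = []
    for la in las:
        if la.flag == "-" and chains:
            chains[-1].append(la)
        else:
            chains.append([la])
    return chains


def duplicate_links(chain):
    """Return consecutive link pairs whose A intervals overlap (heuristic)."""
    dups = []
    for prev, cur in zip(chain, chain[1:]):
        if cur.ab < prev.ae and prev.ab < cur.ae:
            dups.append((prev, cur))
    return dups
```

Run on the six records quoted above, this yields a single chain of two links, and `duplicate_links` flags the pair because both links cover A interval [0, 83].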

thegenemyers commented 7 years ago

Yes, that's not right. It's good that Damapper finds both matches, but it should not be a chain. Any chance you could send me a minimal example that produces the problem? For example, if building a DAM for read 6661 and another for read 109492 and comparing them reproduces the problem, then sending me .fasta's of the two reads would suffice to let me reproduce it. (You can produce a .fasta of one or more reads in a DB or DAM with "DBshow"). Please advise. -- Gene


egoltsman commented 7 years ago

Hi Gene, Thanks for getting back to me on this one. I did as you instructed and could reproduce this behavior. Please see the fastas attached.

I ran into another issue in the meantime. It looks like if LAshow and LAdump are run with the -o option to weed out improper overlaps, this filter gets applied to chain fragments and not to entire chains, i.e., you can end up with a chain 'continuation' alignment reported without the start of the chain:

P 517 2878 c >
P 519 776 n -
P 521 2226 c >

Thanks!

Eugene
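Orphaned continuation records like the one above can be detected with a short script. This is a minimal sketch, assuming the chain flag is the fifth field of each `P` line and that every link of a chain shares the same A and B reads; the function name `find_orphan_continuations` is hypothetical and not part of any Dazzler tool.

```python
def find_orphan_continuations(text):
    """Return (a, b, orient) for each '-' record that does not follow a
    record with the same A and B reads (assumed to mean its chain start
    was filtered out)."""
    orphans = []
    prev = None  # (A read, B read) of the preceding P record
    for line in text.splitlines():
        f = line.split()
        if not f or f[0] != "P":
            continue  # skip L, C and any other record types
        key = (f[1], f[2])
        if f[4] == "-" and key != prev:
            orphans.append((int(f[1]), int(f[2]), f[3]))
        prev = key
    return orphans
```

On the three records quoted above, this reports the `519 776 n` record as an orphan, since the record before it belongs to a different read pair.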
