thegenemyers / DALIGNER

Find all significant local alignments between reads

LAcheck: Duplicate overlap #42

Closed pb-cdunn closed 7 years ago

pb-cdunn commented 8 years ago
+ LAcheck -v rawreads.db rawreads.3.rawreads.12.C3.las
  rawreads.3.rawreads.12.C3: Duplicate overlap (13481 vs 54643)

Test-data available at:

(Took 20min to upload, but should be quicker for you to download.)

pb-cdunn commented 8 years ago

Adding @pb-jchin.

pb-jchin commented 8 years ago

Just a comment: FALCON's overlap-to-graph step only takes the first overlap for each pair of reads, so duplicate records will not break FALCON's consensus and overlap-to-graph modules.
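On one reading of the comment above, "takes the first" amounts to a keep-first filter keyed on the read pair. The following is a minimal sketch of that idea only, not FALCON's actual code: the record layout and the function name `keep_first_per_pair` are hypothetical, the coordinates are made up, and the two IDs are the numbers from the LAcheck message, assumed here to be read IDs.

```python
from typing import Iterable, List, Tuple

# Hypothetical record layout: (a_read, b_read, a_begin, a_end).
Overlap = Tuple[int, int, int, int]


def keep_first_per_pair(overlaps: Iterable[Overlap]) -> List[Overlap]:
    """Keep only the first record seen for each (a_read, b_read) pair."""
    seen = set()
    kept = []
    for ovl in overlaps:
        pair = (ovl[0], ovl[1])
        if pair in seen:
            continue  # a later record for the same pair is ignored
        seen.add(pair)
        kept.append(ovl)
    return kept


# Made-up records; the second one duplicates the first.
records = [
    (13481, 54643, 0, 1200),
    (13481, 54643, 0, 1200),
    (13481, 60001, 300, 2500),
]
unique = keep_first_per_pair(records)
```

With a filter like this, a duplicated record is dropped silently, which would be consistent with duplicates being harmless downstream even though LAcheck flags them.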

thegenemyers commented 8 years ago

Chris,

Sorry for the long delay in responding, but I've had a bit of travel and the fix was a little deeper than it might at first appear.

I committed new code that should fix the problem for you.

But you should also be advised that using a trace-point spacing of 1000 (-s1000) when the minimum local alignment is also 1000 has some drawbacks. While you save a factor of 5 (not 10 for complex reasons) in disk space, you have lost the chance to compute intrinsic quality values at a granularity that is useful.

Also, when looking for overlapping alignments (as well as duplicates), the daligner looks for alignments that share a trace point. The duplicate alignment reported had only 2 trace points near the ends of the rather short alignment involved and neither one matched. When I reran the thing with -s100 there was no problem. However, I have fixed the problem by considering end-points of alignments to be "trace-points".

Cheers,
    Gene
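To make the trace-point argument concrete, here is a minimal sketch of the "share a point" test with and without the alignment end-points. It is an illustration under stated assumptions, not DALIGNER's implementation: the `Alignment` type, the function names, and all coordinates are hypothetical. It models two alignments of the same read pair with identical end-points whose two interior trace points (at -s1000 spacing) land on slightly different B-read positions.

```python
from typing import List, NamedTuple, Set, Tuple

Point = Tuple[int, int]  # (position in read A, position in read B)


class Alignment(NamedTuple):
    begin: Point        # start of the local alignment
    end: Point          # end of the local alignment
    trace: List[Point]  # interior trace points only


def comparison_points(aln: Alignment, include_ends: bool) -> Set[Point]:
    """Points used to decide whether two alignments coincide somewhere."""
    points = set(aln.trace)
    if include_ends:
        points.update((aln.begin, aln.end))
    return points


def share_point(x: Alignment, y: Alignment, include_ends: bool) -> bool:
    """True if the two alignments have at least one comparison point in common."""
    return not comparison_points(x, include_ends).isdisjoint(
        comparison_points(y, include_ends)
    )


# Made-up data: same end-points, interior trace points off by a base or two.
first = Alignment((950, 120), (2100, 1275), [(1000, 170), (2000, 1174)])
second = Alignment((950, 120), (2100, 1275), [(1000, 171), (2000, 1176)])

missed = share_point(first, second, include_ends=False)  # interior points only
caught = share_point(first, second, include_ends=True)   # end-points count too
```

In this toy case the interior-only test finds nothing in common, while counting the end-points as extra "trace-points" links the two records. A smaller spacing such as -s100 would also add more interior points, and so more chances to match.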



pb-cdunn commented 8 years ago

Thanks for that clarification. We might (I hope) switch to a shorter trace-point spacing so that we don't miss alignments.

And thank you for solving this tricky problem. Much appreciated. We had turned off LAcheck pending a solution.

pb-jchin commented 8 years ago

We use -s100 for most projects. The current FALCON pipeline does not use trace points. If -s100 is better for checking, we can always use -s100.

pb-jchin commented 8 years ago

A side note: while the consensus module used in FALCON does not use the trace points now, it will eventually be useful to have the alignment end-point information. The consensus module does its own O(ND) alignment. A k-mer match table and k-mer binning are used to find the beginning and the end of the alignment within the consensus module. If the begin and end points are already known, that saves a small amount of computation.
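The k-mer table and binning step described above can be sketched roughly as follows. This is a toy illustration of the general technique (index k-mers, bin hits by diagonal, take the extent of the fullest bin), not FALCON's code: the function names, `k`, `bin_size`, and the sequences are all assumptions chosen for the example.

```python
from collections import defaultdict
from typing import DefaultDict, List, Optional, Tuple

Range = Tuple[int, int]  # half-open (begin, end)


def kmer_positions(seq: str, k: int) -> DefaultDict[str, List[int]]:
    """Map every k-mer in seq to the list of positions where it starts."""
    table: DefaultDict[str, List[int]] = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def estimate_alignment_range(
    query: str, target: str, k: int = 4, bin_size: int = 2
) -> Optional[Tuple[Range, Range]]:
    """Estimate (query_range, target_range) from k-mer hits on the densest diagonal bin."""
    table = kmer_positions(target, k)
    bins: DefaultDict[int, List[Tuple[int, int]]] = defaultdict(list)
    for q in range(len(query) - k + 1):
        for t in table.get(query[q:q + k], ()):
            # Hits on nearby diagonals (q - t) fall into the same bin.
            bins[(q - t) // bin_size].append((q, t))
    if not bins:
        return None  # no shared k-mers at all
    best = max(bins.values(), key=len)
    q_beg = min(q for q, _ in best)
    q_end = max(q for q, _ in best) + k
    t_beg = min(t for _, t in best)
    t_end = max(t for _, t in best) + k
    return (q_beg, q_end), (t_beg, t_end)


# Made-up sequences: the query occurs inside the target at offset 6.
result = estimate_alignment_range("ACGTGCATCAGG", "GGATTCACGTGCATCAGGTTAA")
```

If the overlapper already reported the begin and end points, a step like this could be skipped, which is the small saving mentioned above.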

pb-cdunn commented 7 years ago

We're seeing this again. We're trying to get another test-case from a user to reproduce it, but it's probably large. (PacificBiosciences/FALCON-integrate#103)

Can you think of any other cause for this?

thegenemyers commented 7 years ago

If you mean the duplicate overlap report, there could be another bug or missed consideration, although that seems unlikely. I have recently run full-scale data sets without a hiccup.

In #103 the problem appeared to be a core dump. Could it again be the small /tmp issue? If not, then we are back to needing to exchange an example of the phenomenon that allows me to check what's going on.

-- Gene

