thegenemyers / DALIGNER

Find all significant local alignments between reads
Other
139 stars 61 forks source link

Trace point sum != aligned interval #43

Closed fbemm closed 7 years ago

fbemm commented 8 years ago

Dear Gene,

I am running daligner and LAcheck on P5-C3 data for two projects and repeatedly get the following error:

Trace point sum != aligned interval

Log output does not show any problems:

Building index for raw_reads.11

Kshift=28 BSHIFT=8 TooFrequent=16 (Kshift-1)/BSHIFT + (TooFrequent < INT32_MAX)=4 sizeof(KmerPos)=16 nreads=70565 Kmer=14 block->reads[nreads].boff=400076603 kmers=399088693 sizeof(KmerPos)*(kmers+1)=6385419104 Allocated 399088694 of 16 (6385419104 bytes) at 0x2acd77474010 Kmer count = 399,088,693 Using 11.89Gb of space Revised kmer count = 341,657,466 Index occupies 5.09Gb

Comparing raw_reads.47 to raw_reads.11

Capping mutual k-mer matches over 10000 (effectively -t100) Hit count = 1,035,740,743 Highwater of 36.00Gb space

 1,035,740,743 14-mers (6.473235e-09 of matrix)
     1,891,766 seed hits (1.182327e-11 of matrix)
     1,365,922 confirmed hits (8.536821e-12 of matrix)

Building index for c(raw_reads.11)

Kshift=28 BSHIFT=8 TooFrequent=16 (Kshift-1)/BSHIFT + (TooFrequent < INT32_MAX)=4 sizeof(KmerPos)=16 nreads=70565 Kmer=14 block->reads[nreads].boff=400076603 kmers=399088693 sizeof(KmerPos)*(kmers+1)=6385419104 Allocated 399088694 of 16 (6385419104 bytes) at 0x2acd77474010 Kmer count = 399,088,693 Using 11.89Gb of space Revised kmer count = 341,657,466 Index occupies 5.09Gb

Comparing raw_reads.47 to c(raw_reads.11)

Capping mutual k-mer matches over 10000 (effectively -t100) Hit count = 983,583,036 Highwater of 34.45Gb space

 983,583,036 14-mers (6.147256e-09 of matrix)
   1,608,346 seed hits (1.005194e-11 of matrix)
   1,154,751 confirmed hits (7.217032e-12 of matrix)

Is there fast forward way the see which read alignment is causing the issues?

Bests, Felix

pb-jchin commented 8 years ago

@fbemm @thegenemyers we also find this after we put code in FALCON to do routine LAcheck for the first stage overlapping. @fbemm You can temporary work around it by turning off the LAcheck for now in the particular jobs. The downstream FALCON code seems not to be affected in our experience.

pb-cdunn commented 8 years ago

Gene, I haven't had time to attempt to debug this yet. I have a test-case to reproduce this, but it's way too large to upload. Multiple users have seen it running LAcheck (some contacting me privately). If you have any ideas, please let us know.

fbemm commented 8 years ago

I uploaded a las file that throws the error. In case you need more, let me/us know!

https://owncloud.tuebingen.mpg.de/public.php?service=files&t=0f8c8403b8bef19c6a4d4c225106cdd1

thegenemyers commented 8 years ago

I can reproduce the problem and will let you know shortly if it is a bug in LAcheck. The other possibility is more ugly: daligner is producing a bad local alignment. But let's worry about that if it comes to it. -- Gene

On 7/8/16, 9:11 AM, Felix Bemm wrote:

I uploaded a las file that throws the error. In case you need more, let me/us know!

https://owncloud.tuebingen.mpg.de/public.php?service=files&t=0f8c8403b8bef19c6a4d4c225106cdd1

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/thegenemyers/DALIGNER/issues/43#issuecomment-231291178, or mute the thread https://github.com/notifications/unsubscribe/AGkkNp2chqFsJ7hTDaF_k7mCkwE3pRyWks5qTfgOgaJpZM4JHFMy.

thegenemyers commented 8 years ago

Dear Felix,

 Thanks for the .las file.  The reads that throw the error are

     1,576,561 and 2,935,085

The error is produced by daligner, LAcheck is simply revealing that a la-record is ill-formed. It has something to do with the large trace spacing (-s1000) used -- I have run several large vertebrate genomes without seeing this problem on the default spacing of -s100. Jason's observation that it had no effect down stream is because Falcon never tries to reconstruct an alignment from the tracepoints. LAshow -a should crash on this record!

 Could you extract the two reads above from raw.reads.db and send

them to me? It should be the case that running daligner on a database consisting of these two reads, and then LAcheck on the results will reproduce the error. If you could further confirm this that would be great.

 From there I should be able to then uncover the bug and fix it.

 Best regards,
     Gene

On 7/8/16, 9:11 AM, Felix Bemm wrote:

I uploaded a las file that throws the error. In case you need more, let me/us know!

https://owncloud.tuebingen.mpg.de/public.php?service=files&t=0f8c8403b8bef19c6a4d4c225106cdd1

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/thegenemyers/DALIGNER/issues/43#issuecomment-231291178, or mute the thread https://github.com/notifications/unsubscribe/AGkkNp2chqFsJ7hTDaF_k7mCkwE3pRyWks5qTfgOgaJpZM4JHFMy.

fbemm commented 8 years ago

Hey Gene,

that's what I did:

printf "1576561\n2935085" > debug.list DBshow raw_reads.db debug.list > debug.fasta fasta2DB debug debug.fasta daligner -v -t16 -H7000 -e0.7 -s1000 debug.db debug.db LAcheck -v debug.db *.las

Results:

debug.debug.C0: 0 all OK debug.debug.C1: 0 all OK debug.debug.C2: 0 all OK debug.debug.C3: 0 all OK debug.debug.N0: 0 all OK debug.debug.N1: 0 all OK debug.debug.N2: 0 all OK debug.debug.N3: 0 all OK

The two reads you mention above don't have any overlap so there is no record in the las files. I uploaded files to the cloud again. The whole db is 5Gb. Let me know if you need it!

Thanks, Felix

thegenemyers commented 8 years ago

Sorry, I forgot about internal/external indexing. You need to add 1 to each index, i.e. the reads in question are:

1,576,562 and 2,935,086

Please confirm they overlap and the error is reproduced. All I will need thereafter is debug.fasta.

 Thanks,
     Gene

On 7/9/16, 2:10 PM, Felix Bemm wrote:

Hey Gene,

that's what I did:

|printf "1576561\n2935085" > debug.list| |DBshow raw_reads.db debug.list > debug.fasta| |fasta2DB debug debug.fasta| |daligner -v -t16 -H7000 -e0.7 -s1000 debug.db debug.db| |LAcheck -v debug.db *.las|

Results:

|debug.debug.C0: 0 all OK| |debug.debug.C1: 0 all OK| |debug.debug.C2: 0 all OK| |debug.debug.C3: 0 all OK| |debug.debug.N0: 0 all OK| |debug.debug.N1: 0 all OK| |debug.debug.N2: 0 all OK| |debug.debug.N3: 0 all OK|

The two reads you mention above don't have any overlap so there is no record in the las files. I uploaded files to the cloud again. The whole db is 5Gb. Let me know if you need it!

Thanks, Felix

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/thegenemyers/DALIGNER/issues/43#issuecomment-231531400, or mute the thread https://github.com/notifications/unsubscribe/AGkkNpYevnF_7y9VOLjD4AFajdmVYsqGks5qT4-4gaJpZM4JHFMy.

fbemm commented 8 years ago

I can confirm that the reads somehow overlap (both have a larger GC rich region) but I can't reproduce the error. LAcheck says ok on both, unsorted and sorted .las files. Pushed debug.fasta to the cloud again but probably this does not help.

thegenemyers commented 8 years ago

Hi Felix,

 I was able to confirm that the reads were the correct ones. They 

both have low complexity regions/micro satellites that result in many local alignments between these two reads. Unfortunately, the alignment that was broken in the original run is not broken when only these two reads constitute the data set. Specifically, the trace point sum was 25 less than the B-interval of the alignment in the original case, but now the B-interval is 25 less so the two correspond !

 So unfortunately something subtle is going on and we need to create 

the problem with a "small" data set if possible. Could I ask you to basically create a data base consisting of the reads in block raw_reads.11, i.e.

 DBshow raw_reads.11 >debug11.fasta
 fast2DB raw11 debug11.fasta

and another of block raw_reads.47 and then see if "daligner ... raw47 raw11" creates a .las file that exhibits the LAcheck problem? (or perhaps the concatenation of the two blocks into one split database, or "daligner ... raw11 raw 47"). If one of these recreates the problem, then all you would need to pass me is the two blocks. Otherwise we would have to go to square 1 and I would need to download the entire DB (which I really hope we can avoid)

-- Gene

On 7/10/16, 4:03 PM, Felix Bemm wrote:

I can confirm that the reads somehow overlap (both have a larger GC rich region) but I can't reproduce the error. LAcheck says ok on both, unsorted and sorted .las files. Pushed debug.fasta to the cloud again but probably this does not help.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/thegenemyers/DALIGNER/issues/43#issuecomment-231590641, or mute the thread https://github.com/notifications/unsubscribe/AGkkNkFI4eFnBYbSVMMG-VhZ1msfbyjxks5qUPuigaJpZM4JHFMy.

fbemm commented 8 years ago

The 1-block attempt (raw11 vs raw47 without DBsplit -x2) shows a new LAckeck message:

raw11.raw47.C0: 1,510,458 all OK raw11.raw47.C0.S: 1,510,458 all OK raw11.raw47.C1: 1,441,800 all OK raw11.raw47.C1.S: 1,441,800 all OK raw11.raw47.C2: 1,400,376 all OK raw11.raw47.C2.S: 1,400,376 all OK raw11.raw47.C3: 1,366,367 all OK raw11.raw47.C3.S: 1,366,367 all OK raw11.raw47.N0: 1,688,069 all OK raw11.raw47.N0.S: 1,688,069 all OK raw11.raw47.N1: 1,625,219 all OK raw11.raw47.N1.S: 1,625,219 all OK raw11.raw47.N2: 1,590,018 all OK raw11.raw47.N2.S: 1,590,018 all OK raw11.raw47.N3: 1,559,125 all OK raw11.raw47.N3.S: 1,559,125 all OK raw47.raw11.C0: Non-sense alignment intervals raw47.raw11.C0.S: Non-sense alignment intervals raw47.raw11.C1: Non-sense alignment intervals raw47.raw11.C1.S: Non-sense alignment intervals raw47.raw11.C2: Non-sense alignment intervals raw47.raw11.C2.S: Non-sense alignment intervals raw47.raw11.C3: Non-sense alignment intervals raw47.raw11.C3.S: Non-sense alignment intervals raw47.raw11.N0: Non-sense alignment intervals raw47.raw11.N0.S: Non-sense alignment intervals raw47.raw11.N1: Non-sense alignment intervals raw47.raw11.N1.S: Non-sense alignment intervals raw47.raw11.N2: Non-sense alignment intervals raw47.raw11.N2.S: Non-sense alignment intervals raw47.raw11.N3: Non-sense alignment intervals raw47.raw11.N3.S: Non-sense alignment intervals

I uploaded two fasta files to subfolder 1-block.

fbemm commented 8 years ago

Forget the last comment. I missed to set the right parameters.

fbemm commented 8 years ago

I get the error when working on one db splitted into two blocks.

debug.debug.C0: 304,866 all OK debug.debug.C0.S: 304,866 all OK debug.debug.C1: 295,708 all OK debug.debug.C1.S: 295,708 all OK debug.debug.C2: 302,193 all OK debug.debug.C2.S: 302,193 all OK debug.debug.C3: 295,294 all OK debug.debug.C3.S: 295,294 all OK debug.debug.N0: 436,714 all OK debug.debug.N0.S: 436,714 all OK debug.debug.N1: 418,099 all OK debug.debug.N1.S: 418,099 all OK debug.debug.N2: Trace point sum != aligned interval debug.debug.N2.S: Trace point sum != aligned interval debug.debug.N3: 430,350 all OK debug.debug.N3.S: 430,350 all OK

fbemm commented 8 years ago

I don't get the error when I just used the two debug databases but receive the messages (Non-sense alignment intervals) from before.

thegenemyers commented 8 years ago

Felix,

Thanks for the two blocks of sequence.  I was able to reproduce the 

error a few minutes ago, so now its just a matter of my figuring out what the bug is ;-)

 On the "non-sense alignment intervals", I'm guessing that the 

problem is that you are not calling LAcheck with both databases. That is, if you compare database X against database Y, you will get files X.Y..las and also files of the form Y.X..las. You need to call "LAcheck X.db Y.db X.Y..las" and "LAcheck Y.db X.db Y.X..las". Both X.db and Y.db must be arguments and in the right order for the particular .las file. If this is not the problem please let me know.

-- Gene

On 7/13/16, 2:10 PM, Felix Bemm wrote:

I don't get the error when I just used the two debug databases but receive the messages (Non-sense alignment intervals) from before.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/thegenemyers/DALIGNER/issues/43#issuecomment-232336255, or mute the thread https://github.com/notifications/unsubscribe/AGkkNna7yPLNIhGGGbIekT-piq4XDoDbks5qVNWogaJpZM4JHFMy.

pb-cdunn commented 8 years ago

Gene, Any update? There is a lot of interest.

thegenemyers commented 8 years ago

Sorry I was concentrating on getting the mapper out. I do have a data set now that produces the bug, so its just a question of going at it. I'll get on it today. -- Gene

On 8/2/16, 5:35 PM, Christopher Dunn wrote:

Gene, Any update? There is a lot of interest.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/thegenemyers/DALIGNER/issues/43#issuecomment-236943282, or mute the thread https://github.com/notifications/unsubscribe-auth/AGkkNpmGUbnxVI5BEBfmyM31EDInca2Nks5qb2PDgaJpZM4JHFMy.

thegenemyers commented 8 years ago

I finally found the damn bug (I think). I introduced it about 6 weeks ago (coincidentally in making another fix for Chris :-) ).

I just committed the fixed version of the code. This bug also affected damapper (just released) and the datander (from the DAMASKER module), i.e. all the daligner-based variants.

Could y'all let me know if any problems persist. The fix I made removed the problem from the test case I have from Felix and would explain the general problem, including why it happens more often when the trace spacing is large (i.e. -s1000). But I'd like to get an "all clear" from you guys.

Thanks for your patience,
    Gene

On 7/13/16, 2:10 PM, Felix Bemm wrote:

I don't get the error when I just used the two debug databases but receive the messages (Non-sense alignment intervals) from before.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/thegenemyers/DALIGNER/issues/43#issuecomment-232336255, or mute the thread https://github.com/notifications/unsubscribe/AGkkNna7yPLNIhGGGbIekT-piq4XDoDbks5qVNWogaJpZM4JHFMy.

pb-cdunn commented 8 years ago

This is great news! I will test it by Monday at the latest.

pb-cdunn commented 8 years ago

After re-running dalinger, LAcheck worked! That was a previously failing test-case.

thegenemyers commented 8 years ago

Does LAcheck crash or report an error? It should be reporting an error, and that was the phenomenon Felix complained about. Do you really mean "crash"? Please advise.

And yes, things need to be rerun -- the trace point encoding that was being produced by daligner was indeed incorrect for a small handful of LA records, and any down stream process that tries to reconstruct an alignment (e.g. LAshow -a) from those will crash. The only reason you weren't getting a crash down stream is that you (so far) ignore trace points.

If the runs are too big to redo and you don't care about the fact that the trace points are not correct in a few LA records, then one could turn off this integrity check in LAcheck.c. Simply remove the call to "Check_Trace_Points" (lines 241&242).

-- Gene

On 8/6/16, 6:20 PM, Christopher Dunn wrote:

|LAcheck| itself still crashes on the old |.las| files. Would I need to re-run |daligner|, re-creating the |.las| files first?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/thegenemyers/DALIGNER/issues/43#issuecomment-238031535, or mute the thread https://github.com/notifications/unsubscribe-auth/AGkkNv1y8bbPH4b6fakh3IMyI-OIsBNdks5qdLRSgaJpZM4JHFMy.

pb-cdunn commented 8 years ago

LAcheck was reporting an error. Not a crash. Sorry for confusion.

Thanks for the tip. But to clarify, we see no problems now. Danke!