thegenemyers / DALIGNER

Find all significant local alignments between reads

LAsplit produces empty la's? #63

Closed. spock closed this issue 7 years ago

spock commented 7 years ago
$ LAsplit -v hinge.# 80 < hinge.las
  Distributing 22537470841 la's
  Split off hinge.1: 281899173 la's
  Split off hinge.2: 281787187 la's
  Split off hinge.3: 281578571 la's
  Split off hinge.4: 281853746 la's
  Split off hinge.5: 281560591 la's
  Split off hinge.6: 281660895 la's
  Split off hinge.7: 281978106 la's
  Split off hinge.8: 0 la's
  Split off hinge.9: 0 la's
...
  Split off hinge.36: 0 la's
  Split off hinge.37: 0 la's
  Split off hinge.38: 143176638 la's
  Split off hinge.39: 0 la's

And so on. I tried several times, with the same result.

I'm using LAsplit from commit 9e9acd358d2d8b6d.

The empty files are 12 bytes each, and checking them succeeds:

$ LAcheck -vS hinge.db hinge.8.las
  hinge.8: 0 all OK
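
A 12-byte file is consistent with a .las file that contains only a header and no alignment records. The following is a minimal sketch for inspecting that header, assuming it consists of an 8-byte alignment count followed by a 4-byte trace-point spacing in native byte order. That layout is an assumption about the format, not something stated in this thread, and `LasHeader` and `read_las_header` are hypothetical names, not part of DALIGNER.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed .las header layout: an 8-byte alignment count followed by a
   4-byte trace-point spacing, 12 bytes in total. The struct and function
   names are hypothetical and not part of DALIGNER. */
typedef struct {
    int64_t novl;    /* number of local alignments in the file */
    int32_t tspace;  /* trace-point spacing */
} LasHeader;

/* Read the header from `path`; return 0 on success, -1 on failure. */
int read_las_header(const char *path, LasHeader *h)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    /* Read field by field so struct padding does not matter. */
    int ok = fread(&h->novl, sizeof h->novl, 1, f) == 1 &&
             fread(&h->tspace, sizeof h->tspace, 1, f) == 1;
    fclose(f);
    return ok ? 0 : -1;
}
```

Under that assumption, a part that LAcheck reports as `0 all OK` would yield `novl == 0` here, which matches the "0 la's" lines in the LAsplit output.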

The input hinge.las is the result of HPC.daligner, LAmerge, and DASqv. The only unusual thing is that I ran DBdust but did not supply -mdust to HPC.daligner.

What could be wrong, and/or how do I debug this problem further?

Update 1: I went back and tested commits f424c185e6a81, 0430011a0fd42f9, 1c5d470fcbe4a8d9f, and c7fa67830d24, and they show the same problem. (The earliest commits actually never seem to stop at any size and keep writing data to the first file; I haven't waited to see what happens next.) I guess the problem is in my dataset, then.

thegenemyers commented 7 years ago

I believe the problem is that some variables are int32 and they need to be int64, as there are 10+ billion alignments in hinge.las. I will fix it soon, but could you try the following easy patch for me:

replace 'int32' with 'int64' in line 144 of LAsplit.c (and then recompile).

Let me know if that doesn't fix it. -- Gene
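
To illustrate the diagnosis, here is a minimal sketch in C showing that the reported total of 22537470841 alignments cannot be held in a signed 32-bit integer, and what would remain if only the low 32 bits were kept. The helper names `fits_int32` and `low32` are hypothetical and do not come from LAsplit.c; which variables are actually affected is not shown in this thread.

```c
#include <stdint.h>

/* Value left after keeping only the low 32 bits of a 64-bit count.
   Conversion to uint32_t is well defined in C (reduction modulo 2^32). */
uint32_t low32(int64_t count)
{
    return (uint32_t)count;
}

/* 1 if the count fits in a signed 32-bit integer, 0 otherwise. */
int fits_int32(int64_t count)
{
    return count >= INT32_MIN && count <= INT32_MAX;
}
```

For the total above, `fits_int32(22537470841)` returns 0 and `low32(22537470841)` returns 1062634361, whereas a single part such as 281899173 still fits comfortably.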

spock commented 7 years ago

This change seems to fix the "zero-size" issue.

I now have a different problem, but I'm not sure if it is data-related or "zero-size fix"-related.

The command now runs ok, initially:

$ time LAsplit -v /tmp/hinge.# 80 < hinge.las
  Distributing 22537470841 la's
  Split off hinge.1: 281899173 la's
  Split off hinge.2: 281787187 la's
...
  Split off hinge.7: 281978106 la's
  Split off hinge.8: 281585190 la's

However, it was stuck for a long time after hinge.8, and I had a look at the size of parts so far:

$ ls -s1h hinge.*
 25G hinge.1.las
 25G hinge.2.las
...
 25G hinge.7.las
 25G hinge.8.las
100G hinge.9.las

At this point I killed the process.

I also tried splitting into twice as many fragments, but encountered the same problem at around the same data size:

$ ls -vs1h hinge.*
13G hinge.1.las
13G hinge.2.las
...
13G hinge.15.las
13G hinge.16.las
61G hinge.17.las

I had a 2x smaller .las file lying around (same input data, but 2 kb overlaps), so I tried on that one, too:

$ time LAsplit -v /tmp/h2.# 80 < h2.las
  Distributing 7472368456 la's
  Split off h2.1: 93533371 la's
  Split off h2.2: 93299171 la's
...
  Split off h2.21: 93432568 la's
  Split off h2.22: 93398464 la's
  Split off h2.23: 93417164 la's

I had to kill the process again:

$ ls -vs1h h2.*
11G h2.1.las
11G h2.2.las
11G h2.3.las
...
44G h2.24.las

I no longer need LAsplit functionality, but can test further changes to find the cause of this new problem.
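
One pattern in the three runs above may help narrow this down: in each case the oversized file is the part written right after the running alignment count passes the signed 32-bit limit (2147483647). The sketch below computes that crossing point, assuming an even split and assuming the "twice as many fragments" run used 160 parts. `first_part_past_int32` is a hypothetical helper and not LAsplit's code.

```c
#include <stdint.h>

/* Hypothetical helper: smallest k such that the first k parts together
   hold more than INT32_MAX alignments, assuming an even split of `total`
   alignments over `parts` files. Returns 0 if that never happens. */
int first_part_past_int32(int64_t total, int parts)
{
    for (int k = 1; k <= parts; k++) {
        /* Running total after k parts, computed in 64 bits. */
        int64_t cumulative = total / parts * k;
        if (cumulative > INT32_MAX)
            return k;
    }
    return 0;
}
```

This returns 8 for hinge.las in 80 parts, 16 for hinge.las in 160 parts, and 23 for h2.las in 80 parts, while the oversized files were hinge.9, hinge.17, and h2.24 respectively. That is only a correlation, but it suggests another 32-bit counter may still be involved.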

spock commented 7 years ago

Update: I just tried to split the smaller file "by the database", and it doesn't seem to have this problem of extra-large parts:

$ time LAsplit -v h2.# h2.db < h2.las
  Distributing 7472368456 la's
  Split off h2.1: 232855070 la's
  Split off h2.2: 241411251 la's
...
  Split off h2.9: 239012941 la's
  Split off h2.10: -4044725815 la's
  Split off h2.11: 250723547 la's
...

The file sizes in this case look fine:

$ ls -vs1h h2.*
26G h2.1.las
27G h2.2.las
27G h2.3.las
28G h2.4.las
29G h2.5.las
28G h2.6.las
28G h2.7.las
26G h2.8.las
27G h2.9.las
28G h2.10.las
28G h2.11.las
...

thegenemyers commented 7 years ago

I think this is due to overflow of 32-bit integers (note the negative number for h2.10). I've checked in a new version that should fix the bug. Please let me know if not. -- Gene
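
As a rough check of the overflow reading, the negative count for h2.10 can be shifted by one wrap of 2^32 to see whether a plausible value comes out. This is a sketch under that assumption only; `add_one_wrap` is a hypothetical helper and the thread does not confirm how the printed value was actually produced.

```c
#include <stdint.h>

/* Hypothetical helper: undo a single wrap of 2^32 in a reported count. */
int64_t add_one_wrap(int64_t reported)
{
    return reported + (INT64_C(1) << 32);
}
```

`add_one_wrap(-4044725815)` gives 250241481, which is in line with the neighbouring parts (239012941 for h2.9 and 250723547 for h2.11), so a count that is off by exactly 2^32 fits the observed output.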
