thegenemyers / DALIGNER

Find all significant local alignments between reads

2Gbp limit on DB blocks? #88

Closed · pbnjay closed this issue 5 years ago

pbnjay commented 5 years ago

We're testing daligner on a de novo assembly of a 13-gigabase genome, but daligner gives me the following error:

daligner2.0: Fatal error, DB blocks are greater than 2Gbp!

Is there documentation somewhere on the limits? We have 41M reads; I tried a subset of 10M reads (~107 Gbp total), but it fails with the same error as above. I've also tried different DBsplit sizes, but this error doesn't seem to come from that code path.

This system has 2 TB of RAM, so it can use quite a bit more memory if that is the concern.
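For what it's worth, my guess (an assumption on my part, not something from the docs) is that the cap is about coordinate width rather than memory: if intra-block base positions are stored as signed 32-bit integers, then

    2^31 - 1 = 2,147,483,647 bp ~ 2.1 Gbp

is the largest position a block can address, and more RAM wouldn't help.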

pbnjay commented 5 years ago

To provide just a bit more info about our situation, here are the totals:

 Statistics for all reads in the data set

      32,255,959 reads        out of      41,818,118  ( 77.1%)
 355,794,256,224 base pairs   out of 441,985,913,537  ( 80.5%)

Based on the 2Gbp number, I'd have to do something like 150-200 separate runs, correct?
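Rough arithmetic for reference, treating 2 Gbp as a hard per-block cap:

    355,794,256,224 bp / 2 Gbp per block  ~ 178 blocks
    178 * 179 / 2                         ~ 15,931 block-vs-block comparisons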

thegenemyers commented 5 years ago

You need to call DBsplit to partition the database into blocks.
daligner is not a "monolithic" application that you simply call on the data. You have to split the DB into blocks, which will be the unit of parallelism in your cluster runs, and you can use HPC.daligner to produce a script of commands that compares all the blocks against each other. -- Gene
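In outline, that workflow looks like the following, assuming a database named ASM (the -s/-x values are placeholders, not recommendations):

    # partition the DB into blocks of at most 1 Gbp, ignoring reads shorter than 1 kbp
    DBsplit -s1000 -x1000 ASM
    # emit a shell script that runs daligner on every pair of blocks
    HPC.daligner ASM > daligner_jobs.sh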

pbnjay commented 5 years ago

Thanks for the response! Yes, I'm calling DBsplit on these files, and running the first pair of blocks as a test.

The default 200 Mbp split size is clearly too small: each job uses only ~50 GB of RAM and spends a lot of time on I/O. It also produces 1,779 blocks and 396k jobs.

A 1,000 Mbp split size is still small: each job uses only ~154 GB of RAM, with 356 blocks and 16k jobs. Quite a bit better, but still under 10% of available memory.

An 1,800 Mbp split size gives 198 blocks and 5k jobs, which is more reasonable. It allocates about 180 GB of RAM and starts the "Comparing" stage, but then segfaults. I'm guessing the 2 Gbp number is an estimate and I'm hitting the true limit here?

I would love to get to around a 500 GB allocation, but any split size of 3,200 Mbp and up gives the error message above.
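As an aside, per-job memory isn't controlled by block size alone; if I'm reading the daligner options right, its -M flag caps memory use per job (the value below is illustrative):

    # compare blocks 1 and 2, asking daligner to use at most ~500 GB
    daligner -M500 ASM.1 ASM.2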

thegenemyers commented 5 years ago

Looks like you need to do repeat masking. Your genome seems highly repetitive. -- Gene
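One way to do that is with the companion DAMASKER suite; a sketch, with illustrative -g/-c settings rather than recommendations:

    # find and mask tandem repeats
    HPC.TANmask ASM > tan_jobs.sh
    # find and mask interspersed repeats: groups of 1 block, mask spans covered >20x
    HPC.REPmask -g1 -c20 ASM > rep_jobs.sh
    # after running both scripts, supply the masks as soft-mask tracks to the all-vs-all run
    HPC.daligner -mtan -mrep ASM > daligner_jobs.sh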

pbnjay commented 5 years ago

Yes, it's a highly repetitive hexaploid genome, and we have an initial assembly (which is highly collapsed), so it would be pointless to mask at this point.

I'm happy to dig into the code, but I was just hoping for some explanation of the limits. It's difficult to tell whether this is an implementation-specific issue or a problem inherent to the algorithm itself when applied to data at this scale.