thegenemyers / DALIGNER

Find all significant local alignments between reads

Unmanageable output size (and runtime) for daligner on a large dataset #78

Closed jwhitney31337 closed 6 years ago

jwhitney31337 commented 6 years ago

I'm trying to run daligner on a database of about 100X human PacBio Sequel data, using 60 HPC nodes with 20 cores/40 threads and 256GB of RAM each. Not realizing just how long the jobs were going to take, I may not have used optimal splitting/sizing of the database and jobs, but I've kept the nodes busy for over two weeks and they're only 20-30% done by my estimation. More specific details below. My immediate concern is that the .las files produced by the overlap jobs already number around 250k and total nearly 110TB so far! Most of these files are between 300MB and 500MB, so I fear I will need 400TB of space available for the complete set of alignments, which will be difficult to arrange.

My .db is about 62GB on disk and contains ~30M reads. I used DBsplit with default parameters:

$ DBsplit MYDB.db

I generated 211575 overlap jobs sized to use 20GB of memory as follows:

$ HPC.daligner -M20 -fMYDBscript MYDB.db
$ wc -l MYDBscript.01.OVL
211576

I divided these into 498 batches and submitted them to our cluster, requesting 30GB of vmem each (20GB was apparently not enough; I got many memory errors).
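For reference, a minimal sketch of the batching step described above, assuming a SLURM-style scheduler; the sbatch flags, chunk count, and file names are illustrative and not the settings actually used here:

$ # split the 211575 daligner command lines into 498 chunks (GNU split, no line is broken)
$ split -d -a 3 -n l/498 MYDBscript.01.OVL ovl_batch_
$ # submit each chunk as one job; memory per job matches the 30GB vmem request above,
$ # cores per job are illustrative
$ for b in ovl_batch_*; do
>     sbatch --mem=30G --cpus-per-task=4 --wrap="bash $b"
> done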

I understand there may not be much we can do about the long runtime, but I'm hoping the size of the alignments might be reduced somehow without losing much useful information.

Neither the documentation nor the other posts I've read so far on Gene's blog mention unmanageably large .las files. I would love to learn that I'm making a very elementary mistake!

I can provide more stats from the data if relevant.

thegenemyers commented 6 years ago

It sounds like you are not doing repeat masking. This is essential. See https://dazzlerblog.wordpress.com/2016/04/01/detecting-and-soft-masking-repeats -- Gene
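For anyone hitting the same wall: the blog post above describes running the DAMASKER tools before the main overlap run. A minimal sketch of that workflow, assuming the DAMASKER suite (HPC.TANmask, HPC.REPmask) is installed alongside DALIGNER; the -g/-c values and the track names (tan, rep1) follow the post's example and should be checked against the current DAMASKER documentation rather than taken from here:

$ # 1. find and soft-mask tandem repeats (produces a "tan" track)
$ HPC.TANmask MYDB > tan.script
$ bash tan.script
$ # 2. find and soft-mask interspersed repeats at 1-block granularity,
$ #    masking intervals covered far more deeply than the sequencing coverage
$ #    (produces a "rep1" track); -g/-c values are illustrative
$ HPC.REPmask -g1 -c20 MYDB > rep.script
$ bash rep.script
$ # 3. regenerate the overlap jobs, soft-masking with the repeat tracks
$ HPC.daligner -M20 -mtan -mrep1 -fMYDBscript MYDB

With the tandem and repeat tracks in place, daligner skips seed matches inside the soft-masked intervals, which is what the blog post reports keeps both the runtime and the .las output manageable for a repeat-rich genome like human.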
