glennhickey / progressiveCactus

Distribution package for the Progressive Cactus multiple genome aligner. Dependencies are linked as submodules.

caf requesting large amounts of memory #36

Closed ens-bwalts closed 9 years ago

ens-bwalts commented 9 years ago

I'm trying to run ProgressiveCactus on the blanchette test data, and on our system it's requesting a huge amount of memory:

Got message from job at time: 1427464521.1 : Starting caf phase target with index 0 at 1427464521.06 seconds (recursing = 1)
...
RuntimeError: Requesting more memory than available. Requested: 137438953472, Available: 35359738368

I'm running this on a single node (the cluster it's on runs LSF, but I'm letting --bigBatchSystem default to none). The command line I'm using is

bin/runProgressiveCactus.sh --logDebug --logFile=/gpfs/nobackup/cactus2/out/ep1_test.log --stats --maxThreads=4 --maxCpus=4 --maxMemory=35359738368 --database=tokyo_cabinet examples/blanchette00.txt /gpfs/nobackup/cactus2/work/ /gpfs/nobackup/cactus2/out/blanchette.hal

It seems like it may be getting confused on how much memory it needs. I've run the sample data on a different machine with default --maxMemory. Any ideas on what may be causing caf to overestimate its memory requirements so badly?

benedictpaten commented 9 years ago

Hmm, not sure what's going on... if I'm reading that right, it's requesting 137 GB, which is obviously crazy for that small test example. Maybe Joel knows what's going on.

joelarmstrong commented 9 years ago

I think this is because, by default, we request a huge amount of memory from the batch system for the caf phase. In this case it's requesting a generous amount that should cover even a mammal alignment. You can adjust the config in cactus_progressive_config.xml, changing the value of the "bigMemory" attribute in the defines tag to something much lower, although I'd suggest leaving it quite high if you're going to be aligning whole genomes.
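For reference, a hypothetical excerpt of that part of cactus_progressive_config.xml, assuming the defaults implied by this thread (bigMemory of 137438953472 and mediumMemory of 34359738368); the surrounding tags may differ in your copy of the file, so check it before editing:

    <!-- Hypothetical excerpt; only the bigMemory attribute is the one named
         above. Lowering it makes the caf job's request fit on this node. -->
    <constants>
        <defines mediumMemory="34359738368" bigMemory="34359738368"/>
    </constants>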

We've considered scaling the memory request by the length of the genomes involved, but the memory usage is actually quite unpredictable due to differences in assembly quality. It sucks to lose an alignment because you reserved too little memory, so the default is set very high to cover almost all cases.

ens-bwalts commented 9 years ago

Changing bigMemory in cactus_progressive_config.xml worked. Thanks!

However, I played around with it a bit more, and it looks like there may be something wrong with setting --maxMemory on the command line:

1) In cactus_progressive_config.xml, set bigMemory to 137438953472.
2) Run runProgressiveCactus without setting --maxMemory on the command line.

Result: successful run.

However, setting --maxMemory=34359738369 (mediumMemory + 1) results in a failure where caf requests bigMemory bytes.

joelarmstrong commented 9 years ago

I think that's due to the separation between our scheduling system (jobTree) and Cactus itself: "maxMemory" is a jobTree option that just means jobs will fail if they ask for more than that many bytes of memory. The amount of memory each job asks for is still controlled by the config file regardless of the limit, so the jobs simply fail instead of requesting less.
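In other words, a minimal sketch of the behaviour described above (not jobTree's actual code): the limit only gates requests, it never rescales them.

    # Illustrative sketch, not jobTree's implementation: --maxMemory is a
    # ceiling that each job's request is checked against, not a value that
    # shrinks what the job asks for.
    def check_memory_request(requested_bytes, max_memory_bytes):
        if requested_bytes > max_memory_bytes:
            raise RuntimeError("Requesting more memory than available. "
                               "Requested: %i, Available: %i"
                               % (requested_bytes, max_memory_bytes))

    # The caf job still requests bigMemory from the config file, so with the
    # values from the log above the check fails:
    try:
        check_memory_request(137438953472, 35359738368)
    except RuntimeError as e:
        print(e)  # same error as reported at the top of this issue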

Anyway, sorry for the annoyance. Hopefully the workaround of modifying the config file works adequately.