lightdock / lightdock-python2.7

Protein-protein, protein-peptide and protein-DNA docking framework based on the GSO algorithm
https://lightdock.org/
GNU General Public License v3.0

OOM prevents program completion #16

Closed: nclement closed this issue 5 years ago

nclement commented 5 years ago


Expected Behavior

If a thread exceeds memory, one of two actions would be expected:

  1. The program dies early (fail fast), so the job can be resubmitted immediately, or
  2. The program continues with a reduced number of threads.

Current Behavior

The threads cannot continue ("OSError: [Errno 12] Cannot allocate memory" followed by "MemoryError"), but the program keeps running until the currently-running threads complete, and only then reports "[lightdock] ERROR: Lightdock has failed, please check traceback:"

Possible Solution

Automatically reduce the number of threads if one of them dies.
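
A sketch of the behavior I have in mind (illustration only, not LightDock's actual code; `worker` and `tasks` are placeholders for whatever per-swarm function and task list LightDock dispatches to its process pool):

```python
import multiprocessing

def run_batch(worker, tasks, cores):
    """Run tasks in a pool, halving the pool size whenever memory runs out."""
    while cores >= 1:
        pool = None
        try:
            # Pool creation itself can fail with OSError (Errno 12) on fork.
            pool = multiprocessing.Pool(processes=cores)
            results = pool.map(worker, tasks)  # re-raises MemoryError from workers
            pool.close()
            pool.join()
            return results
        except (MemoryError, OSError):
            if pool is not None:
                pool.terminate()
                pool.join()
            cores //= 2  # retry the whole batch with half the workers
    raise RuntimeError("Out of memory even with a single worker")
```

Rerunning the whole batch wastes the tasks that had already finished, but it would still be far cheaper than a 12-24 hour resubmission.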

Steps to Reproduce

  1. Download 1Y64_r_u.pdb and 1Y64_l_u.pdb from Zlab benchmark 5, remove all HETATM records (a small script for this is sketched after the list)
  2. lightdock_setup -anm 1Y64_r_u.pdb 1Y64_l_u.pdb 400 300
  3. lightdock setup.json -c 100 100
  4. failure
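
For step 1, stripping the HETATM records takes a few lines of Python (the `_noHET.pdb` output naming is just my choice):

```python
# Remove all HETATM records from the two benchmark structures (step 1).
for name in ("1Y64_r_u.pdb", "1Y64_l_u.pdb"):
    with open(name) as fin:
        kept = [line for line in fin if not line.startswith("HETATM")]
    with open(name.replace(".pdb", "_noHET.pdb"), "w") as fout:
        fout.writelines(kept)
```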

Context (Environment)

Current supercomputer nodes have 48 cores with two hardware threads per core, for 96 hardware threads per node. Each node has 192GB of RAM, which is apparently insufficient for 96 threads (only 2GB per worker), so the program crashes. I've tried reducing this to 10 cores, but the run still fails. I could try a single core, but I imagine it would then take too long to finish at all, and I'm not sure I want to do that.

Ideally, this would complete without having to restart the job (in a supercomputing environment, re-submitting the job takes 12-24 hours, so the turnaround is extremely slow).


brianjimenez commented 5 years ago

I cannot reproduce this behavior on a MacBook Pro laptop with 8 cores and 16GB RAM using -c 100: the system becomes unstable, but it does not crash with that exception. While I try to reproduce it on a cluster with a similar configuration, please try the following workaround:

The command line should look like: lightdock setup.json 100 -c 48 -s fastdfire

Note that if distributed memory is used, the -mpi flag should be enabled, although it is an experimental option that has not been properly tested in recent releases.

nclement commented 5 years ago

I'm guessing this is because your MacBook Pro makes use of virtual memory, whereas the supercomputer nodes are configured with zero virtual memory (a hard limit, which causes the out-of-memory error seen above).

I'm still getting the same issue with only 48 cores and the fastdfire energy model. Is there any workaround? Can I get the program to die early if it runs out of memory? Or even restart with half the number of cores when this happens?

I'd like to automate this and run quite a few samples (~100), so repeating this process by hand is extremely expensive.

Any help you can give would be great!
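
For reference, the kind of wrapper I have in mind would look roughly like this (a rough sketch only; the step count and fastdfire choice are copied from your suggestion above, and whether a rerun can reuse partial results is an open question):

```python
import subprocess

def run_with_backoff(steps=100, cores=48):
    """Run lightdock, halving -c after every failure (e.g. an OOM kill)."""
    while cores >= 1:
        cmd = ["lightdock", "setup.json", str(steps),
               "-c", str(cores), "-s", "fastdfire"]
        if subprocess.call(cmd) == 0:
            return cores  # succeeded with this many cores
        cores //= 2
    raise RuntimeError("lightdock failed even with a single core")
```

Looping this over the ~100 samples would at least avoid the 12-24 hour resubmission turnaround.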


brianjimenez commented 5 years ago

LightDock has been tested on several supercomputers and clusters and has never hit this problem. In the 4G6M example there are a couple of examples of sending simulation and analysis jobs to the queue (https://brianjimenez.github.io/lightdock/4G6M.html). I think this will depend on your cluster architecture; if you don't mind sharing the details with me, I could give you some insights. We can continue the conversation here or by email.