Closed: nclement closed this issue 5 years ago.
I cannot reproduce this behavior on a MacBook Pro laptop with 8 cores and 16GB of RAM using -c 100. The system becomes unstable, but it does not crash with that exception. While I try to reproduce it on a cluster with a similar configuration, please try the following workaround: change -c 100 to -c 48 and use the -s fastdfire scoring function. The command line should look like:
lightdock setup.json 100 -c 48 -s fastdfire
It is important to note that if distributed memory is used, the -mpi flag should be enabled, although it is an experimental option that has not been properly tested in recent releases.
I'm guessing this is because your MacBook Pro makes use of virtual memory, whereas the supercomputer nodes are configured with 0 virtual memory (a hard limit, which causes the out-of-memory error seen above).
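One way to test this hypothesis is to compare the address-space limit on the laptop versus the cluster nodes. Below is a minimal diagnostic sketch using Python's standard resource module (Unix-only); it is a generic check, not part of LightDock:

```python
import resource

def describe_limit(limit):
    """Render an rlimit value in human-readable form."""
    if limit == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit / 2**30:.1f} GiB"

# RLIMIT_AS caps the total virtual address space of the process.
# Laptops usually report "unlimited", while HPC nodes configured with
# no swap often enforce a hard cap, which can surface as
# "OSError: [Errno 12] Cannot allocate memory" or MemoryError.
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
print(f"virtual memory limit: soft={describe_limit(soft)}, hard={describe_limit(hard)}")
```

Running this on both machines (or inside a batch job on the cluster) would show whether the nodes really enforce a hard cap.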
I'm still getting the same issue with only 48 cores and the fastdfire energy model. Is there any workaround? Can I get the program to die early if it runs out of memory, or even restart with half the number of cores when this happens?
I'd like to automate this and run quite a few samples (~100), so repeating the process by hand is extremely expensive.
Any help you can give would be great!
On Thu, Mar 28, 2019 at 3:03 PM Brian Jimenez notifications@github.com wrote:
Closed #16 https://github.com/brianjimenez/lightdock/issues/16.
LightDock has been tested on several supercomputers and clusters and we have never seen this problem. The 4G6M tutorial includes a couple of examples of sending simulation and analysis jobs to the queue (https://brianjimenez.github.io/lightdock/4G6M.html). I think this will depend on your cluster architecture; if you don't mind sharing the details with me, I could give you some insights. We can continue the conversation here or by email.
Expected Behavior
If a thread exceeds available memory, one of two actions would be expected: either the program fails early with a clear error, or it continues with a reduced number of threads.
Current Behavior
The failing threads cannot continue ("OSError: [Errno 12] Cannot allocate memory" followed by "MemoryError"), but the program keeps running until the currently-running threads complete, then reports "[lightdock] ERROR: Lightdock has failed, please check traceback:"
Possible Solution
Automatically reduce the number of threads if one of them dies.
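One way to get the fail-early behavior today, without changes to LightDock, is to cap each worker's address space so an oversized allocation raises MemoryError immediately instead of destabilizing the whole node. Below is a sketch using multiprocessing and resource; the 2 GiB budget and the allocate task are hypothetical stand-ins, and RLIMIT_AS enforcement requires Linux:

```python
import multiprocessing as mp
import resource

def limit_worker_memory(max_bytes):
    # Pool initializer: cap this worker's virtual address space so a
    # runaway allocation fails fast instead of exhausting the node.
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))

def allocate(n_bytes):
    # Hypothetical stand-in for one per-worker scoring task.
    return len(bytearray(n_bytes))

if __name__ == "__main__":
    budget = 2 * 2**30  # hypothetical 2 GiB budget per worker
    with mp.Pool(processes=2, initializer=limit_worker_memory,
                 initargs=(budget,)) as pool:
        print(pool.apply_async(allocate, (2**20,)).get())  # fits: prints 1048576
        try:
            pool.apply_async(allocate, (8 * 2**30,)).get()  # exceeds the cap
        except MemoryError:
            print("worker hit its cap and failed early")
```

A scheduler could combine this with the suggested solution: when a capped worker dies with MemoryError, resubmit its task to a smaller pool rather than letting the whole run limp to a failed finish.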
Context (Environment)
Current supercomputer nodes have 48 cores with two hardware threads per core, for a total of 96 hardware threads per node. Each node has 192GB of RAM, but this is apparently insufficient for 96 threads, so the program crashes. I've tried reducing this to 10 cores, but the run still finishes unsuccessfully. I could try a single core, but I imagine that would take too long to even finish, and I'm not sure I want to do that.
Ideally, this would complete without having to restart the job (in a supercomputing environment, re-submitting a job takes 12-24 hours, so the turnaround is extremely slow).
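For context, a back-of-envelope split of that RAM (assuming memory is divided evenly across worker processes, which is a simplification):

```python
ram_gb = 192
hw_threads = 96   # 48 cores x 2 hardware threads
cores = 48

# Evenly divided, each worker's budget is small enough that a scoring
# task with a large per-process footprint can plausibly exhaust it.
per_thread_gb = ram_gb / hw_threads
per_core_gb = ram_gb / cores
print(f"{per_thread_gb:.0f} GB per hardware thread, {per_core_gb:.0f} GB per physical core")
# prints: 2 GB per hardware thread, 4 GB per physical core
```

With a hard 0-virtual-memory limit on the nodes, any worker that momentarily needs more than its even share triggers the allocation failure.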