Accumulation of memory or memory leak?

chillenzer commented 1 year ago

Hi, is there any part of the program that is expected to accumulate a significant amount of memory during a run? I had my recent jobs on Sunbird killed by oom-kill after a few hundred configurations, despite starting off fine. What particularly bothered me was that simulations on very different lattices (implying very different memory consumption) all started off fine with the same memory requested, and all were killed after some time. This suggests that memory is indeed accumulating over time, rather than the jobs simply being too big for the resources requested. If no such accumulation is expected, I'm afraid there's a memory leak and I will have to debug that at some point.
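
One way to put a number on the accumulation, as a minimal sketch: assuming the resident memory of a job has been recorded once per configuration (by whatever means the cluster offers), a least-squares slope gives the growth per configuration. The function name and the sample values below are made up for illustration and are not measurements from these runs.

```python
def growth_per_config(rss_mb):
    """Least-squares slope of memory use (MB) against configuration index."""
    n = len(rss_mb)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(rss_mb) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(rss_mb))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


# Hypothetical samples: memory in MB after each of five configurations.
samples = [1000.0, 1010.0, 1020.0, 1030.0, 1040.0]
slope = growth_per_config(samples)
print(f"growth per configuration: {slope:.1f} MB")
```

A slope near zero would point to a job that is simply too large for the request, while a clearly positive slope that looks similar across lattice sizes would be consistent with a leak.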

mmesiti commented 1 year ago

I remember seeing this problem already and briefly investigating it (it was some years ago, though). There are obvious suspects, like the allocate calls here and there in the code. I think I checked those but could not find anything wrong with them. I did not investigate further because just increasing the memory limit would solve the problem.

chillenzer commented 1 year ago

Okay. Thanks. Might have a look later. What memory needs should I expect?

mmesiti commented 1 year ago

I think that on Sunbird and at Cambridge, using (n_used_cores_on_node/n_node_cores)*node_memory was working fine. By the way, I do agree that the situation is not nice and the issue should be fixed; it was just a matter of priorities.
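
As a worked instance of that rule of thumb, here is a minimal sketch. The function name and the node figures (40 cores, 384 GB) are hypothetical and are not the actual specifications of either machine.

```python
def memory_request_gb(n_used_cores_on_node, n_node_cores, node_memory_gb):
    """Request a share of node memory proportional to the share of cores used."""
    if not 0 < n_used_cores_on_node <= n_node_cores:
        raise ValueError("used cores must be between 1 and the node core count")
    return n_used_cores_on_node / n_node_cores * node_memory_gb


# Hypothetical node with 40 cores and 384 GB, job using 10 cores on it.
print(memory_request_gb(10, 40, 384.0))  # 96.0
```

In other words, a job that occupies a quarter of the cores on a node asks for a quarter of that node's memory.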

sa2c / thirring-rhmc

Accumulation of memory or memory leak? #6