These things are very hard to debug. I don't have any real suggestions. Improving the parallel support is on my TODO list: https://github.com/odow/SDDP.jl/issues/599.
I even have the header from the Asynchronous solve:

SDDP.jl (c) Oscar Dowson and contributors, 2017-23

problem
  nodes           : 2521
  state variables : 75
  scenarios       : 1.19768e+29
  existing cuts   : false
options
  solver          : Asynchronous mode with 5 workers.
  risk measure    : SDDP.Expectation()
  sampling scheme : SDDP.InSampleMonteCarlo
subproblem structure
  VariableRef                             : [168, 168]
  AffExpr in MOI.EqualTo{Float64}         : [13, 18]
  AffExpr in MOI.LessThan{Float64}        : [44, 46]
  VariableRef in MOI.EqualTo{Float64}     : [32, 77]
  VariableRef in MOI.GreaterThan{Float64} : [21, 60]
  VariableRef in MOI.LessThan{Float64}    : [17, 56]
numerical stability report
  matrix range     [1e+00, 1e+00]
  objective range  [1e+00, 1e+04]
  bounds range     [1e+01, 2e+07]
  rhs range        [5e+00, 1e+01]
WARNING: numerical stability issues detected
  - bounds range contains large coefficients
Very large or small absolute values of coefficients can cause numerical stability issues. Consider reformulating the model.
The problem starts after the Gurobi instances have been initialised (I get quite a few licence messages that I cannot mute). It seems to be tied to Linux only. Do you know of any additional verbose options for Gurobi for troubleshooting? I tried to lower the Gurobi environment's memory limit with GRBsetdblparam(env, "MemLimit", 15.0), but this didn't help.
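(For reference, a hedged sketch of how Gurobi parameters can be attached to the optimizer used by SDDP.jl, and how reusing one Gurobi environment per Julia process avoids repeated licence messages. The parameter values are illustrative, and with Distributed workers the environment has to be created on each worker, e.g. via @everywhere.)

```julia
using SDDP, Gurobi, JuMP

# One Gurobi environment per Julia process, so each subproblem does not
# trigger its own licence message. With Distributed workers, wrap this in
# @everywhere so every worker creates its own environment.
const GRB_ENV = Gurobi.Env()

# Illustrative values: "OutputFlag" silences the solver log and "MemLimit"
# (in GB) caps memory per Gurobi model.
gurobi = optimizer_with_attributes(
    () -> Gurobi.Optimizer(GRB_ENV),
    "OutputFlag" => 0,
    "MemLimit" => 15.0,
)

# `gurobi` is then passed as the `optimizer` keyword when building the
# policy graph (e.g. SDDP.PolicyGraph / SDDP.MarkovianPolicyGraph).
```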
There is no batch scheduler installed and I should have full access to the server.

julia> Sys.free_memory() / 2^20
353859.84375
Happy to share more info if useful!
So the problem is this:
Your model has 2521 nodes, and you are running with five workers. Since SDDP.jl doesn't share models between workers, SDDP.jl is going to create 2521 * 5 = 12,605 Gurobi models. That leaves you roughly 30 MB of memory per model, and some of that is taken up by the Gurobi model itself while some is taken up by SDDP.jl-related data. So it's plausible that you actually are running out of memory with this problem.
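A quick back-of-envelope version of that arithmetic (assuming the 384 GB of RAM reported for the HPC node elsewhere in the thread):

```julia
# Rough memory budget per Gurobi model, assuming 384 GB of RAM on the node.
n_nodes, n_workers = 2521, 5
n_models = n_nodes * n_workers      # 12,605 Gurobi models in total
384e9 / n_models                    # ≈ 3.05e7 bytes, i.e. ~30 MB per model
```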
Why do you have so many nodes in the graph? Is it a Markovian graph?
Yes, the problem is Markovian and has 42+1 stages in which consecutive decisions can be made. The whole point of the exercise is to try out something new that is larger than existing applications. However, it runs with the same parameters on my MBP with 32 GB of physical RAM and little to no caching. I don't expect macOS to do any magic in memory management.
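(For context, a graph of this shape can be described with SDDP.MarkovianGraph. This is only a hedged sketch: the uniform transition matrices are placeholders, and the 60-state / 42+1-stage dimensions are taken from the discussion in this thread, not from the actual model.)

```julia
using SDDP

# Placeholder shape only: a single first-stage node followed by 42 stages
# with 60 Markov states each, i.e. 1 + 42 * 60 = 2521 nodes in total.
n_states = 60

transition_matrices = vcat(
    [ones(1, 1)],                         # stage 1: one state
    [fill(1.0 / n_states, 1, n_states)],  # stage 2: fan out to 60 states
    [fill(1.0 / n_states, n_states, n_states) for _ in 3:43],  # stages 3-43
)

graph = SDDP.MarkovianGraph(transition_matrices)
# The graph is then passed to SDDP.PolicyGraph together with a subproblem
# builder; every node gets its own Gurobi model on every worker.
```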
> However, it runs with the same parameters
Including 5 parallel threads?
> and little to no caching. I don't expect macOS to do any magic in memory management
If it works on your Mac and not on the HPC, then I don't know if this is easy for me to test or debug. Have you looked at actual RAM usage with top on your big machine when running?
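(If top is awkward to watch on the cluster, a rough in-session check is possible. This is only a sketch, and it reports machine-wide free memory as seen by each Julia process rather than per-process usage.)

```julia
using Distributed

# Print free memory as seen from the main process and every worker.
for p in procs()
    free_gib = remotecall_fetch(() -> Sys.free_memory() / 2^30, p)
    println("process ", p, ": ", round(free_gib, digits = 1), " GiB free")
end
```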
Just let it run serial and wait a bit longer. The 60 Markov states don't slow it down too much, because the backwards pass implements a trick that updates every node in the stage at each iteration.
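A serial run only needs the parallel scheme swapped; a minimal sketch, where model is the policy graph and the iteration limit is illustrative:

```julia
# Serial training: a single copy of each subproblem, no worker processes.
SDDP.train(
    model;
    iteration_limit = 100,              # illustrative value
    parallel_scheme = SDDP.Serial(),    # the default scheme
)
```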
Without a reproducible example, I don't know if there's much we can do here. I'm tempted to close this issue and mark it as a request for https://github.com/odow/SDDP.jl/issues/599.
The problem was caused by the file system on the cluster. Moving to another drive solved the issue. Nothing wrong with your code :)
Hi!
I run into the following problem when trying to solve my model on an HPC with parallelisation. The error appears on the HPC only; on my PC it runs fine, but it is constrained to too few iterations by the available working memory. The error code is:
The HPC runs Rocks 7.0 and has 384 GB of RAM on the node I'm running the code on. Versions: Julia 1.9.2, Gurobi 10.0.2, SDDP.jl 1.6.0.
The problem persists across different Gurobi and Julia versions. Also, different machines with the same operating system throw the same error message. A serial version of the problem runs but takes a very long time to reach the iteration limit. I call the training function with:
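(The exact call is not reproduced here; a minimal sketch of an asynchronous training call, with the five workers and an iteration limit assumed from the solve header above:)

```julia
# Illustrative only: the real limits and stopping rules are not shown here.
SDDP.train(
    model;
    iteration_limit = 1000,
    parallel_scheme = SDDP.Asynchronous(),
)
```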
Parameters are global and I load the additional workers at the beginning of the file:
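(Again only a sketch of that setup, assuming Distributed.addprocs with the five workers reported in the header and the packages loaded on every worker:)

```julia
using Distributed

# Five workers, matching the "Asynchronous mode with 5 workers" header.
addprocs(5)

# Packages (and any globals the subproblems need) must be available on
# every worker.
@everywhere using SDDP, Gurobi, JuMP
```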
Any ideas on how to fix this?