odow / SDDP.jl

A JuMP extension for Stochastic Dual Dynamic Programming
https://sddp.dev

Gurobi out of memory error on HPC #642

Closed pauleseifert closed 1 year ago

pauleseifert commented 1 year ago

Hi!

I run into the following problem when trying to solve my model on an HPC with parallelisation. The error appears only on the HPC; on my PC the model runs fine, although it is constrained by the available working memory to too few iterations. The error is:

```
ERROR: LoadError: Gurobi Error 10001:
Stacktrace:
  [1] _check_ret
    @ ~/.julia/packages/Gurobi/EKa6j/src/MOI_wrapper/MOI_wrapper.jl:400 [inlined]
  [2] Gurobi.Env(; output_flag::Int64, memory_limit::Nothing, started::Bool)
    @ Gurobi ~/.julia/packages/Gurobi/EKa6j/src/MOI_wrapper/MOI_wrapper.jl:110
  [3] Env
    @ ~/.julia/packages/Gurobi/EKa6j/src/MOI_wrapper/MOI_wrapper.jl:102 [inlined]
  [4] Gurobi.Optimizer(env::Nothing; enable_interrupts::Bool)
    @ Gurobi ~/.julia/packages/Gurobi/EKa6j/src/MOI_wrapper/MOI_wrapper.jl:331
  [5] Optimizer
    @ ~/.julia/packages/Gurobi/EKa6j/src/MOI_wrapper/MOI_wrapper.jl:325 [inlined]
  [6] Gurobi.Optimizer()
    @ Gurobi ~/.julia/packages/Gurobi/EKa6j/src/MOI_wrapper/MOI_wrapper.jl:325
  [7] _instantiate_and_check(optimizer_constructor::Any)
    @ MathOptInterface ~/.julia/packages/MathOptInterface/864xP/src/instantiate.jl:94
  [8] instantiate(optimizer_constructor::Any; with_bridge_type::Type{Float64}, with_cache_type::Nothing)
    @ MathOptInterface ~/.julia/packages/MathOptInterface/864xP/src/instantiate.jl:175
  [9] set_optimizer(model::Model, optimizer_constructor::Any; add_bridges::Bool)
    @ JuMP ~/.julia/packages/JuMP/H2SWp/src/optimizer_interface.jl:361
 [10] set_optimizer
    @ ~/.julia/packages/JuMP/H2SWp/src/optimizer_interface.jl:354 [inlined]
 [11] _initialize_solver(node::SDDP.Node{Tuple{Int64, Int64}}; throw_error::Bool)
    @ SDDP ~/.julia/packages/SDDP/PZElX/src/algorithm.jl:325
 [12] _initialize_solver
    @ ~/.julia/packages/SDDP/PZElX/src/algorithm.jl:308 [inlined]
 [13] _initialize_solver(model::SDDP.PolicyGraph{Tuple{Int64, Int64}}; throw_error::Bool)
    @ SDDP ~/.julia/packages/SDDP/PZElX/src/algorithm.jl:343
 [14] _initialize_solver
    @ ~/.julia/packages/SDDP/PZElX/src/algorithm.jl:341 [inlined]
 [15] master_loop(async::SDDP.Asynchronous, model::SDDP.PolicyGraph{Tuple{Int64, Int64}}, options::SDDP.Options{Tuple{Int64, Int64}})
    @ SDDP ~/.julia/packages/SDDP/PZElX/src/plugins/parallel_schemes.jl:238
 [16] train(model::SDDP.PolicyGraph{Tuple{Int64, Int64}}; iteration_limit::Int64, time_limit::Nothing, print_level::Int64, log_file::String, log_frequency::Int64, log_every_seconds::Float64, run_numerical_stability_report::Bool, stopping_rules::Vector{SDDP.AbstractStoppingRule}, risk_measure::SDDP.Expectation, sampling_scheme::SDDP.InSampleMonteCarlo, cut_type::SDDP.CutType, cycle_discretization_delta::Float64, refine_at_similar_nodes::Bool, cut_deletion_minimum::Int64, backward_sampling_scheme::SDDP.CompleteSampler, dashboard::Bool, parallel_scheme::SDDP.Asynchronous, forward_pass::SDDP.DefaultForwardPass, forward_pass_resampling_probability::Nothing, add_to_existing_cuts::Bool, duality_handler::SDDP.ContinuousConicDuality, forward_pass_callback::SDDP.var"#97#104", post_iteration_callback::SDDP.var"#98#105")
    @ SDDP ~/.julia/packages/SDDP/PZElX/src/algorithm.jl:1100
 [17] top-level scope
    @ ~/SDDP/Versjon_Paul_other_prices.jl:295
in expression starting at /home/paules/SDDP/Versjon_Paul_other_prices.jl:295
      From worker 16:  Set parameter TokenServer to value "10.1.1.1"
┌ Warning: Forcibly interrupting busy workers
│   exception = rmprocs: pids [15, 16] not terminated after 5.0 seconds.
└ @ Distributed /share/apps/Julia/1.9.2-linux-x86_64/share/julia/stdlib/v1.9/Distributed/src/cluster.jl:1253
┌ Warning: rmprocs: process 1 not removed
└ @ Distributed /share/apps/Julia/1.9.2-linux-x86_64/share/julia/stdlib/v1.9/Distributed/src/cluster.jl:1049
```

The HPC runs Rocks 7.0, and the instance I'm running the code on has 384 GB of RAM. Versions: Julia@1.9.2, Gurobi@10.0.2, SDDP@1.6.0.

The problem persists across different Gurobi and Julia versions, and different machines with the same operating system throw the same error message. A serial version of the problem runs, but takes very long to reach the iteration limit. I call the training function with:

```julia
SDDP.train(
    model,
    iteration_limit = 3000,
    parallel_scheme = SDDP.Asynchronous() do m::SDDP.PolicyGraph
        env = Gurobi.Env()
        GRBsetdblparam(env, "OutputFlag", 0)
        GRBsetdblparam(env, "LogToConsole", 0)
        set_optimizer(m, () -> Gurobi.Optimizer(env))
        add_to_existing_cuts = true
    end,
)
```
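
As an aside, and not as a fix for the memory error: in the snippet above, `add_to_existing_cuts = true` is a plain local assignment inside the `do` block, so it has no effect on training; it is a keyword argument of `SDDP.train` (it appears as one in the stack trace). The Gurobi logging could also be silenced through the `output_flag` keyword of `Gurobi.Env`, which likewise shows up in the stack trace. A sketch of that variant, under those assumptions:

```julia
SDDP.train(
    model;
    iteration_limit = 3000,
    add_to_existing_cuts = true,  # keyword of SDDP.train, not of the do-block
    parallel_scheme = SDDP.Asynchronous() do m::SDDP.PolicyGraph
        env = Gurobi.Env(output_flag = 0)  # one environment per worker, logging silenced
        set_optimizer(m, () -> Gurobi.Optimizer(env))
    end,
)
```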

Parameters are global and I load the additional workers at the beginning of the file:

```julia
using Distributed
Distributed.addprocs(5)
@everywhere using Gurobi
@everywhere using SDDP
```

Any ideas on how to fix this?

odow commented 1 year ago

These things are very hard to debug. I don't have any real suggestions. Improving the parallel support is on my TODO list: https://github.com/odow/SDDP.jl/issues/599.

pauleseifert commented 1 year ago

I even have the header from the Asynchronous solve:


```
SDDP.jl (c) Oscar Dowson and contributors, 2017-23

problem
  nodes           : 2521
  state variables : 75
  scenarios       : 1.19768e+29
  existing cuts   : false
options
  solver          : Asynchronous mode with 5 workers.
  risk measure    : SDDP.Expectation()
  sampling scheme : SDDP.InSampleMonteCarlo
subproblem structure
  VariableRef                             : [168, 168]
  AffExpr in MOI.EqualTo{Float64}         : [13, 18]
  AffExpr in MOI.LessThan{Float64}        : [44, 46]
  VariableRef in MOI.EqualTo{Float64}     : [32, 77]
  VariableRef in MOI.GreaterThan{Float64} : [21, 60]
  VariableRef in MOI.LessThan{Float64}    : [17, 56]
numerical stability report
  matrix range     [1e+00, 1e+00]
  objective range  [1e+00, 1e+04]
  bounds range     [1e+01, 2e+07]
  rhs range        [5e+00, 1e+01]
WARNING: numerical stability issues detected
  - bounds range contains large coefficients
Very large or small absolute values of coefficients can cause numerical stability issues. Consider reformulating the model.
```

The problem starts after the Gurobi instances have been initialised (I get quite a few licence messages that I can't mute). It seems to be tied to Linux only. Do you know of any additional verbose options for Gurobi for troubleshooting? I tried to lower the Gurobi environment's memory limit with `GRBsetdblparam(env, "MemLimit", 15.0)`, but this didn't help.

There is no batch scheduler installed and I should have full access to the server.

```julia
julia> Sys.free_memory() / 2^20
353859.84375
```
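
The same check can be run on the workers too; a small sketch using the `Distributed` setup from above (nothing here is specific to SDDP.jl):

```julia
using Distributed

# Print the free memory (in GiB) seen by the master process and each worker.
for p in procs()
    free_gib = remotecall_fetch(() -> Sys.free_memory() / 2^30, p)
    println("process $p: $(round(free_gib; digits = 1)) GiB free")
end
```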

Happy to share more info if useful!

odow commented 1 year ago

So the problem is this:

Your model has 2521 nodes, and you are running with five workers. Since SDDP.jl doesn't share models between workers, SDDP.jl is going to create 2521 * 5 = 12,605 Gurobi models. That leaves roughly 30 MB of memory for each model, and some of that is taken up by the model itself and some by SDDP-related things. So it's plausible that you really are running out of memory with this problem.
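
For a rough sense of the numbers, a back-of-the-envelope check using the figures quoted above:

```julia
# Back-of-the-envelope memory budget per Gurobi model.
nodes   = 2521
workers = 5
ram_gib = 384
models  = nodes * workers                # 12_605 Gurobi models in total
per_model_mib = ram_gib * 1024 / models  # ≈ 31 MiB of RAM per model
```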

Why do you have so many nodes in the graph? Is it a Markovian graph?

pauleseifert commented 1 year ago

Yes, the problem is Markovian and has 42+1 stages at which consecutive decisions can be made. The whole point of the exercise is to try out something new that is larger than existing applications. However, it runs with the same parameters on my MBP with 32 GB of physical RAM and little to no caching. I don't expect macOS to do any magic in memory management.

odow commented 1 year ago

> However, it runs with the same parameters

Including 5 parallel threads?

> and little to no caching. I don't expect macOS to do any magic in memory management

If it works on your Mac and not on the HPC, then I don't know if this is easy for me to test or debug. Have you looked at actual RAM usage with top on your big machine when running?

Just let it run in serial and wait a bit longer. The 60 Markov states don't slow it down too much, because the backward pass implements a trick that updates every node in the stage at each iteration.
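
For completeness, the serial run is just the same call without the asynchronous scheme; a minimal sketch reusing the iteration limit from the original snippet:

```julia
# Serial training; SDDP.Serial() is the default parallel scheme,
# so the keyword could also be omitted entirely.
SDDP.train(
    model;
    iteration_limit = 3000,
    parallel_scheme = SDDP.Serial(),
)
```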

odow commented 1 year ago

Without a reproducible example, I don't know if there's much we can do here. I'm tempted to close this issue and mark it as a request for https://github.com/odow/SDDP.jl/issues/599.

pauleseifert commented 1 year ago

The problem was caused by the file system on the cluster. Moving to another drive solved the issue. Nothing wrong with your code :)