Everything you said is correct :)
I chose the asynchronous version because it was easy to implement. You can easily think up much more ambitious parallelization schemes (as you have done).
The main issue with the asynchronous version is that with lots of cores, you end up doing lots of iterations with poor approximations of the value function. (Consider the first iteration: each core does an iteration with no value information, so you discover 72 useless cuts.) The second issue is that the master process collecting the cuts and sharing them back out gets a bit overloaded.
> Let me know if you think this is feasible (or plain wrong :-)),
Sounds fantastic!
> point us in the right direction, and maybe we can contribute with some of this.
A basic outline would be:
Step 1. Define a new subtype of SDDPSolveType, just like https://github.com/odow/SDDP.jl/blob/4b65882ca8f63b38288deba781026b8f81632632/src/solve_asynchronous.jl#L30-L33
Step 2. Overload the solve method using your new parallel type, just like https://github.com/odow/SDDP.jl/blob/4b65882ca8f63b38288deba781026b8f81632632/src/solve_asynchronous.jl#L93 (a rough sketch of both steps follows below).
Write a new solve method using liberal copy and paste until you get something working. Then we can clean it up and integrate it more cleanly.
You should be able to pull in much of the existing machinery. In fact, you should be able to take most of the serial code, and modify this function: https://github.com/odow/SDDP.jl/blob/4b65882ca8f63b38288deba781026b8f81632632/src/defaultvaluefunction.jl#L141-L165
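A minimal sketch of what Steps 1 and 2 might look like. It uses stand-in definitions so the snippet runs on its own; inside SDDP.jl you would subtype the real SDDPSolveType and overload the package's actual solve method, copying its signature from the Asynchronous method linked above. The type name NoiseParallel and everything else here are hypothetical.

```julia
# Stand-in for SDDP.SDDPSolveType; in the package you would use the real one.
abstract type SDDPSolveType end

# Step 1: a new solve type, analogous to Asynchronous.
struct NoiseParallel <: SDDPSolveType
    slaves::Vector{Int}   # worker ids used for the per-noise subproblem solves
end

# Step 2: a solve method dispatching on the new type. In the package this
# would start as a copy of the serial solve, with the backward-pass loop over
# noise realizations distributed across `parallel.slaves`.
function solve(parallel::NoiseParallel, model)
    println("Solving with ", length(parallel.slaves), " workers per backward pass.")
end

solve(NoiseParallel([2, 3, 4, 5]), nothing)
```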
Some things to consider:
It would be super great if you wanted to attack this. If you need any help, just develop everything on a branch, then submit a PR and we can discuss it there!
> The main issue with the asynchronous version is that with lots of cores, you end up doing lots of iterations with poor approximations of the value function. (Consider the first iteration: each core does an iteration with no value information, so you discover 72 useless cuts.)
That's exactly why I started looking into this. In fact, why do you use step < 1 then? It seems to me that step = 1 would lead to a first pass, then add another slave and do two parallel passes, then four, and so on exponentially until all cores are in use... or maybe just unleash the full set of cores only after the first pass.
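Something like this hypothetical ramp-up schedule (not part of SDDP.jl, just to spell out what I mean):

```julia
# Start with a single pass, then double the number of parallel forward-backward
# passes after each completed round, capped at the number of available cores.
active_slaves(round, ncores) = min(2^(round - 1), ncores)

[active_slaves(r, 72) for r in 1:8]   # 1, 2, 4, 8, 16, 32, 64, 72
```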
As for the PR, let me look into this a little, since I'm not an expert in Julia, but it would be great if we could help. Do I have to fork the repository to create a branch, or can I just branch locally and then submit the PR in order to publish it?
Closing because I never implemented this, and the above hints are no longer relevant. In addition, the overhead (in my developer time) of maintaining another hard-to-test parallelization scheme probably outweighs the benefits.
Hi Oscar:
We are working together with @rodripor on a hydro-thermal scheduling model. The main difference between our model and your examples is that we have two timescales (hours and days), so every stage has 24 times as many variables, to account for demand and generation throughout the day, together with a daily water balance constraint. The days are the stages of our problem formulation.
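Just to make the structure concrete, here is a minimal sketch of the kind of stage subproblem we have in mind, written against the old SDDP.jl v0.x API (SDDPModel, @state, @rhsnoise, @stageobjective) with made-up data; the numbers and names are purely illustrative.

```julia
using SDDP, JuMP, Clp

demand    = fill(100.0, 24)        # hypothetical hourly demand profile
fuel_cost = 50.0                   # hypothetical thermal generation cost
inflows   = [0.0, 50.0, 100.0]     # hypothetical daily inflow scenarios

m = SDDPModel(stages = 365, sense = :Min, solver = ClpSolver(),
              objective_bound = 0.0) do sp, t
    # One reservoir state per day (stage), 24 hourly decisions per stage.
    @state(sp, 0 <= volume <= 200, volume0 == 100)
    @variables(sp, begin
        hydro[1:24]   >= 0
        thermal[1:24] >= 0
    end)
    # Hourly demand balance.
    @constraint(sp, [h = 1:24], hydro[h] + thermal[h] == demand[h])
    # Daily water balance with a noisy inflow on the right-hand side.
    @rhsnoise(sp, inflow = inflows, volume == volume0 - sum(hydro) + inflow)
    # Pay for thermal generation over the day.
    @stageobjective(sp, fuel_cost * sum(thermal))
end
```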
Today we started to analyze the parallel version of your implementation, with the Asynchronous() solver. I refactored our code to use the asynchronous version (which was extremely easy thanks to your implementation :-)) and ran it on an AWS c5.18xlarge instance (72 cores, 144 GB RAM). However, the performance gains compared to my own computer were not that big, and I would like to understand why and try to hint towards a possible solution.

If I understand your code correctly, the Asynchronous() solver performs several forward-backward passes of the algorithm in parallel and shares the generated cuts among them. However, if I understand the SDDP algorithm correctly, in each stage of each backward pass you have to optimize for every possible noise realization and average the results to get the cut. So, if the number of possible noise outcomes is large, every backward pass has a lot of work to do in each stage, and that work is not parallelized (please correct me if anything I said is wrong!).

In our case, we have 365 stages and approximately 100 noise realizations (it is a bit more complicated due to the Markov states, but never mind). It seems to me that, in our case, it would be better to perform a single backward pass but, within every stage of that pass, parallelize the optimization over the noise realizations. The pass would then be faster, and the cuts would be available for the next iteration.
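To illustrate the idea (this is not SDDP.jl code, just a sketch of the parallelization we have in mind), the backward step at each stage could distribute the noise realizations over workers with pmap and then average the results into a cut; solve_subproblem below is a made-up stand-in for solving one stage subproblem.

```julia
using Distributed
addprocs(4)   # hypothetical: a few worker processes

# Stand-in for solving the stage subproblem under one noise realization ω; in
# the real model this would solve an LP and return its objective and the dual
# of the water-balance constraint.
@everywhere solve_subproblem(state, ω) = (objective = state + ω, dual = ω)

# Backward-pass step for one stage: solve all noise realizations in parallel,
# then form the probability-weighted (expected) cut intercept and slope.
function backward_pass_stage(state, noises, probabilities)
    results   = pmap(ω -> solve_subproblem(state, ω), noises)
    intercept = sum(p * r.objective for (p, r) in zip(probabilities, results))
    slope     = sum(p * r.dual for (p, r) in zip(probabilities, results))
    return intercept, slope
end

noises = collect(1.0:100.0)                       # ~100 noise realizations
probs  = fill(1 / length(noises), length(noises))
backward_pass_stage(10.0, noises, probs)
```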
A more clever thing to do (daydreaming now) would be to split the available cores: if you have n noise realizations and M > n cores, then parallelize the n rhsnoise optimizations within each stage and also run M/n forward-backward passes in parallel, with n cores devoted to each one.
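For example (purely hypothetical, just to show the bookkeeping), with M workers and n noise realizations you could split the worker ids into M ÷ n groups of n, one group per parallel forward-backward pass:

```julia
# Split the worker ids into groups of n; leftover workers are simply unused.
function partition_workers(worker_ids::Vector{Int}, n::Int)
    ngroups = length(worker_ids) ÷ n
    return [worker_ids[(g - 1) * n + 1 : g * n] for g in 1:ngroups]
end

partition_workers(collect(2:73), 8)   # 72 workers → 9 groups of 8
```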
Let me know if you think this is feasible (or plain wrong :-)), and point us in the right direction, and maybe we can contribute with some of this.