Everything you said is correct :)
I chose the asynchronous version because it was easy to implement. You can easily think up much more ambitious parallelization schemes (as you have done).
The main issue with the asynchronous version is that with lots of cores, you end up doing lots of iterations with poor approximations of the value function. (Consider the first iteration: each core does an iteration with no value information, so you discover 72 useless cuts.) The second issue is that the master process collecting the cuts and sharing them back out gets a bit overloaded.
> Let me know if you think this is feasible (or plain wrong :-)),
Sounds fantastic!
> point us in the right direction, and maybe we can contribute with some of this.
A basic outline would be:
Step 1. Define a new subtype of SDDPSolveType, just like https://github.com/odow/SDDP.jl/blob/4b65882ca8f63b38288deba781026b8f81632632/src/solve_asynchronous.jl#L30-L33
Step 2. Overload the solve method using your new parallel type, just like https://github.com/odow/SDDP.jl/blob/4b65882ca8f63b38288deba781026b8f81632632/src/solve_asynchronous.jl#L93 (a rough sketch of both steps follows below).
Write a new solve method using liberal copy and paste until you get something working. Then we can clean it up and integrate it more cleanly.
You should be able to pull in much of the existing machinery. In fact, you should be able to take most of the serial code, and modify this function: https://github.com/odow/SDDP.jl/blob/4b65882ca8f63b38288deba781026b8f81632632/src/defaultvaluefunction.jl#L141-L165
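A minimal sketch of what Steps 1 and 2 might look like. It uses stand-in definitions so the snippet runs on its own; inside SDDP.jl you would subtype the real SDDPSolveType and overload the package's actual solve method, copying its signature from the Asynchronous method linked above. The type name NoiseParallel and everything else here are hypothetical.

```julia
# Stand-in for SDDP.SDDPSolveType; in the package you would use the real one.
abstract type SDDPSolveType end

# Step 1: a new solve type, analogous to Asynchronous.
struct NoiseParallel <: SDDPSolveType
    slaves::Vector{Int}   # worker ids used for the per-noise subproblem solves
end

# Step 2: a solve method dispatching on the new type. In the package this
# would start as a copy of the serial solve, with the backward-pass loop over
# noise realizations distributed across `parallel.slaves`.
function solve(parallel::NoiseParallel, model)
    println("Solving with ", length(parallel.slaves), " workers per backward pass.")
end

solve(NoiseParallel([2, 3, 4, 5]), nothing)
```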
Some things to consider:
It would be super great if you wanted to attack this. If you need any help, just develop everything on a branch, then submit a PR and we can discuss it there!
> The main issue with the asynchronous version is that with lots of cores, you end up doing lots of iterations with poor approximations of the value function. (Consider the first iteration: each core does an iteration with no value information, so you discover 72 useless cuts.)
That's exactly why I started looking into this. In fact, why do you use step < 1 then? It seems to me that step = 1 would lead to a first pass, then add another slave and do two parallel passes, then four, and so on exponentially until all cores are in use... or maybe just unleash the full set of cores only after the first pass.
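Something like this hypothetical ramp-up schedule (not part of SDDP.jl, just to spell out what I mean):

```julia
# Start with a single pass, then double the number of parallel forward-backward
# passes after each completed round, capped at the number of available cores.
active_slaves(round, ncores) = min(2^(round - 1), ncores)

[active_slaves(r, 72) for r in 1:8]   # 1, 2, 4, 8, 16, 32, 64, 72
```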
As for the PR, let me look into this a little, since I'm not an expert in Julia, but it would be great if we could help. Do I have to fork the repository to create a branch, or can I just branch locally and then submit the PR in order to publish it?
Closing because I never implemented this, and the above hints are no longer relevant. In addition, the overhead (in my developer time) of maintaining another hard-to-test parallelization scheme probably outweighs the benefits.
Hi Oscar:
We are working together with @rodripor on a hydro-thermal scheduling model. The main difference between our model and your examples is that we have two timescales (hours and days), so every stage has 24 times as many variables, to account for demand and generation throughout the day, together with a daily water balance constraint. The days are the stages of our problem formulation.
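Just to make the structure concrete, here is a minimal sketch of the kind of stage subproblem we have in mind, written against the old SDDP.jl v0.x API (SDDPModel, @state, @rhsnoise, @stageobjective) with made-up data; the numbers and names are purely illustrative.

```julia
using SDDP, JuMP, Clp

demand    = fill(100.0, 24)        # hypothetical hourly demand profile
fuel_cost = 50.0                   # hypothetical thermal generation cost
inflows   = [0.0, 50.0, 100.0]     # hypothetical daily inflow scenarios

m = SDDPModel(stages = 365, sense = :Min, solver = ClpSolver(),
              objective_bound = 0.0) do sp, t
    # One reservoir state per day (stage), 24 hourly decisions per stage.
    @state(sp, 0 <= volume <= 200, volume0 == 100)
    @variables(sp, begin
        hydro[1:24]   >= 0
        thermal[1:24] >= 0
    end)
    # Hourly demand balance.
    @constraint(sp, [h = 1:24], hydro[h] + thermal[h] == demand[h])
    # Daily water balance with a noisy inflow on the right-hand side.
    @rhsnoise(sp, inflow = inflows, volume == volume0 - sum(hydro) + inflow)
    # Pay for thermal generation over the day.
    @stageobjective(sp, fuel_cost * sum(thermal))
end
```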
Today we started to analyze the parallel version of your implementation, with the Asynchronous() solver. I refactored our code to use the asynchronous version (which was extremely easy thanks to your implementation :-)) and ran it on an AWS c5.18xlarge instance (72 cores, 144 GB RAM). However, the performance gains compared to my own computer were not that big, and I would like to understand why and try to hint towards a possible solution.

If I understand your code correctly, the Asynchronous() solver performs several forward-backward passes of the algorithm in parallel and shares the generated cuts among them. However, if I understand the SDDP algorithm correctly, in each stage of each backward pass you have to optimize for every possible noise realization and average the results to get the cut. So, if the number of possible noise outcomes is large, every backward pass has a lot of work to do in each stage, and that work is not parallelized (please correct me if anything I said is wrong!).

In our case, we have 365 stages and approximately 100 noise realizations (it is a bit more complicated due to the Markov states, but never mind). It seems to me that, in our case, it would be better to perform a single backward pass but, within every stage of that pass, parallelize the optimization over the noise realizations. The pass would then be faster, and the cuts would be available for the next iteration.
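To illustrate the idea (this is not SDDP.jl code, just a sketch of the parallelization we have in mind), the backward step at each stage could distribute the noise realizations over workers with pmap and then average the results into a cut; solve_subproblem below is a made-up stand-in for solving one stage subproblem.

```julia
using Distributed
addprocs(4)   # hypothetical: a few worker processes

# Stand-in for solving the stage subproblem under one noise realization ω; in
# the real model this would solve an LP and return its objective and the dual
# of the water-balance constraint.
@everywhere solve_subproblem(state, ω) = (objective = state + ω, dual = ω)

# Backward-pass step for one stage: solve all noise realizations in parallel,
# then form the probability-weighted (expected) cut intercept and slope.
function backward_pass_stage(state, noises, probabilities)
    results   = pmap(ω -> solve_subproblem(state, ω), noises)
    intercept = sum(p * r.objective for (p, r) in zip(probabilities, results))
    slope     = sum(p * r.dual for (p, r) in zip(probabilities, results))
    return intercept, slope
end

noises = collect(1.0:100.0)                       # ~100 noise realizations
probs  = fill(1 / length(noises), length(noises))
backward_pass_stage(10.0, noises, probs)
```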
A more clever thing to do (daydreaming now) would be to split the available cores: if you have n noise realizations and M > n cores, then parallelize the n rhsnoise optimizations within each stage and also run M/n forward-backward passes in parallel, with n cores devoted to each one.
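For example (purely hypothetical, just to show the bookkeeping), with M workers and n noise realizations you could split the worker ids into M ÷ n groups of n, one group per parallel forward-backward pass:

```julia
# Split the worker ids into groups of n; leftover workers are simply unused.
function partition_workers(worker_ids::Vector{Int}, n::Int)
    ngroups = length(worker_ids) ÷ n
    return [worker_ids[(g - 1) * n + 1 : g * n] for g in 1:ngroups]
end

partition_workers(collect(2:73), 8)   # 72 workers → 9 groups of 8
```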
Let me know if you think this is feasible (or plain wrong :-)), and point us in the right direction, and maybe we can contribute with some of this.