odow / SDDP.jl

A JuMP extension for Stochastic Dual Dynamic Programming
https://sddp.dev

Principled forward model for uniformly converged cost to go #766

Open dannyopts opened 1 month ago

dannyopts commented 1 month ago

Hi,

I am interested in understanding the cost-to-go (CTG) function across a broad range of different states, not just on-chain.

With the default forward pass we end up exploring only the restricted region of the state space that is visited when the system is controlled close to optimally, so the model can perform poorly "off-chain".

For example, I have an energy system with a number of stores, and I want to understand the marginal value of additional storage in different states under the converged model, so uniformly converging the store-state components of the CTG function is important.

One option would be to sample the store state uniformly at random in the forward pass, but I'm curious whether there is a more principled way to select the next set of points to explore (the forward path) in order to maximise the expected learning.

At first I thought about whether Bayesian optimisation or similar could be used to sample from the state space, as in hyperparameter optimisation. However, the big difference here is that the "black box" function we want to minimise is something like a combination of the expected optimal-policy improvement and the expected off-policy CTG error improvement, and it is continually changing.

Once we have visited an off-chain point, it seems reasonable to assume that the expected off-policy performance improvement from visiting that point again will be significantly reduced, since we have corrected the first-order errors there. So a reasonable proxy for the expected off-policy performance improvement would be the time since we last evaluated near that point.

One way to do this would be a modified forward pass that maintains a tabu list of recently visited states and includes the L1 distance from each of these points as a form of regularisation.
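To make the idea concrete, here is a rough sketch in plain Julia (not SDDP.jl API; the function name `tabu_score`, the single `:storage` state component, and the numbers are made up for illustration):

```julia
# Score a candidate state by its minimum L1 distance to the tabu list of
# recently visited states; larger scores mean "further from anything we
# have already explored".
function tabu_score(
    candidate::Dict{Symbol,Float64},
    tabu::Vector{Dict{Symbol,Float64}},
)
    isempty(tabu) && return Inf
    return minimum(
        sum(abs(candidate[k] - visited[k]) for k in keys(candidate))
        for visited in tabu
    )
end

# Toy example: two previously visited store states and three uniformly
# sampled candidates; the middle candidate is the most "novel".
tabu = [Dict(:storage => 0.2), Dict(:storage => 0.8)]
candidates = [Dict(:storage => 0.1), Dict(:storage => 0.5), Dict(:storage => 0.9)]
best = argmax(c -> tabu_score(c, tabu), candidates)  # Dict(:storage => 0.5)
```

In a modified forward pass, a score like this could either be used to pick among several candidate trial points or be added to the stage objective as a penalty.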

Does this make sense? Is there a better way of approaching this problem?

odow commented 1 month ago

You are entirely correct that the forward pass does not need to "optimize" anything and that we can select the trial points somewhat arbitrarily.

I don't have any suggestions for what the right approach is, and I'm not aware of any literature on this, but it shouldn't be too hard to implement as a new forward pass: https://github.com/odow/SDDP.jl/blob/master/src/plugins/forward_passes.jl
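Not an official recipe, but a rough skeleton of what such a plugin might look like, assuming the internal `SDDP.forward_pass(model, options, pass)` interface used by the existing passes in that file (which may change between versions); the type name `TabuForwardPass` and the bookkeeping are hypothetical:

```julia
using SDDP

# Hypothetical forward pass that wraps the default pass and records the
# sampled states, so that a later version could steer trial points away
# from (or penalise proximity to) previously visited states.
struct TabuForwardPass <: SDDP.AbstractForwardPass
    default::SDDP.DefaultForwardPass
    tabu::Vector{Vector{Dict{Symbol,Float64}}}
end

function TabuForwardPass()
    return TabuForwardPass(SDDP.DefaultForwardPass(), Vector{Dict{Symbol,Float64}}[])
end

function SDDP.forward_pass(
    model::SDDP.PolicyGraph{T},
    options::SDDP.Options,
    pass::TabuForwardPass,
) where {T}
    # Run the default forward pass to obtain an on-chain trajectory.
    result = SDDP.forward_pass(model, options, pass.default)
    # Record the visited states; a real implementation would instead modify
    # how the trial states themselves are chosen (e.g. via an L1 penalty).
    push!(pass.tabu, result.sampled_states)
    return result
end

# Usage (assuming `model` is an SDDP.PolicyGraph):
# SDDP.train(model; forward_pass = TabuForwardPass(), iteration_limit = 10)
```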

We do a form of upper confidence bound learning in the duality handlers: https://github.com/odow/SDDP.jl/blob/f5d24b783f3e8993f2c101693b1340aa3ab8a4d8/src/plugins/duality_handlers.jl#L351-L388

We also do a form of the tabu-list/L2-distance idea for cyclic graphs, where we terminate after each cycle: https://github.com/odow/SDDP.jl/blob/f5d24b783f3e8993f2c101693b1340aa3ab8a4d8/src/plugins/forward_passes.jl#L76-L94

The closest thing is Regan's problem-child algorithm, but this selects only the realization of the random variable that leads to the greatest learning: https://optimization-online.org/wp-content/uploads/2018/02/6449.pdf