Real-Time Dynamic Programming

While working on #48 I came to the conclusion that modern RL algorithms might be overkill for my type of problem, so I went back to the tabular solving approach kicked off in #46. I came up with a new solving algorithm that is similar to value iteration, except that it

- samples exploration paths from a dynamic environment
- builds the tabular state space on the fly
- does dynamic programming state-value updates in the meantime
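
The three points above can be sketched roughly as follows. This is a minimal illustration under my own assumptions, not the code from this PR: the `actions`/`transitions` callbacks, the epsilon-greedy exploration, and the toy counter model are all made up for the example. The toy model has an unbounded counter as its state, so the state space is infinite and only the visited part ever ends up in the table.

```python
import random


def rtdp(start, actions, transitions, gamma=0.95, episodes=200,
         horizon=50, eps=0.1, seed=0):
    """Sample paths, grow the value table on the fly, back up visited states."""
    rng = random.Random(seed)
    value = {}  # tabular state values; only visited states get an entry

    def q(s, a):
        # Expected one-step return; unseen successors default to 0.
        return sum(p * (r + gamma * value.get(s2, 0.0))
                   for p, s2, r in transitions(s, a))

    for _ in range(episodes):
        s = start
        for _ in range(horizon):
            acts = actions(s)
            if not acts:  # terminal state
                break
            qs = {a: q(s, a) for a in acts}
            value[s] = max(qs.values())  # DP (Bellman) update at the visited state
            # Mostly greedy action choice with a bit of random exploration.
            a = rng.choice(acts) if rng.random() < eps else max(qs, key=qs.get)
            # Sample the next state from the model.
            outcomes = transitions(s, a)
            s = rng.choices([s2 for _, s2, _ in outcomes],
                            weights=[p for p, _, _ in outcomes])[0]
    return value


# Hypothetical toy model: the state is a counter n >= 0 (unbounded).
# "wait" increments n with probability 0.7 and resets it to 0 otherwise;
# "cash" pays n and moves to the terminal state None.
def toy_actions(n):
    return [] if n is None else ["wait", "cash"]


def toy_transitions(n, a):
    if a == "cash":
        return [(1.0, None, float(n))]
    return [(0.7, n + 1, 0.0), (0.3, 0, 0.0)]


table = rtdp(0, toy_actions, toy_transitions)
print(len(table), "states discovered; V(0) is about", round(table[0], 3))
```

Setting `eps=0` makes the action choice purely greedy, in which case exploration depends entirely on which successors the model happens to sample.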

According to Sutton and Barto's book on RL, this falls into the broad category of "Asynchronous Dynamic Programming". After some googling, I think what I've implemented is Real-Time Dynamic Programming (RTDP).

The results seem promising. I can now handle a non-truncated, and hence infinite, state-space instance of the generic DAG model for Nakamoto/Bitcoin.

I was initially hyped about this RTDP thing because

- it does exploration on the fly
- it does not use state approximations
- Barto/Sutton provide proof that it converges to the optimal policy.

After implementing the algorithm I

- observed that it does not converge; instead, it stops exploring new states
- tried to fix it and failed
- noticed that convergence is only guaranteed if all states are visited regularly (maybe visiting all states reachable under the optimal policy would be enough)
- concluded that if all states are visited regularly, I could just as well use traditional dynamic programming, e.g. value iteration.
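
For contrast, here is a minimal sketch of plain value iteration, again under my own assumptions (the `actions`/`transitions` callbacks and the three-state chain are hypothetical, not taken from this repository). Unlike the sampling approach, it needs the full, finite state list up front and sweeps over every state until the values stop changing.

```python
def value_iteration(states, actions, transitions, gamma=0.95, tol=1e-9):
    """Sweep over a fixed, fully enumerated state set until values settle."""
    value = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            acts = actions(s)
            if not acts:  # terminal state keeps value 0
                continue
            # Bellman optimality backup; successors outside `states` count as 0.
            best = max(
                sum(p * (r + gamma * value.get(s2, 0.0))
                    for p, s2, r in transitions(s, a))
                for a in acts
            )
            delta = max(delta, abs(best - value[s]))
            value[s] = best  # in-place update, reused later in the same sweep
        if delta < tol:
            return value


# Hypothetical three-state chain: 0 -> 1 -> 2, reward 1 per step, 2 is terminal.
def chain_actions(s):
    return ["go"] if s < 2 else []


def chain_transitions(s, a):
    return [(1.0, s + 1, 1.0)]


v = value_iteration([0, 1, 2], chain_actions, chain_transitions, gamma=0.5)
# With gamma = 0.5: V(1) = 1 and V(0) = 1 + 0.5 * 1 = 1.5
print(v)
```

Every state is updated in every sweep here, which is exactly the "all states are visited regularly" condition, at the cost of having to enumerate (and therefore truncate) the state space.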

Merging/closing this now, as I'm about to explore a somewhat separate idea which re-uses parts of the tooling.

pkel / cpr

Real-Time Dynamic Programming #49