Closed by pkel 4 months ago
I implemented a first draft in December 2023. It uses the Bitcoin-only model described at FC '16.
I think the next step is to implement the generic model of https://arxiv.org/abs/2309.11924.
This seems to work now, but I feel no motivation to apply modern RL algorithms to this environment. Implementing it produced some valuable insights, though. Tabular methods will have to suffice. The ideas live on in #49.
My original approach was to
Recently, I came up with a simpler approach. See #46 and https://arxiv.org/abs/2309.11924.
Now the MDPs grow too big and my Python implementation is too slow, which is more or less expected.
In this PR I want to explore a hybrid approach, taking the best ideas from both worlds:
The simulator-based gym has the problem of not being Markovian. It uses an internal (non-observable) progress counter which triggers termination. This might cause problems for RL.
With the new approach, we should obtain an implicit MDP, maybe a POMDP, from step 4. Due to probabilistic termination, the new model is Markovian. It's also much simpler than simulating individual messages, and it's easier to compare to existing results.
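To illustrate the difference between the two termination rules, here is a minimal sketch. It is not the code from this PR: the class names `CounterTermination` and `ProbabilisticTermination`, the `horizon` parameter, and the choice of termination probability `1 / horizon` are assumptions made for the example. With the counter, whether an episode ends depends on hidden state the agent cannot observe. With probabilistic termination, every step ends the episode with the same probability, so the rule is memoryless and the expected episode length is still `horizon`.

```python
import random


class CounterTermination:
    """Hypothetical stand-in for the simulator-based gym:
    a hidden progress counter decides when the episode ends."""

    def __init__(self, horizon):
        self.horizon = horizon
        self.progress = 0  # not part of the observation

    def step(self):
        # Termination depends on how many steps were taken so far,
        # which the agent cannot see: not Markovian in the observation.
        self.progress += 1
        return self.progress >= self.horizon


class ProbabilisticTermination:
    """Hypothetical stand-in for the new model:
    each step terminates with a fixed probability."""

    def __init__(self, horizon, rng):
        self.p = 1.0 / horizon  # assumed: expected episode length = horizon
        self.rng = rng

    def step(self):
        # Termination is independent of the history: memoryless.
        return self.rng.random() < self.p


def episode_length(termination):
    """Count the steps until the termination rule fires."""
    steps = 1
    while not termination.step():
        steps += 1
    return steps


if __name__ == "__main__":
    rng = random.Random(0)
    print(episode_length(CounterTermination(10)))
    lengths = [episode_length(ProbabilisticTermination(10, rng)) for _ in range(20000)]
    print(sum(lengths) / len(lengths))
```

The counter variant always produces episodes of exactly `horizon` steps, while the probabilistic variant produces geometrically distributed lengths with the same mean.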
I'm going for path 5a now, first putting MDP generation aside. Maybe I can keep an eye on it for future code reuse. Simulation is much simpler than exploration.
As this requires rewriting most of #46, I'm starting from scratch. I'm using Rust with pyo3, because I want to learn Rust anyway.