From https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html:

Two important things to keep in mind when creating a custom environment are to avoid breaking the Markov assumption and to properly handle termination due to a timeout (maximum number of steps in an episode). For instance, if there is some time delay between action and observation (e.g. due to wifi communication), you should give a history of observations as input.
Termination due to timeout (max number of steps per episode) needs to be handled separately. You should fill the key in the info dict: info["TimeLimit.truncated"] = True. If you are using the gym TimeLimit wrapper, this will be done automatically. You can read Time Limit in RL or take a look at the RL Tips and Tricks video for more details.
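To make the quoted advice concrete, here is a minimal sketch (old gym API, toy environment invented purely for illustration) of an environment that flags timeout terminations itself; wrapping an unbounded environment in gym.wrappers.TimeLimit achieves the same thing.

```python
import gym
import numpy as np
from gym import spaces

class ToyEnv(gym.Env):
    """Toy environment whose episodes end only because of a step limit."""

    def __init__(self, max_steps=100):
        super().__init__()
        self.observation_space = spaces.Box(0.0, 1.0, shape=(1,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return np.zeros(1, dtype=np.float32)

    def step(self, action):
        self.t += 1
        obs = np.zeros(1, dtype=np.float32)
        reward = float(action)
        done = self.t >= self.max_steps
        info = {}
        if done:
            # The episode ends because of the step limit, not because the task
            # reached a real terminal state, so we flag it as truncated.
            info["TimeLimit.truncated"] = True
        return obs, reward, done, info

# Alternatively, leave the environment unbounded and let gym set the flag:
# env = gym.wrappers.TimeLimit(ToyEnv(max_steps=10**9), max_episode_steps=100)
```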
Delayed actions should not be an issue for us, but termination might be. The linked paper's abstract is instructive:
In reinforcement learning, it is common to let an agent interact for a fixed amount of time with its environment before resetting it and repeating the process in a series of episodes. The task that the agent has to learn can either be to maximize its performance over (i) that fixed period, or (ii) an indefinite period where time limits are only used during training to diversify experience. In this paper, we provide a formal account for how time limits could effectively be handled in each of the two cases and explain why not doing so can cause state aliasing and invalidation of experience replay, leading to suboptimal policies and training instability. In case (i), we argue that the terminations due to time limits are in fact part of the environment, and thus a notion of the remaining time should be included as part of the agent's input to avoid violation of the Markov property. In case (ii), the time limits are not part of the environment and are only used to facilitate learning. We argue that this insight should be incorporated by bootstrapping from the value of the state at the end of each partial episode. For both cases, we illustrate empirically the significance of our considerations in improving the performance and stability of existing reinforcement learning algorithms, showing state-of-the-art results on several control tasks.
In principle, we are in case (ii). Blockchain protocols run forever and so do the attacks; we end the episode only to facilitate training. The paper recommends bootstrapping the start state of the next episode from the end state of the finished episode. In our setting, this would mean that we do not reset the DAG and participant state on episode end; instead, we would only reset the reward calculation, and maybe truncate the DAG to avoid memory leaks. This is certainly feasible but requires some time to implement.
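For intuition, the abstract's case (ii) recommendation amounts to the following change in a one-step value target. This is only a sketch, not code from our training pipeline; one_step_target, value_fn, and the flag layout are made up for illustration:

```python
# Case (ii): at a timeout we keep bootstrapping from the value of the last
# state instead of treating it as terminal. `value_fn` is an assumed
# state-value estimator; `truncated` mirrors the TimeLimit.truncated flag
# quoted above.
def one_step_target(reward, next_obs, done, truncated, value_fn, gamma=0.99):
    if done and not truncated:
        return reward                            # genuine terminal state
    return reward + gamma * value_fn(next_obs)   # timeout or ongoing episode
```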
In the meantime, we can apply the proposed solution for case (i) to our problem: add episode progress (chain progress, chain time, or the number of steps in the episode) to the observation. I've implemented a wrapper for this in 509325293: https://github.com/pkel/cpr/blob/509325293edff87cde566b956417280395583b2e/python/train/ppo.py#L197-L201
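A wrapper along these lines could look roughly like the sketch below. The linked commit is the actual implementation; here I assume a flat Box observation, the old gym step API, and a made-up class name.

```python
import gym
import numpy as np
from gym import spaces

class EpisodeProgressWrapper(gym.Wrapper):
    """Append normalized episode progress (0..1) to the observation."""

    def __init__(self, env, max_steps):
        super().__init__(env)
        self.max_steps = max_steps
        self.t = 0
        # Extend the (assumed flat Box) observation space by one dimension.
        low = np.append(env.observation_space.low, 0.0).astype(np.float32)
        high = np.append(env.observation_space.high, 1.0).astype(np.float32)
        self.observation_space = spaces.Box(low=low, high=high, dtype=np.float32)

    def reset(self, **kwargs):
        self.t = 0
        return self._augment(self.env.reset(**kwargs))

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.t += 1
        return self._augment(obs), reward, done, info

    def _augment(self, obs):
        progress = min(self.t / self.max_steps, 1.0)
        return np.append(obs, progress).astype(np.float32)
```

Whether the progress signal is the step count, chain time, or chain progress is a modelling choice; the sketch uses the step count because it is always available.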
But it's not clear to me whether we indeed violate the Markov property without one of the above fixes. The wind-down of the episode does not do anything special: we just return done = True and restart from scratch at the end of the episode. In the dense wrapper, rewards have already been calculated and reported in the previous step. Maybe I should read the full paper to understand the problem better.