pkel / cpr

consensus protocol research

RL progress #5

Closed pkel closed 2 years ago

pkel commented 2 years ago

This PR replaces the reward calculation in the gym engine. Previously, we looked for the common chain on each step; if the common chain had grown since the last step, we calculated rewards for the new common blocks. On the last step of the episode, i.e., when setting done = True, we instead calculated the rewards from the last common block to the tip of the longest chain. Changing the mechanics of the game on the last step might confuse the RL algorithm (it makes the model non-Markovian). It is also needlessly complicated.
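
For contrast, here is a minimal sketch of the old scheme. `Block`, `rewards_between`, and the common-chain arguments are illustrative names, not the engine's actual API; we assume `ancestor` always lies on the path from `tip` to the root:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Block:
    reward: float
    parent: Optional["Block"] = None


def rewards_between(ancestor: Optional[Block], tip: Block) -> float:
    """Sum rewards on the path from tip down to (and excluding) ancestor."""
    total = 0.0
    block: Optional[Block] = tip
    while block is not ancestor:
        total += block.reward
        block = block.parent
    return total


def old_step_reward(last_common: Block, common: Block, tip: Block, done: bool) -> float:
    if done:
        # final step: pay from the last common block to the tip of the
        # longest chain -- a different rule than on every other step
        return rewards_between(last_common, tip)
    # regular step: pay only the blocks newly added to the common chain
    return rewards_between(last_common, common)
```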

The new scheme calculates rewards from the root to the tip of the longest chain on each step. Internally, we remember the total reward of the previous step and hand out the delta as the step's reward. The accumulated rewards of an episode thus reflect the rewards as if the episode ended now, and the agent gets immediate feedback whenever the longest chain changes.
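
And a sketch of the new delta scheme, reusing the hypothetical `Block` and `rewards_between` from the sketch above:

```python
def chain_reward(tip: Block) -> float:
    """Total reward from the root to the tip."""
    return rewards_between(None, tip)


class DeltaRewardTracker:
    """Per-step reward = change in cumulative reward on the longest chain."""

    def __init__(self) -> None:
        self.last_total = 0.0

    def step(self, tip: Block) -> float:
        total = chain_reward(tip)  # full walk from the root on every step
        delta = total - self.last_total
        self.last_total = total
        return delta


# A reorg yields immediate (possibly negative) feedback:
root = Block(reward=0.0)
a = Block(reward=2.0, parent=root)
b = Block(reward=1.0, parent=root)
b2 = Block(reward=0.5, parent=b)  # fork overtakes: root->b->b2 is now longest

tracker = DeltaRewardTracker()
assert tracker.step(a) == 2.0    # longest chain root->a pays 2.0
assert tracker.step(b2) == -0.5  # total on root->b->b2 is 1.5; delta is -0.5
```

With the delta scheme, the sum of step rewards always equals the reward on the current longest chain, so ending the episode needs no special case.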

A drawback of the new method is that it is quadratic in the length of the longest chain: each step walks the full chain from the root, so an episode of n steps costs O(n²) overall. This might be a problem when learning on long episodes. If so, we will resolve this later.

The PR also includes several minor changes to the RL training script.