hpi-sam / rl-4-self-repair

Reinforcement Learning Models for Online Learning of Self-Repair and Self-Optimization
MIT License

Non-Stationary Environment - Return rewards without replacement #21

Closed christianadriano closed 4 years ago

christianadriano commented 4 years ago

To use the non-stationary data consistently, we need two modifications to the function that returns the reward of an action (these apply only to the non-stationary case; the existing cases remain unchanged).

1- For the non-stationary environment, each utility_increase value returned to the agent should be removed (or marked as used) so that it is not returned again by a later call to the environment.

2- For the non-stationary environment, the utility_increase values have to be consumed in their original order, because they now form a time series (a sketch of both changes follows below).
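
Since the issue does not show the environment's code, here is a minimal Python sketch of the requested behavior under stated assumptions: the class and method names (`NonStationaryRewardSource`, `next_reward`) are hypothetical; only `utility_increase` comes from the issue itself.

```python
# Minimal sketch, not the repository's actual implementation.
# Assumes utility_increase values arrive as an ordered time series
# and must each be consumed exactly once, in order (without replacement).

from collections import deque


class NonStationaryRewardSource:
    """Serves utility_increase values in time-series order, each at most once."""

    def __init__(self, utility_increases):
        # deque preserves the original (temporal) order of the series
        self._pending = deque(utility_increases)

    def next_reward(self):
        """Return the next unused utility_increase, consuming it.

        Raises StopIteration when the series is exhausted; the caller
        (the environment) can decide how to handle that, e.g. end the
        episode or restart the series.
        """
        if not self._pending:
            raise StopIteration("all utility_increase values have been consumed")
        return self._pending.popleft()  # removed, so never returned again


# Hypothetical usage inside the environment's reward function:
source = NonStationaryRewardSource([0.3, 0.1, 0.4, 0.2])
print(source.next_reward())  # 0.3 -- consumed in series order
print(source.next_reward())  # 0.1 -- an earlier value is never repeated
```

Popping from the front of a deque satisfies both requirements at once: removal guarantees no value is returned twice, and the FIFO order preserves the time-series ordering.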

brrrachel commented 4 years ago

Implemented with commit 282d0fb.