Investigated whether the evaluation reward¹ stays (more or less) constant (a sketch of the setup follows the list):
1) when training the same agent multiple times on 1 episode
2) when training the same agent multiple times on 1 episode without shuffling the sequences
3) when training the same agent multiple times on 1 episode and applying an exploration rate of 0.02
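A minimal, self-contained sketch of this kind of experiment loop; all function bodies are hypothetical stand-ins, not the repository's actual training or evaluation code:

```python
# Sketch: train the same agent several times on one episode and check
# how much the evaluation reward varies across runs. train_agent and
# evaluate are placeholders for the project's actual functions.
import numpy as np


def train_agent(shuffle: bool = True, exploration_rate: float = 0.0):
    """Placeholder for training the agent on a single fixed episode."""
    rng = np.random.default_rng()  # fresh randomness per training run
    return rng.normal()            # stands in for the learned policy


def evaluate(agent) -> float:
    """Placeholder for the separate evaluation framework (see footnote 1)."""
    return float(agent)  # stands in for the evaluation reward


eval_rewards = [evaluate(train_agent()) for _ in range(10)]
print(f"mean={np.mean(eval_rewards):.4f}, std={np.std(eval_rewards):.4f}")
```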
1) No, the evaluation reward fluctuates a lot, although the train reward stays more or less constant. I.e., the fluctuation is most likely not due to a local minimum.
Relevant experiment id: Exp: Constant Reward 1 Episode
2) Strangely, the agent isn't able to learn anything.
Relevant experiment id: Exp: Constant Reward 1 Episode, no shuffle
3) As in 1), the evaluation reward still fluctuates a lot. However, the resulting probabilities are lower compared to an agent trained without exploration (compare the eval_probability_stats.csv of the runs in the experiment id of 1) with the ones in the experiment id of 3); a comparison sketch follows the footnote).
Relevant experiment id: Exp: Constant Reward 1 Episode, exploration on
¹ Evaluation reward refers to the reward from the evaluation framework and not from the environment the agent is trained with.
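A sketch of how the probability statistics of two runs could be put side by side; the directory layout and the `probability` column name are assumptions, only the file name `eval_probability_stats.csv` comes from this issue:

```python
# Sketch: compare eval_probability_stats.csv across two experiments.
# Paths and the "probability" column are hypothetical.
import pandas as pd

no_expl = pd.read_csv("runs/no_exploration/eval_probability_stats.csv")
with_expl = pd.read_csv("runs/exploration_0_02/eval_probability_stats.csv")

# Summary statistics of both runs, side by side.
summary = pd.DataFrame({
    "no_exploration": no_expl["probability"].describe(),
    "exploration_0.02": with_expl["probability"].describe(),
})
print(summary)
```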
Examined on experiment 1).
There seems to be no obvious difference between the two (comparison figures omitted). We might therefore conclude that the fluctuations are due to some learned policies being more beneficial to the evaluation reward than others.
Based on this, the following steps are proposed:
- [ ] Can we actually shuffle the sequences like in https://github.com/lucasfbn/Trendstuff/blob/94beac98c2d8f3c0660b72e9080069eefc15dc29/rl/env.py#L120-L121? (A generic sketch of the idea follows the update below.)
Update 26.08.21: This makes no sense; I don't know why I did this. Fixed in the current version.
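For context, a generic illustration of what per-reset shuffling of an episode's sequences could look like; this is purely illustrative of the question above and is not the code from the linked env.py:

```python
# Toy stand-in for an environment that serves fixed sequences one by one.
import random


class SequenceEnv:
    def __init__(self, sequences, shuffle=True):
        self.sequences = list(sequences)
        self.shuffle = shuffle
        self._idx = 0

    def reset(self):
        # Re-shuffle the presentation order, so each pass over the same
        # episode sees the sequences in a different order.
        if self.shuffle:
            random.shuffle(self.sequences)
        self._idx = 0
        return self.sequences[self._idx]

    def step(self, action):
        # Advance to the next sequence; done when all have been seen.
        self._idx += 1
        done = self._idx >= len(self.sequences)
        obs = self.sequences[self._idx] if not done else None
        reward = 0.0  # reward logic omitted in this sketch
        return obs, reward, done, {}
```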
Moved to #92.
Combined issue for #81 and #72