Investigated whether the evaluation reward¹ stays (more or less) constant (a sketch of the setup follows the list):
1) when training the same agent multiple times on 1 episode
2) when training the same agent multiple times on 1 episode without shuffling the sequences
3) when training the same agent multiple times on 1 episode and applying an exploration rate of 0.02
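A minimal, self-contained sketch of this kind of experiment loop; all function bodies are hypothetical stand-ins, not the repository's actual training or evaluation code:

```python
# Sketch: train the same agent several times on one episode and check
# how much the evaluation reward varies across runs. train_agent and
# evaluate are placeholders for the project's actual functions.
import numpy as np


def train_agent(shuffle: bool = True, exploration_rate: float = 0.0):
    """Placeholder for training the agent on a single fixed episode."""
    rng = np.random.default_rng()  # fresh randomness per training run
    return rng.normal()            # stands in for the learned policy


def evaluate(agent) -> float:
    """Placeholder for the separate evaluation framework (see footnote 1)."""
    return float(agent)  # stands in for the evaluation reward


eval_rewards = [evaluate(train_agent()) for _ in range(10)]
print(f"mean={np.mean(eval_rewards):.4f}, std={np.std(eval_rewards):.4f}")
```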
1) No, the evaluation reward fluctuates a lot, although the train reward stays more or less constant. I.e., the fluctuation is most likely not due to a local minimum.
Relevant experiment id: Exp: Constant Reward 1 Episode
2) Strangely, the agent isn't able to learn anything.
Relevant experiment id: Exp: Constant Reward 1 Episode, no shuffle
3) As in 1), the evaluation reward still fluctuates a lot. However, the resulting probabilities are lower compared to an agent trained without exploration (compare the eval_probability_stats.csv of the runs in the experiment id of 1) with the ones in the experiment id of 3); a comparison sketch follows the footnote).
Relevant experiment id: Exp: Constant Reward 1 Episode, exploration on
¹ Evaluation reward refers to the reward from the evaluation framework and not from the environment the agent is trained with.
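A sketch of how the probability statistics of two runs could be put side by side; the directory layout and the `probability` column name are assumptions, only the file name `eval_probability_stats.csv` comes from this issue:

```python
# Sketch: compare eval_probability_stats.csv across two experiments.
# Paths and the "probability" column are hypothetical.
import pandas as pd

no_expl = pd.read_csv("runs/no_exploration/eval_probability_stats.csv")
with_expl = pd.read_csv("runs/exploration_0_02/eval_probability_stats.csv")

# Summary statistics of both runs, side by side.
summary = pd.DataFrame({
    "no_exploration": no_expl["probability"].describe(),
    "exploration_0.02": with_expl["probability"].describe(),
})
print(summary)
```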
Examined on experiment 1).
There seems to be no obvious difference between the two (comparison figures omitted). We might therefore conclude that the fluctuations are due to some learned policies being more beneficial to the evaluation reward than others.
Based on this, the following steps are proposed:
- [ ] Can we actually shuffle the sequences like in https://github.com/lucasfbn/Trendstuff/blob/94beac98c2d8f3c0660b72e9080069eefc15dc29/rl/env.py#L120-L121? (A generic sketch of the idea follows the update below.)
Update 26.08.21: This makes no sense; I don't know why I did this. Fixed in the current version.
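For context, a generic illustration of what per-reset shuffling of an episode's sequences could look like; this is purely illustrative of the question above and is not the code from the linked env.py:

```python
# Toy stand-in for an environment that serves fixed sequences one by one.
import random


class SequenceEnv:
    def __init__(self, sequences, shuffle=True):
        self.sequences = list(sequences)
        self.shuffle = shuffle
        self._idx = 0

    def reset(self):
        # Re-shuffle the presentation order, so each pass over the same
        # episode sees the sequences in a different order.
        if self.shuffle:
            random.shuffle(self.sequences)
        self._idx = 0
        return self.sequences[self._idx]

    def step(self, action):
        # Advance to the next sequence; done when all have been seen.
        self._idx += 1
        done = self._idx >= len(self.sequences)
        obs = self.sequences[self._idx] if not done else None
        reward = 0.0  # reward logic omitted in this sketch
        return obs, reward, done, {}
```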
Moved to #92.
Combined issue for #81 and #72