Closed — Karryanna closed this issue 5 years ago
Definitely the reward is also stochastic (that is why r is one of the arguments of the probability transition function p in the definition of an MDP). The first lecture and the assignments from the first practicals are an example of this, because there are no states (i.e., a single unchanging state) and the same action yields different rewards. In fact, stochastic rewards are the only issue one has to solve in the multi-armed bandit problem.
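As a tiny illustration of this point (the arm means below are made-up values, not from any assignment): pulling the same arm repeatedly yields different rewards, even though there is only one state.

```python
import random

random.seed(42)

# Hypothetical true mean rewards of three bandit arms.
ARM_MEANS = [0.2, 0.5, 0.8]

def pull(arm):
    """Reward for pulling `arm`: stochastic even though the action is fixed."""
    return ARM_MEANS[arm] + random.gauss(0.0, 1.0)

# The same action produces five different rewards.
rewards = [pull(1) for _ in range(5)]
print(rewards)
print(len(set(rewards)) > 1)  # True: rewards differ for the same action
```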
All other assignments use deterministic rewards for the goal state, because in practice it is often the case that the rewards are deterministic for the state-action-state' triple.
Do you have an idea where to emphasize the issue?
Maybe I would start by saying that the multi-armed bandit is an instance of an MDP O:) I am sorry if you said that and I missed it (or if you expected it to be clear and I wasn't smart enough), but I did not realize it until reading your comment. To me, it felt like there is the multi-armed bandit, which is an interesting problem for reinforcement learning, and then there are MDPs, which are an even more interesting family of problems.
Perhaps adding a slide after the definition of an MDP, showing what S, A, p and gamma are for the multi-armed bandit, could help. Or add a note wherever the sum over all possible outcomes ((next state, reward) pairs) is employed, saying that it cannot always be evaluated, e.g. in the multi-armed bandit problem?
I added a slide after the MDP definition showing how to cast multi-armed bandits as an MDP. Additionally, I also mention contextual bandits, where states are introduced in such a way that the reward distribution depends on the current state (and state transitions are independent of rewards).
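For the record, a minimal sketch of the casting (the Bernoulli arm probabilities are invented for illustration): a K-armed bandit becomes an MDP with S = {0} (a single state), A = {0, ..., K-1}, gamma arbitrary, and a transition kernel p(s', r | s, a) that puts all next-state mass on the single state while the reward distribution depends only on the chosen arm.

```python
# Hypothetical success probabilities of two Bernoulli arms.
ARM_PROBS = [0.3, 0.6]

def p(s_next, r, s, a):
    """Transition kernel p(s', r | s, a) of the bandit cast as an MDP."""
    if s != 0 or s_next != 0:
        return 0.0                  # only a single state exists
    if r == 1:
        return ARM_PROBS[a]         # success reward of arm a
    if r == 0:
        return 1.0 - ARM_PROBS[a]   # failure reward of arm a
    return 0.0

# Sanity check: for each action, the mass over (s', r) pairs sums to one.
for a in range(len(ARM_PROBS)):
    total = sum(p(0, r, 0, a) for r in (0, 1))
    print(a, total)  # each total is 1.0
```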
It feels that the examples of MDPs used in the lectures and assignments encourage the idea that while an action may lead to many states, the old state - action - new state triplet always produces the same reward, which (given that I understand it all correctly) is incorrect. Maybe it would be worth making this possibility more explicit?
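To make the possibility concrete, here is a toy sketch (all numbers invented) of an MDP where a fixed (state, action, next state) triple still has a stochastic reward: p(s'=0, r | s=0, a=0) splits its mass between two reward values, so the triple is deterministic in s' but not in r.

```python
import random

random.seed(0)

# Reward distribution p(r | s=0, a=0, s'=0) for the single triple;
# hypothetical values: +1 and -1 with equal probability.
REWARD_DIST = {+1.0: 0.5, -1.0: 0.5}

def step():
    """The same (s, a) always reaches s'=0, but the reward is random."""
    r = random.choices(list(REWARD_DIST), weights=list(REWARD_DIST.values()))[0]
    return 0, r

# The same triple (0, a, 0) yields both reward values over many steps.
rewards = {step()[1] for _ in range(100)}
print(rewards)

# The expected reward r(s, a, s') of the triple is still well defined.
expected_r = sum(r * p for r, p in REWARD_DIST.items())
print(expected_r)  # 0.0
```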