Closed — Karryanna closed this issue 5 years ago
Definitely the reward is also stochastic (that is why r is one of the arguments of the probability transition function p in the definition of an MDP). The first lecture and the assignments from the first practicals are an example of this, because there are no states (i.e., a single unchanging state) and the same action yields different rewards. In fact, stochastic rewards are the only issue one has to solve in the multi-armed bandit problem.
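As a tiny illustration of this point (the arm means below are made-up values, not from any assignment): pulling the same arm repeatedly yields different rewards, even though there is only one state.

```python
import random

random.seed(42)

# Hypothetical true mean rewards of three bandit arms.
ARM_MEANS = [0.2, 0.5, 0.8]

def pull(arm):
    """Reward for pulling `arm`: stochastic even though the action is fixed."""
    return ARM_MEANS[arm] + random.gauss(0.0, 1.0)

# The same action produces five different rewards.
rewards = [pull(1) for _ in range(5)]
print(rewards)
print(len(set(rewards)) > 1)  # True: rewards differ for the same action
```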
All other assignments use deterministic rewards for the goal state, because in practice it is often the case that the rewards are deterministic for the state-action-state' triple.
Do you have an idea where to emphasize the issue?
Maybe I would start by saying that the multi-armed bandit is an instance of an MDP O:) I am sorry if you said that and I missed it (or if you expected it to be clear and I wasn't smart enough), but I did not realize it until reading your comment. To me, it felt like there is the multi-armed bandit, which is an interesting problem for reinforcement learning, and then there are MDPs, which are an even more interesting family of problems.
Perhaps adding a slide after the definition of an MDP, showing what S, A, p and gamma are for the multi-armed bandit, could help. Or add a note wherever the sum over all possible outcomes ((next state, reward) pairs) is employed, saying that it cannot always be evaluated, e.g. in the multi-armed bandit problem?
I added a slide after the MDP definition showing how to cast multi-armed bandits as an MDP. Additionally, I also mention contextual bandits, where states are introduced in such a way that the reward distribution depends on the current state (and state transitions are independent of rewards).
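For the record, a minimal sketch of the casting (the Bernoulli arm probabilities are invented for illustration): a K-armed bandit becomes an MDP with S = {0} (a single state), A = {0, ..., K-1}, gamma arbitrary, and a transition kernel p(s', r | s, a) that puts all next-state mass on the single state while the reward distribution depends only on the chosen arm.

```python
# Hypothetical success probabilities of two Bernoulli arms.
ARM_PROBS = [0.3, 0.6]

def p(s_next, r, s, a):
    """Transition kernel p(s', r | s, a) of the bandit cast as an MDP."""
    if s != 0 or s_next != 0:
        return 0.0                  # only a single state exists
    if r == 1:
        return ARM_PROBS[a]         # success reward of arm a
    if r == 0:
        return 1.0 - ARM_PROBS[a]   # failure reward of arm a
    return 0.0

# Sanity check: for each action, the mass over (s', r) pairs sums to one.
for a in range(len(ARM_PROBS)):
    total = sum(p(0, r, 0, a) for r in (0, 1))
    print(a, total)  # each total is 1.0
```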
It feels that the examples of MDPs used in the lectures and assignments encourage the idea that while an action may lead to many states, the old state - action - new state triplet always produces the same reward, which (given that I understand it all correctly) is incorrect. Maybe it would be worth making this possibility more explicit?
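To make the possibility concrete, here is a toy sketch (all numbers invented) of an MDP where a fixed (state, action, next state) triple still has a stochastic reward: p(s'=0, r | s=0, a=0) splits its mass between two reward values, so the triple is deterministic in s' but not in r.

```python
import random

random.seed(0)

# Reward distribution p(r | s=0, a=0, s'=0) for the single triple;
# hypothetical values: +1 and -1 with equal probability.
REWARD_DIST = {+1.0: 0.5, -1.0: 0.5}

def step():
    """The same (s, a) always reaches s'=0, but the reward is random."""
    r = random.choices(list(REWARD_DIST), weights=list(REWARD_DIST.values()))[0]
    return 0, r

# The same triple (0, a, 0) yields both reward values over many steps.
rewards = {step()[1] for _ in range(100)}
print(rewards)

# The expected reward r(s, a, s') of the triple is still well defined.
expected_r = sum(r * p for r, p in REWARD_DIST.items())
print(expected_r)  # 0.0
```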