Open makaveli10 opened 1 year ago
Goals and Rewards
Returns
A reward received k time steps in the future is worth only γ^(k−1) times what it would be worth if it were received immediately, where 0 ≤ γ ≤ 1 is the discount rate.

The Markov Property
A state signal has the Markov property if the environment's response at t+1 depends only on the state and action representations at t.

Markov Decision Process
Given any state s and action a, the probability of each possible pair of next state and reward, s' and r, is denoted

p(s', r | s, a) = Pr{S(t+1) = s', R(t+1) = r | S(t) = s, A(t) = a}
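As a concrete illustration, the dynamics function p(s', r | s, a) of a small finite MDP can be stored as a lookup table. The two-state MDP below (states s0/s1, actions stay/go, and all numbers) is a made-up example for the sketch, not anything from these notes:

```python
import random

# Hypothetical two-state, two-action MDP: dynamics[(s, a)] is a list of
# (next_state, reward, probability) triples, i.e. p(s', r | s, a).
dynamics = {
    ("s0", "stay"): [("s0", 0.0, 0.9), ("s1", 1.0, 0.1)],
    ("s0", "go"):   [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
    ("s1", "stay"): [("s1", 2.0, 1.0)],
    ("s1", "go"):   [("s0", 0.0, 1.0)],
}

# For each (s, a) pair the probabilities over (s', r) must sum to 1.
for (s, a), outcomes in dynamics.items():
    total = sum(p for _, _, p in outcomes)
    assert abs(total - 1.0) < 1e-12, (s, a, total)

def step(s, a, rng=random):
    """Sample (next_state, reward) from p(s', r | s, a)."""
    outcomes = dynamics[(s, a)]
    u = rng.random()
    cum = 0.0
    for s_next, reward, p in outcomes:
        cum += p
        if u < cum:
            return s_next, reward
    return outcomes[-1][:2]  # guard against floating-point rounding

print(step("s0", "go"))
```

Because the table fully specifies p(s', r | s, a), everything else about the environment (expected rewards, state-transition probabilities) can be derived from it by summing over the triples.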
These quantities completely specify the dynamics of a finite MDP.

Value Functions
Optimal Value Functions
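Optimal state values can be computed from the dynamics p(s', r | s, a) by value iteration, which repeatedly applies the Bellman optimality backup v(s) ← max_a Σ_{s',r} p(s', r | s, a)[r + γ v(s')]. A minimal sketch, using a made-up two-state MDP (the dynamics table, state/action names, and γ = 0.9 are all assumptions for illustration):

```python
# Value iteration on a made-up two-state MDP: dynamics[(s, a)] lists
# (next_state, reward, probability) triples, i.e. p(s', r | s, a).
dynamics = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "go"):   [("s1", 1.0, 1.0)],
    ("s1", "stay"): [("s1", 2.0, 1.0)],
    ("s1", "go"):   [("s0", 0.0, 1.0)],
}
states = ["s0", "s1"]
actions = ["stay", "go"]
gamma = 0.9  # discount rate

# Repeatedly apply the Bellman optimality backup until values stop changing.
v = {s: 0.0 for s in states}
for _ in range(1000):
    delta = 0.0
    for s in states:
        best = max(
            sum(p * (r + gamma * v[s2]) for s2, r, p in dynamics[(s, a)])
            for a in actions
        )
        delta = max(delta, abs(best - v[s]))
        v[s] = best
    if delta < 1e-10:
        break

# Staying in s1 earns reward 2 forever, so v(s1) = 2 / (1 - 0.9) = 20,
# and the best move from s0 is "go": v(s0) = 1 + 0.9 * 20 = 19.
print(v)
```

The greedy policy with respect to the converged v is an optimal policy: in each state, pick the action achieving the max in the backup.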