Open makaveli10 opened 1 year ago
Goals and Rewards
Returns
A reward received k time steps in the future is worth only γ^(k−1) times what it would be worth if it were received immediately, where 0 ≤ γ ≤ 1 is the discount rate.

The Markov Property
A state signal has the Markov property if the environment's response at t+1 depends only on the state and action representations at t.

Markov Decision Process
Given any state s and action a, the probability of each possible pair of next state and reward, s' and r, is denoted

p(s', r | s, a) = Pr{S(t+1) = s', R(t+1) = r | S(t) = s, A(t) = a}
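As a concrete illustration, the dynamics function p(s', r | s, a) of a small finite MDP can be stored as a lookup table. The two-state MDP below (states s0/s1, actions stay/go, and all numbers) is a made-up example for the sketch, not anything from these notes:

```python
import random

# Hypothetical two-state, two-action MDP: dynamics[(s, a)] is a list of
# (next_state, reward, probability) triples, i.e. p(s', r | s, a).
dynamics = {
    ("s0", "stay"): [("s0", 0.0, 0.9), ("s1", 1.0, 0.1)],
    ("s0", "go"):   [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
    ("s1", "stay"): [("s1", 2.0, 1.0)],
    ("s1", "go"):   [("s0", 0.0, 1.0)],
}

# For each (s, a) pair the probabilities over (s', r) must sum to 1.
for (s, a), outcomes in dynamics.items():
    total = sum(p for _, _, p in outcomes)
    assert abs(total - 1.0) < 1e-12, (s, a, total)

def step(s, a, rng=random):
    """Sample (next_state, reward) from p(s', r | s, a)."""
    outcomes = dynamics[(s, a)]
    u = rng.random()
    cum = 0.0
    for s_next, reward, p in outcomes:
        cum += p
        if u < cum:
            return s_next, reward
    return outcomes[-1][:2]  # guard against floating-point rounding

print(step("s0", "go"))
```

Because the table fully specifies p(s', r | s, a), everything else about the environment (expected rewards, state-transition probabilities) can be derived from it by summing over the triples.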
These quantities completely specify the dynamics of a finite MDP.

Value Functions
Optimal Value Functions
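Optimal state values can be computed from the dynamics p(s', r | s, a) by value iteration, which repeatedly applies the Bellman optimality backup v(s) ← max_a Σ_{s',r} p(s', r | s, a)[r + γ v(s')]. A minimal sketch, using a made-up two-state MDP (the dynamics table, state/action names, and γ = 0.9 are all assumptions for illustration):

```python
# Value iteration on a made-up two-state MDP: dynamics[(s, a)] lists
# (next_state, reward, probability) triples, i.e. p(s', r | s, a).
dynamics = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "go"):   [("s1", 1.0, 1.0)],
    ("s1", "stay"): [("s1", 2.0, 1.0)],
    ("s1", "go"):   [("s0", 0.0, 1.0)],
}
states = ["s0", "s1"]
actions = ["stay", "go"]
gamma = 0.9  # discount rate

# Repeatedly apply the Bellman optimality backup until values stop changing.
v = {s: 0.0 for s in states}
for _ in range(1000):
    delta = 0.0
    for s in states:
        best = max(
            sum(p * (r + gamma * v[s2]) for s2, r, p in dynamics[(s, a)])
            for a in actions
        )
        delta = max(delta, abs(best - v[s]))
        v[s] = best
    if delta < 1e-10:
        break

# Staying in s1 earns reward 2 forever, so v(s1) = 2 / (1 - 0.9) = 20,
# and the best move from s0 is "go": v(s0) = 1 + 0.9 * 20 = 19.
print(v)
```

The greedy policy with respect to the converged v is an optimal policy: in each state, pick the action achieving the max in the backup.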