In any learning scenario, such as learning to drive a car, we are acutely aware of how our environment responds to what we do, and we seek to influence what happens through our behavior.
Learning by interaction is the core idea underlying almost all theories of learning and intelligence.
Reinforcement learning is about mapping situations to actions in order to maximize a numerical reward signal. A central challenge is the trade-off between exploration and exploitation.
In short, RL is a framework for solving control tasks by building agents that learn from the environment by interacting with it through trial and error, receiving rewards as the only feedback.
As the main objective of the RL agent is to maximize cumulative reward, the agent must prefer actions that it has tried in the past and found to be effective in producing reward.
But to discover such actions, it has to try actions that it has not selected before.
Basically, the agent has to exploit existing information (what it already knows) in order to maximize rewards, but it also needs to explore in order to discover actions that might produce higher rewards.
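One common way to balance this trade-off is the epsilon-greedy strategy: exploit the best-known action most of the time, but explore a random action with small probability. A minimal sketch (the function names and the incremental-average update rule here are illustrative, not from the text above):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Explore a random action with probability epsilon,
    otherwise exploit the action with the highest estimated value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def update(q_values, counts, action, reward):
    """Incrementally update the running average value estimate
    for the chosen action after observing a reward."""
    counts[action] += 1
    q_values[action] += (reward - q_values[action]) / counts[action]
```

With epsilon set to 0 the agent is purely greedy (pure exploitation); with epsilon set to 1 it acts uniformly at random (pure exploration).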
A policy defines the behavior of the agent as it tries to maximize cumulative reward. Basically, it tells the agent which action to take in a particular state. A policy can be deterministic, where a given state always maps to the same action, or stochastic, where it gives a probability distribution over actions for a given state.
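The two kinds of policy can be sketched with a toy two-state, two-action setup (the states and actions here are hypothetical, purely for illustration):

```python
import random

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {"s0": "left", "s1": "right"}

# Stochastic policy: each state maps to a distribution over actions.
stochastic_policy = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.1, "right": 0.9},
}

def act_deterministic(state):
    return deterministic_policy[state]

def act_stochastic(state):
    dist = stochastic_policy[state]
    actions, probs = zip(*dist.items())
    return random.choices(actions, weights=probs)[0]
```

`act_deterministic("s0")` always returns `"left"`, while `act_stochastic("s0")` returns `"left"` only about 80% of the time.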
The reward signal defines the goal in RL. The agent's sole purpose is to maximize the cumulative reward it receives over the long run.
The value of a state is the expected discounted return the agent can obtain if it starts from that state. Whereas rewards determine the immediate, intrinsic desirability of environmental states, values indicate the long-term desirability of states after taking into account the states that are likely to follow and the rewards available in those states.
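The discounted return behind this definition can be written as G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..., which is easy to compute by working backward over a reward sequence (a minimal sketch; the function name and discount value are illustrative):

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
    by accumulating from the last reward backward."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, with rewards [1, 0, 1] and gamma = 0.5 the return is 1 + 0.5*0 + 0.25*1 = 1.25. A smaller gamma makes the agent care mostly about immediate rewards; gamma near 1 weighs distant rewards almost as heavily as immediate ones.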
A model of the environment mimics the behavior of the environment. More generally, it allows inferences to be made about how the environment will behave. Methods for solving reinforcement learning problems that use models and planning are called model-based methods, as opposed to simpler model-free methods that are explicitly trial-and-error learners.
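The distinction can be made concrete with a tabular model that predicts the next state and reward for each state-action pair, and a one-step planning routine that consults the model instead of the real environment (the model contents and state names below are hypothetical):

```python
# Hypothetical tabular model: (state, action) -> (next_state, reward).
model = {
    ("s0", "left"):  ("s0", 0.0),
    ("s0", "right"): ("s1", 1.0),
    ("s1", "left"):  ("s0", 0.0),
    ("s1", "right"): ("s1", 1.0),
}

def plan_one_step(state, value, gamma=0.9):
    """Pick the action with the best predicted one-step outcome,
    using the model's predictions rather than real interaction."""
    best_action, best_score = None, float("-inf")
    for (s, a), (s_next, r) in model.items():
        if s != state:
            continue
        score = r + gamma * value[s_next]  # predicted reward + discounted value
        if score > best_score:
            best_action, best_score = a, score
    return best_action
```

A model-free learner would have to actually try both actions in `s0` many times to discover that `"right"` is better; the model-based planner infers it directly from the model's predictions.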
Learning