Exploitation is the right thing to do to maximize the expected reward on the one step, but
exploration may produce the greater total reward in the long run.
Reward is lower in the short run, during exploration, but higher in the long run because after you have discovered the better actions, you can exploit them many times. Because it is not possible both to explore and to exploit with any single action selection, one often refers to the “conflict” between exploration and exploitation.
The need to balance exploration and exploitation is a distinctive challenge that arises in reinforcement learning.
The simplest action selection rule is to select the action (or one of the actions) with highest estimated action value, that is, to select at step t one of the greedy actions. This greedy action selection method can be written as

At = argmax_a Qt(a),

where argmax_a denotes the action a at which the expression that follows is maximized (with ties broken arbitrarily).
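As a minimal sketch of greedy action selection (the function name and the use of NumPy are my own choices, not from the text), one can take the argmax over the current value estimates, breaking ties among maximal actions at random:

```python
import numpy as np

def greedy_action(q_values, rng=None):
    """Select an action with the highest estimated value Qt(a),
    breaking ties among maximal actions at random."""
    rng = rng or np.random.default_rng()
    q_values = np.asarray(q_values)
    # Indices of all actions tied for the maximum estimate.
    best = np.flatnonzero(q_values == q_values.max())
    return int(rng.choice(best))

# Action 2 has the highest estimate, so it is always selected.
print(greedy_action([0.1, 0.5, 0.9, 0.2]))
```

Random tie-breaking matters early on, when many estimates are equal (for example, all initialized to zero) and a plain argmax would always favor the lowest-indexed action.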
A simple alternative is to behave greedily most of the time, but every once in a while, say with small probability ε, instead select randomly from among all the actions with equal probability, independently of the action-value estimates.
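This near-greedy rule can be sketched as follows (again a hypothetical helper, not code from the text): with probability ε a uniformly random action is taken, and otherwise the greedy action is taken.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=None):
    """With probability epsilon, select uniformly at random over all actions;
    otherwise select greedily with respect to the current estimates."""
    rng = rng or np.random.default_rng()
    q_values = np.asarray(q_values)
    if rng.random() < epsilon:
        # Explore: ignore the estimates entirely.
        return int(rng.integers(len(q_values)))
    # Exploit: greedy choice, ties broken at random.
    best = np.flatnonzero(q_values == q_values.max())
    return int(rng.choice(best))
```

Note that exploration here is completely undirected: even the action currently estimated to be best can be chosen during an exploration step, and every action continues to be sampled in the long run, so the estimates Qt(a) keep improving for all actions.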