Exploitation is the right thing to do to maximize the expected reward on the one step, but
exploration may produce the greater total reward in the long run.
Reward is lower in the short run, during exploration, but higher in the long run because after you have discovered the better actions, you can exploit them many times. Because it is not possible both to explore and to exploit with any single action selection, one often refers to the “conflict” between exploration and exploitation.
The need to balance exploration and exploitation is a distinctive challenge that arises in reinforcement learning.
The simplest action selection rule is to select the action (or one of the actions) with highest estimated action value, that is, to select at step t one of the greedy actions. This greedy action selection method can be written as

At = argmax_a Qt(a),

where argmax_a denotes the action a at which the expression that follows is maximized (with ties broken arbitrarily).
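As a minimal sketch of greedy action selection (the function name and the use of NumPy are my own choices, not from the text), one can take the argmax over the current value estimates, breaking ties among maximal actions at random:

```python
import numpy as np

def greedy_action(q_values, rng=None):
    """Select an action with the highest estimated value Qt(a),
    breaking ties among maximal actions at random."""
    rng = rng or np.random.default_rng()
    q_values = np.asarray(q_values)
    # Indices of all actions tied for the maximum estimate.
    best = np.flatnonzero(q_values == q_values.max())
    return int(rng.choice(best))

# Action 2 has the highest estimate, so it is always selected.
print(greedy_action([0.1, 0.5, 0.9, 0.2]))
```

Random tie-breaking matters early on, when many estimates are equal (for example, all initialized to zero) and a plain argmax would always favor the lowest-indexed action.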
A simple alternative is to behave greedily most of the time, but every once in a while, say with small probability ε, instead select randomly from among all the actions with equal probability, independently of the action-value estimates.
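This near-greedy rule can be sketched as follows (again a hypothetical helper, not code from the text): with probability ε a uniformly random action is taken, and otherwise the greedy action is taken.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=None):
    """With probability epsilon, select uniformly at random over all actions;
    otherwise select greedily with respect to the current estimates."""
    rng = rng or np.random.default_rng()
    q_values = np.asarray(q_values)
    if rng.random() < epsilon:
        # Explore: ignore the estimates entirely.
        return int(rng.integers(len(q_values)))
    # Exploit: greedy choice, ties broken at random.
    best = np.flatnonzero(q_values == q_values.max())
    return int(rng.choice(best))
```

Note that exploration here is completely undirected: even the action currently estimated to be best can be chosen during an exploration step, and every action continues to be sampled in the long run, so the estimates Qt(a) keep improving for all actions.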