mpatacchiola / dissecting-reinforcement-learning

Python code, PDFs and resources for the series of posts on Reinforcement Learning which I published on my personal blog
https://mpatacchiola.github.io/blog/
MIT License
609 stars 175 forks

about greedy agent in multi-armed bandit #12

Closed ZichaoHuang closed 5 years ago

ZichaoHuang commented 5 years ago

https://github.com/mpatacchiola/dissecting-reinforcement-learning/blob/master/src/6/multi-armed-bandit/greedy_agent_bandit.py#L51

According to the code above, we first get the max utility so far and find which arm it corresponds to, then we do a np.random.choice. Why do we need np.random.choice? Shouldn't the greedy agent simply pick the argmax arm?

mpatacchiola commented 5 years ago

Hi @ZichaoHuang

The code looks fine to me. What is going on is a greedy choice among the arms with maximal reward. If multiple arms gave you the same highest reward, how do you decide which one to pick? In this case, ties are broken with a random selection.

reward_counter_array = np.zeros(tot_arms)

This array stores the cumulative reward collected for each arm.

def return_greedy_action(reward_counter_array):
    """Return an action using a greedy strategy
    @return the action selected
    """
    amax = np.amax(reward_counter_array)
    indices = np.where(reward_counter_array == amax)[0]
    action = np.random.choice(indices)
    return action

This function takes as input the reward counter. First it looks for the maximum value with amax = np.amax(reward_counter_array). Then it retrieves the indices where that value is located with indices = np.where(reward_counter_array == amax)[0] (there may be multiple arms with this value). Finally it picks one of these maximal arms at random using action = np.random.choice(indices). Be careful here: the array indices is not equal to np.arange(tot_arms), it is just a subset of all possible arms.

With this trick we avoid always picking the same arm (the first in the array) when multiple arms are equally good choices for the greedy selection.
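To see the difference, here is a small self-contained sketch (reusing the function above, with an illustrative reward array) that contrasts np.argmax, which always returns the first maximal arm, with the random tie-breaking version:

```python
import numpy as np

def return_greedy_action(reward_counter_array):
    """Return an action using a greedy strategy, breaking ties randomly.
    @return the action selected
    """
    amax = np.amax(reward_counter_array)
    indices = np.where(reward_counter_array == amax)[0]
    return np.random.choice(indices)

# Arms 0 and 2 tie for the maximal cumulative reward.
rewards = np.array([5.0, 2.0, 5.0, 1.0])

# np.argmax always returns the first maximal arm.
print(np.argmax(rewards))  # always 0

# The tie-breaking version picks arm 0 or arm 2 uniformly at random,
# so over many calls both maximal arms get explored.
counts = {0: 0, 2: 0}
for _ in range(1000):
    counts[return_greedy_action(rewards)] += 1
print(counts)  # roughly 500 each
```

Over 1000 calls both maximal arms are selected about half the time each, while np.argmax would have returned arm 0 on every call.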

ZichaoHuang commented 5 years ago

Thanks for your great explanation!