rlcode / reinforcement-learning

Minimal and Clean Reinforcement Learning Examples

Failing to converge with increase in grid-size (Grid World) #48

Open akileshbadrinaaraayanan opened 7 years ago

akileshbadrinaaraayanan commented 7 years ago

If I increase both the HEIGHT and WIDTH from 5 to 10, keeping the obstacles and the final goal at the same positions, the Deep SARSA network doesn't seem to converge. What do you think is the problem? Should I increase the depth or the dimensions of the hidden layers in the actor and critic networks?

Thanks, Akilesh

akileshbadrinaaraayanan commented 7 years ago

Hi,

I was running experiments with an increased grid size, and in some cases the action probability values become so skewed that one particular value is almost one and the rest are very small (on the order of 10^-20). This leads to zero cross-entropy loss, and the agent basically gets stuck (say it's at the top of the grid and the action probability for UP is close to 1).

Any suggestions on how to overcome such situations?

keon commented 7 years ago

@Hyeokreal might be able to answer that for ya

dnddnjs commented 7 years ago

Which algorithm are you using right now on the increased grid world?

akileshbadrinaaraayanan commented 7 years ago

I am using A2C.

Cross entropy becomes zero because: say action_prob = [p1, p2], where p1 is on the order of 10^-20 (close to 0) and p2 is close to 1, and advantages = [0, advantage]. The cross-entropy calculation then becomes sum(log(action_prob) * advantages) = log(p2) * advantage ≈ log(1) * advantage = 0.
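
A tiny NumPy sketch of the numbers described above (the values are made up to reproduce the situation, not taken from an actual run):

```python
import numpy as np

# Hypothetical values reproducing the skewed-policy case described above.
action_prob = np.array([1e-20, 1.0 - 1e-20])  # heavily skewed actor output
advantages = np.array([0.0, -0.5])            # advantage only on the taken action

# Policy-gradient "cross entropy": sum(log(pi(a|s)) * advantage)
loss = np.sum(np.log(action_prob + 1e-10) * advantages)
print(loss)  # ~ log(1) * (-0.5) ~ 0, so almost no gradient signal reaches the actor
```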

I am not doing any random exploration, just picking actions based on the output of the actor network. The discount factor is 0.99. The advantage estimate becomes negative in this case.

dnddnjs commented 7 years ago

There is an exploration problem when using policy gradients. In DQN, the agent can still explore with probability epsilon even after convergence. In actor-critic, once the actor network has converged, it is hard for the agent to explore.
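
A quick way to see the difference (a toy NumPy sketch with made-up numbers, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
policy = np.array([1e-20, 1.0 - 1e-20])  # near-deterministic converged actor output

# Sampling from the converged policy: the same action comes out every time.
print({int(rng.choice(2, p=policy)) for _ in range(1000)})  # {1}

# DQN-style epsilon-greedy keeps exploring even after convergence.
epsilon = 0.1
def eps_greedy(greedy_action, n_actions=2):
    return int(rng.integers(n_actions)) if rng.random() < epsilon else greedy_action
print({eps_greedy(1) for _ in range(1000)})  # {0, 1}
```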

I think there are two options. One is simply training with a lower learning rate. The other is adding the entropy of the policy to the loss function of the actor network. If you look at the A3C agent, you will see there is an entropy term in its loss function.
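
Roughly, the idea looks like this (a rough NumPy sketch of an entropy-regularized actor loss; the function name, the beta coefficient, and the exact form are illustrative, not copied from the repo's A3C agent):

```python
import numpy as np

def actor_loss(policy, action_onehot, advantage, beta=0.01, eps=1e-10):
    """Policy-gradient loss with an entropy bonus (illustrative sketch)."""
    # standard policy-gradient term: -log pi(a|s) * advantage for the taken action
    log_prob = np.log(np.sum(policy * action_onehot, axis=1) + eps)
    pg_loss = -np.sum(log_prob * advantage)
    # policy entropy; subtracting it rewards keeping the policy stochastic
    entropy = -np.sum(policy * np.log(policy + eps), axis=1)
    return pg_loss - beta * np.sum(entropy)

skewed  = np.array([[1e-20, 1.0]])   # converged, nearly deterministic policy
uniform = np.array([[0.5, 0.5]])     # exploratory policy
action  = np.array([[0.0, 1.0]])
adv     = np.array([-0.5])

print(actor_loss(skewed, action, adv))   # ~0: entropy of the deterministic policy is ~0
print(actor_loss(uniform, action, adv))  # entropy term subtracts beta*log(2), rewarding a stochastic policy
```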