openai / baselines

OpenAI Baselines: high-quality implementations of reinforcement learning algorithms
MIT License

Regarding the sample() method in A2C #237

Closed aashish-kumar closed 6 years ago

aashish-kumar commented 6 years ago

def sample(logits):
    noise = tf.random_uniform(tf.shape(logits))
    return tf.argmax(logits - tf.log(-tf.log(noise)), 1)

I am not reporting this as a bug. I would like to know the particular reason why the noise is chosen as log(log()). I tried applying a softmax and then sampling from the resulting probability distribution, with no success. I also tried a peakier version, still with no success. My network is learning, but pg_loss keeps growing more negative, and as soon as the entropy falls below 0.9 it diverges again. I suspect that since the advantage from wrongly sampled actions will be high and their action probability is low, their contribution to pg_loss will be higher. Any thoughts?

BaiGang commented 6 years ago

"I would like to know a particular reason why the noise is chosen as log(log())."

It draws uniform noise and transforms it into Gumbel noise via -log(-log(U)); taking the argmax of the logits plus independent Gumbel noise (the Gumbel-max trick, the hard version of Gumbel-Softmax) is equivalent to sampling from the categorical distribution softmax(logits).
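
For reference, here is a small NumPy check (not from baselines) of that claim: if U is uniform on (0, 1), then -log(-log(U)) follows the standard Gumbel distribution, which is exactly the noise that sample() adds to the logits before the argmax.

import numpy as np

# If U ~ Uniform(0, 1), then g = -log(-log(U)) is standard Gumbel noise,
# whose CDF is exp(-exp(-x)). Compare the empirical CDF against that formula.
rng = np.random.default_rng(0)
u = rng.uniform(size=1_000_000)
g = -np.log(-np.log(u))

for x in (-1.0, 0.0, 1.0, 2.0):
    print(x, (g <= x).mean(), np.exp(-np.exp(-x)))  # the two values should nearly agree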


aashish-kumar commented 6 years ago

I have been trying to use A2C to solve a game that gives a single +1 reward for finishing and a -1 reward for failing, so each episode has a total reward of +1 or -1. I have been able to train a policy that converges to a +1 final reward, but as soon as the entropy loss drops to 0.1, the policy starts to go random and eventually converges to a -1 reward. I understand that A2C has problems with this single pass/fail kind of reward, and I assume the network is reaching a saddle point. Is the actor-critic model guaranteed to find the inflection point?
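
For context, A2C's objective already includes an entropy bonus, and raising its coefficient is a common way to delay the kind of entropy collapse described above. The sketch below is illustrative TF1-style code, not the baselines implementation; the tensor names, shapes, and coefficients are made up, and in practice logits and values would come from the policy/value network rather than placeholders.

import tensorflow as tf

n_actions = 6  # illustrative number of discrete actions

# Placeholders standing in for the rollout batch (names are illustrative).
logits = tf.placeholder(tf.float32, [None, n_actions])
actions = tf.placeholder(tf.int32, [None])
advantages = tf.placeholder(tf.float32, [None])
returns = tf.placeholder(tf.float32, [None])
values = tf.placeholder(tf.float32, [None])

# Policy-gradient loss: -log pi(a|s) weighted by the advantage.
neglogpac = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=actions, logits=logits)
pg_loss = tf.reduce_mean(advantages * neglogpac)

# Value-function loss and policy entropy.
vf_loss = tf.reduce_mean(tf.square(values - returns))
probs = tf.nn.softmax(logits)
entropy = tf.reduce_mean(-tf.reduce_sum(probs * tf.log(probs + 1e-8), axis=1))

# A larger ent_coef keeps the policy stochastic for longer.
ent_coef, vf_coef = 0.01, 0.5
loss = pg_loss - ent_coef * entropy + vf_coef * vf_loss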

olegklimov commented 6 years ago

@aashish-kumar I think you need to look for stability here. Larger batch sizes and a smaller step size can help. To check whether the log(log()) sampling is valid, please write a small program that runs it and make sure it produces samples correctly (it does).
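
Something along these lines would do (a rough sketch using TF1-style APIs to match the code above, with made-up example logits): run sample() over a large batch and compare the empirical action frequencies with softmax(logits).

import numpy as np
import tensorflow as tf

def sample(logits):
    noise = tf.random_uniform(tf.shape(logits))
    return tf.argmax(logits - tf.log(-tf.log(noise)), 1)

logits_np = np.array([[1.0, 2.0, 0.5]], dtype=np.float32)   # arbitrary example logits
logits = tf.constant(np.repeat(logits_np, 100000, axis=0))  # 100k i.i.d. draws in one batch
samples = sample(logits)

with tf.Session() as sess:
    s = sess.run(samples)

empirical = np.bincount(s, minlength=logits_np.shape[1]) / len(s)
expected = np.exp(logits_np[0]) / np.exp(logits_np[0]).sum()  # softmax(logits)
print("empirical frequencies:", empirical)  # should closely match the softmax probabilities
print("softmax probabilities:", expected)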