rlcode / reinforcement-learning

Minimal and Clean Reinforcement Learning Examples
MIT License
3.37k stars 728 forks

Implementing policy gradient when number of output classes is large #87

Open hoangcuong2011 opened 5 years ago

hoangcuong2011 commented 5 years ago

Hello,

I am aware of this smart trick for implementing policy gradient (see this for reference: https://github.com/rlcode/reinforcement-learning/blob/master/2-cartpole/3-reinforce/cartpole_reinforce.py). Specifically, categorical cross-entropy is defined as H(p, q) = -sum(p_i log(q_i)). For the action taken, a, we can set p_a = advantage (at the index of action a in the one-hot vector representation), leaving all other entries at zero. Meanwhile, q_a is the output of the policy network, which is the probability of taking action a, i.e. policy(s, a).
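To make the trick concrete, here is a minimal sketch in plain Python (not taken from the linked file). The probabilities, the action index, and the advantage are made-up values, and `weighted_cross_entropy` is a hypothetical helper name. It shows that cross-entropy against an advantage-scaled one-hot target reduces to -advantage * log(policy(s, a)).

```python
import math

def weighted_cross_entropy(target, probs):
    """H(p, q) = -sum_i p_i * log(q_i) for one sample."""
    return -sum(p * math.log(q) for p, q in zip(target, probs))

# Hypothetical values for a single step: 3 actions, action 1 was taken.
probs = [0.2, 0.5, 0.3]   # policy(s, .) as produced by the network
action = 1
advantage = 2.0

# One-hot vector scaled by the advantage: p_a = advantage, zero elsewhere.
target = [0.0] * len(probs)
target[action] = advantage

loss = weighted_cross_entropy(target, probs)

# Only the entry at the action index contributes to the sum.
direct = -advantage * math.log(probs[action])
```

Here `loss` and `direct` agree up to floating-point rounding, because every term with p_i = 0 drops out of the sum.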

However, when the number of output classes is huge (e.g. in machine translation or language modeling), I simply cannot convert the output into a one-hot vector in the first place using the to_categorical(output, num_classes=output_class) function in Keras.

Because of this, I cannot apply the trick to compute p_a.

So how can I implement policy gradient in this case?
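One possible direction, sketched under assumptions rather than taken from the repository: since p is zero everywhere except at the action index, the loss may only need the log-probability at that index, so the one-hot vector would never have to be built. The helper names (`log_softmax_at`, `policy_gradient_loss`), the vocabulary size, and the batch values below are all hypothetical. Plain Python lists stand in for network outputs to keep the example dependency-free.

```python
import math

def log_softmax_at(logits, index):
    """Return log softmax(logits)[index] without building a one-hot vector."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[index] - log_z

def policy_gradient_loss(logits_batch, actions, advantages):
    """Mean of -advantage * log policy(s, a); actions are integer indices."""
    terms = [-adv * log_softmax_at(logits, a)
             for logits, a, adv in zip(logits_batch, actions, advantages)]
    return sum(terms) / len(terms)

# Hypothetical batch of two steps over a large output vocabulary.
vocab_size = 50000
logits_batch = [[0.0] * vocab_size, [0.0] * vocab_size]
logits_batch[1][7] = 1.0      # make one class more likely in the second step
actions = [123, 7]            # integer class indices, no one-hot encoding
advantages = [1.5, -0.5]

loss = policy_gradient_loss(logits_batch, actions, advantages)
```

In Keras, the `sparse_categorical_crossentropy` loss accepts integer labels directly. Whether combining it with per-sample weights set to the advantages reproduces the trick exactly is an assumption I have not verified.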

I hope my question is clear!

Many thanks for your help!

Best,

Cuong

@fredcallaway: I saw you commented on the code so I tagged you here as well. If you can give me an answer, I would really appreciate it ...