takuseno / ppo

Proximal Policy Optimization implementation with TensorFlow
MIT License

Does it matter that pi and value are in a single network? #8

Closed: initial-h closed this issue 4 years ago

initial-h commented 4 years ago

I see there is a single network that outputs both the action and the value, and the loss is loss = value_loss - policy_loss - entropy. I think they should be updated separately. Is it problematic that V and pi are updated at the same time? Thanks.
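For reference, here is a minimal TensorFlow sketch (not this repository's code) of the combined objective being asked about, for the discrete case; the function name is illustrative, loss coefficients and PPO clipping are omitted for brevity:

```python
import tensorflow as tf

def combined_loss(logits, values, actions, advantages, returns):
    # log pi(a|s) for the taken actions (discrete case)
    logp = -tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=actions, logits=logits)
    # policy term to maximize (PPO clipping omitted for brevity)
    policy_loss = tf.reduce_mean(logp * advantages)
    # squared-error value term to minimize
    value_loss = tf.reduce_mean(tf.square(returns - values))
    # entropy bonus to maximize
    probs = tf.nn.softmax(logits)
    entropy = tf.reduce_mean(
        -tf.reduce_sum(probs * tf.math.log(probs + 1e-8), axis=-1))
    # one scalar: a single optimizer step on this updates the value head,
    # the policy head, and any shared layers at the same time
    return value_loss - policy_loss - entropy
```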

takuseno commented 4 years ago

@initial-h Hi! Thank you for reading my code! The network architecture depends on the action space. In a discrete action space, the value function and the policy function branch off a single shared network. In a continuous action space, the two functions are separate networks. You can find this description in the original paper.

https://arxiv.org/abs/1707.06347
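A rough tf.keras sketch of that distinction (layer sizes and names are illustrative, not taken from this repository):

```python
import tensorflow as tf
from tensorflow.keras import layers

def discrete_actor_critic(obs_dim, num_actions):
    # Discrete actions: one shared trunk, two branched heads.
    obs = tf.keras.Input(shape=(obs_dim,))
    h = layers.Dense(64, activation="tanh")(obs)
    h = layers.Dense(64, activation="tanh")(h)
    logits = layers.Dense(num_actions)(h)   # policy head
    value = layers.Dense(1)(h)              # value head shares the trunk
    return tf.keras.Model(obs, [logits, value])

def continuous_actor_critic(obs_dim, action_dim):
    # Continuous actions: two completely separate networks.
    obs_p = tf.keras.Input(shape=(obs_dim,))
    p = layers.Dense(64, activation="tanh")(obs_p)
    p = layers.Dense(64, activation="tanh")(p)
    policy = tf.keras.Model(obs_p, layers.Dense(action_dim)(p))

    obs_v = tf.keras.Input(shape=(obs_dim,))
    v = layers.Dense(64, activation="tanh")(obs_v)
    v = layers.Dense(64, activation="tanh")(v)
    value = tf.keras.Model(obs_v, layers.Dense(1)(v))
    return policy, value
```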

initial-h commented 4 years ago

I'm sorry, but I can't find an exact explanation in the paper. PPO says the structure is the same as A3C, A3C says it's the same as DQN... it's weird.

In A3C it says: "The agents used the network architecture from Mnih et al. [2013]. The network used a convolutional layer with 16 filters of size 8×8 with stride 4, followed by a convolutional layer with 32 filters of size 4×4 with stride 2, followed by a fully connected layer with 256 hidden units. All three hidden layers were followed by a rectifier nonlinearity. The value-based methods had a single linear output unit for each action representing the action-value. The model used by actor-critic agents had two set of outputs - a softmax output with one entry per action representing the probability of selecting the action, and a single linear output representing the value function." But in DQN, the fully connected layer has 512 units. Besides, do the two separate outputs follow the fully connected layer directly, without any hidden layers in between?
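Reading the quote literally, both output heads do sit directly on top of the 256-unit fully connected layer. A rough tf.keras sketch of the architecture it describes (the 84x84x4 input shape is an assumption carried over from the DQN preprocessing):

```python
import tensorflow as tf
from tensorflow.keras import layers

def a3c_atari_network(num_actions):
    obs = tf.keras.Input(shape=(84, 84, 4))
    h = layers.Conv2D(16, 8, strides=4, activation="relu")(obs)
    h = layers.Conv2D(32, 4, strides=2, activation="relu")(h)
    h = layers.Flatten()(h)
    h = layers.Dense(256, activation="relu")(h)
    # Both heads come straight off the 256-unit layer, no extra hidden layers.
    policy = layers.Dense(num_actions, activation="softmax")(h)
    value = layers.Dense(1)(h)
    return tf.keras.Model(obs, [policy, value])
```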

takuseno commented 4 years ago

Some relevant quotes from the paper.

"If using a neural network architecture that shares parameters between the policy and value function, we must use a loss function that combines the policy surrogate and a value function error term."

"To represent the policy, we used a fully-connected MLP with two hidden layers of 64 units, and tanh nonlinearities, outputting the mean of a Gaussian distribution, with variable standard deviations, following [Sch+15b; Dua+16]. We don't share parameters between the policy and value function (so coefficient c1 is irrelevant), and we don't use an entropy bonus."

Also, the official implementation can be found here. In the Atari tasks, the network architecture is actually identical to DQN (the Nature version), except for the branched head.
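A rough tf.keras sketch of "DQN (Nature version) except the branched head": the Nature-DQN convolutional trunk and 512-unit fully connected layer, followed by separate policy and value outputs (the layer shapes are the published Nature-DQN ones; the branching is the actor-critic-specific part):

```python
import tensorflow as tf
from tensorflow.keras import layers

def ppo_atari_network(num_actions):
    obs = tf.keras.Input(shape=(84, 84, 4))
    h = layers.Conv2D(32, 8, strides=4, activation="relu")(obs)
    h = layers.Conv2D(64, 4, strides=2, activation="relu")(h)
    h = layers.Conv2D(64, 3, strides=1, activation="relu")(h)
    h = layers.Flatten()(h)
    h = layers.Dense(512, activation="relu")(h)
    logits = layers.Dense(num_actions)(h)  # policy head
    value = layers.Dense(1)(h)             # value head
    return tf.keras.Model(obs, [logits, value])
```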