pat-coady / trpo

Trust Region Policy Optimization with TensorFlow and OpenAI Gym
https://learningai.io/projects/2017/07/28/ai-gym-workout.html
MIT License
360 stars 106 forks

Help understanding how to read the code #30

Open ryanmaxwell96 opened 4 years ago

ryanmaxwell96 commented 4 years ago

Hello,

Just a quick question. In policy.py, class Policy uses the Keras package to call `get_layer`. That retrieves the output layer, correct? Also, I sent an email, so feel free to ignore this part if you already answered it there: from the TRPO paper, the NN is supposed to compute only the mean, and the stdev comes from a separate set of parameters, a vector the same size as the action dimension. But the paper isn't clear to me on how the stdev is actually computed or updated, and in this code it all happens under the hood in Keras.

Any help on this would be greatly appreciated!

Ryan
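To make the question concrete, here is a minimal numpy sketch of the scheme described in the TRPO paper: a network produces the action means, while a separate trainable vector of per-action log-variances determines the stdev. This is only an illustration, not the repo's actual code; the single linear layer standing in for the policy network, and the dimensions 27 and 6 (taken from the HalfCheetah discussion below), are assumptions.

```python
import numpy as np

# Illustrative sketch (NOT the repo's actual code): a diagonal Gaussian
# policy. A network maps the observation to per-action means; a separate
# free parameter vector log_vars (same size as the action dimension)
# holds the log-variances and is updated by the optimizer alongside the
# network weights.

obs_dim, act_dim = 27, 6            # assumed dims from the thread
rng = np.random.default_rng(0)

# stand-in for the mean network: a single linear layer (hypothetical)
W = rng.normal(scale=0.1, size=(obs_dim, act_dim))

# one free log-variance per action dimension
log_vars = np.full(act_dim, -1.0)

def sample_action(obs):
    mean = obs @ W                  # network forward pass -> (act_dim,)
    std = np.exp(log_vars / 2.0)    # stdev = exp(log_var / 2)
    return mean + std * rng.normal(size=act_dim)

obs = rng.normal(size=obs_dim)
a = sample_action(obs)
print(a.shape)  # (6,)
```

The key point is that `log_vars` is not an output of the network at all: it is a standalone parameter vector, so the stdev is state-independent but still learned.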

ryanmaxwell96 commented 4 years ago

Also, can you help me understand why there are multiple neural network outputs? I tested the halfcheetah code and found that the observation is a (1, 27) vector, but for some reason a (1, 6) vector of means is returned, and I'm at a loss as to why there are 6 means.

ryanmaxwell96 commented 4 years ago

Unless it refers to the 6 half-cheetah joints that can move, given the state (of 27 dimensions). So depending on which state it is in, the policy tells it what position each of these joints should be in via the means and log vars.

pat-coady commented 4 years ago

Exactly, the policy network takes a state and returns an action, and the dimensionality of state and action are different. I'd have to check to be sure, but my guess is the cheetah accepts only 6 actuation inputs, while many more positions and velocities are measured on the cheetah.


ryanmaxwell96 commented 4 years ago

Ok, thank you. Sorry, I have another question: where does the name "policy_nn" come from? I'm guessing it names the last layer, correct?

ryanmaxwell96 commented 4 years ago

Can you please explain why value.py has an output Dense layer of size 1? Shouldn't it be the same size as the action dimension?
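For context, the usual reason (a sketch of the standard actor-critic setup, not a quote of this repo's code) is that the value network estimates V(s), the expected return from a state, which is a single number regardless of how many action dimensions there are. The linear stand-in and the dimension 27 below are assumptions for illustration.

```python
import numpy as np

# Hedged sketch: the value function V(s) maps a state to ONE scalar
# (expected return), so the output layer has size 1 no matter how
# large the action space is.

obs_dim = 27                         # assumed obs dim from the thread
rng = np.random.default_rng(1)

# stand-in for the value network: a single linear layer ending in
# an output of size 1 (hypothetical)
W = rng.normal(scale=0.1, size=(obs_dim, 1))

def value(obs):
    return (obs @ W).item()          # scalar V(s)

v = value(rng.normal(size=obs_dim))
print(type(v))  # <class 'float'>
```

Only the policy needs an output per action dimension; the critic's job is to score the state itself.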

ryanmaxwell96 commented 4 years ago

Also, how do you use plotting.py? I don't see it currently being used in any of the code.