Hello, if you want to use the final agent, then you only need to save and load the weights of the actor.
The policy network used by PPO2 is defined here.
Related to these issues: https://github.com/hill-a/stable-baselines/issues/312 and https://github.com/hill-a/stable-baselines/issues/223
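For example, a rough way to pull out just the actor's weights in recent stable-baselines versions is the following sketch; the model path and the "/pi" name filter are assumptions based on the default MlpPolicy naming:

from stable_baselines import PPO2

model = PPO2.load("ppo2_model.pkl")  # hypothetical path

# get_parameters() returns an ordered dict mapping parameter names to numpy arrays.
params = model.get_parameters()

# Keep only the actor ("pi") weights; the exact names depend on the policy,
# e.g. model/pi_fc0/w:0 for the default MlpPolicy.
actor_params = {name: value for name, value in params.items() if "/pi" in name}
for name, value in actor_params.items():
    print(name, value.shape)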
Hello @araffin, thank you for answering!
I understand the policy matrix can be obtained by params = self.sess.run(self.params); however, I don't know what operations must be done to this matrix in order to get an action from an observation. I'm guessing this comes from the TensorFlow graph.
Is there any way of saving these operations using the TensorFlow API or do I need to do it by hand?
You just need to re-create the neural network in Java; you don't necessarily need TF, it is just matrix multiplication followed by a non-linearity, and params contains the weights of the neural net (it is several matrices). To get the action, it is a forward pass through that network. You can take a look at ONNX if you want to automate that. (I don't know much about TF export other than that.)
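As a minimal numpy sketch of that forward pass, assuming the default two-hidden-layer MlpPolicy with tanh activations (the layer sizes and the ordering of the matrices in params are assumptions here):

import numpy as np

def actor_forward(obs, actor_weights):
    # actor_weights: list of (W, b) pairs taken from the actor's matrices in
    # params, in layer order; hidden layers are an affine transform + tanh.
    x = np.asarray(obs, dtype=np.float32)
    for W, b in actor_weights[:-1]:
        x = np.tanh(x @ W + b)
    # The final layer is linear: logits for a discrete action space (take the
    # argmax), or the mean of the Gaussian for a continuous one.
    W_out, b_out = actor_weights[-1]
    return x @ W_out + b_out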
Hello again, sorry for the late answer.
I was able to actually get the model working using the TensorFlow API. I'll post the code here later to check with you guys if it is a viable option.
I am also trying to implement the neural network with the weights from params; however, the first matrix has an input size different from my observation size. Shouldn't they be the same so that the multiplication is possible? Or am I getting the wrong params?
EDIT: OK, I figured out that the params of PPO are the weights for two neural networks, and that the size is different probably because the model is using binary input nodes, therefore it has an input node for each possible value. I am still trying to understand the rest.
I was able to actually get the model working using the TensorFlow API. I'll post the code here later to check with you guys if it is a viable option.
Sounds good =)
params of PPO are the weights for two neural networks
Exactly, you have the weights of the actor (what you want) and the critic (only needed for training).
therefore it has an input node for each possible value.
For all the transformations happening to the input, I recommend taking a look at https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/common/input.py
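In short, for a Discrete observation space the observation is one-hot encoded before it reaches the first layer, which is why the first weight matrix does not match the raw observation size. A toy illustration:

import numpy as np

def one_hot_obs(obs_value, n_values):
    # A Discrete(n_values) observation becomes a vector with one entry per
    # possible value, so the first weight matrix expects n_values inputs.
    encoded = np.zeros(n_values, dtype=np.float32)
    encoded[obs_value] = 1.0
    return encoded

print(one_hot_obs(2, 4))  # [0. 0. 1. 0.]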
Thank you for answering @araffin!
The code I'm using to save the model is the following:
### Update: This code has some issues. Check comments below.
import os
import shutil
import tensorflow as tf
from stable_baselines import PPO2
from stable_baselines.common import tf_util

def generate_checkpoint_from_model(model_path, checkpoint_name):
    model = PPO2.load(model_path)
    with model.graph.as_default():
        sess = tf_util.make_session(graph=model.graph)
        tf.global_variables_initializer().run(session=sess)
        if os.path.exists(checkpoint_name):
            shutil.rmtree(checkpoint_name)
        tf.saved_model.simple_save(sess, checkpoint_name,
                                   inputs={"obs": model.act_model.obs_ph},
                                   outputs={"action": model.action_ph})
This saves a .pb file along with the variables files, which in turn can be loaded in Java using the TensorFlow API and:

SavedModelBundle b = SavedModelBundle.load(checkpointDir, "serve");

Then, I just feed the observation as the input/Ob tensor and fetch output/strided_slice:
// sess is presumably the SavedModelBundle's session, i.e. b.session()
Tensor result = sess.runner()
        .feed("input/Ob", inputTensor)
        .fetch("output/strided_slice")
        .run().get(0);
However, one thing I'm noticing is that every time I load the model with model = PPO2.load(model_path), even though I haven't done any training, the params change. I noticed it has something to do with the setup_model() function. Why do the weights change? Is there a way of reloading a saved model without changing the weights?
Thanks for sharing the code.
Why do the weights change? Is there a way of reloading a saved model without changing the weights?
What do you mean by "the weights change"?
What do you mean by "the weights change"?
If I run the following code twice, the arrays that represent the weights inside params have different values for each run. Also, the arrays that represent the biases have every value set to 0.
model = PPO2.load(model_path)
with model.graph.as_default():
    sess = tf_util.make_session(graph=model.graph)
    tf.global_variables_initializer().run(session=sess)
    params = sess.run(model.params)
Why don't you use model.sess? (I'd have to think about what could go wrong with using another session.)
You are correct! Thanks again! :D I didn't know about the sess attribute. I have changed my code to the following, and the weights are consistent in each run (both on the Python side as well as on the Java side):
import os
import shutil
import tensorflow as tf
from stable_baselines import PPO2

def generate_checkpoint_from_model(model_path, checkpoint_name):
    model = PPO2.load(model_path)
    with model.graph.as_default():
        if os.path.exists(checkpoint_name):
            shutil.rmtree(checkpoint_name)
        tf.saved_model.simple_save(model.sess, checkpoint_name,
                                   inputs={"obs": model.act_model.obs_ph},
                                   outputs={"action": model.action_ph})
I'll be testing whether the results are the same in Java and Python, but since the weights are the same, I'm expecting the results to be as well.
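For a quick sanity check on the Python side before moving to Java, something along these lines should work; the tensor names input/Ob:0 and output/strided_slice:0 are the ones mentioned above and may differ for other policies, and the export directory and observation shape are placeholders:

import numpy as np
import tensorflow as tf

checkpoint_dir = "ppo2_checkpoint"        # hypothetical export directory
obs = np.zeros((1, 4), dtype=np.float32)  # hypothetical observation batch

# Load the exported SavedModel into a fresh graph/session and run one
# observation through it, mirroring what the Java code does.
with tf.Session(graph=tf.Graph()) as sess:
    tf.saved_model.loader.load(sess, ["serve"], checkpoint_dir)
    action = sess.run("output/strided_slice:0",
                      feed_dict={"input/Ob:0": obs})
    print(action)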
Hello, I'm a newbie trying to learn stable-baselines. I could not find in the documentation how to run and render a trained agent (using the saved weights). Would it be possible to add a section in the documentation with a simple example (like the cartpole example)?
Update: never mind, I found an example here: https://stable-baselines.readthedocs.io/en/master/modules/a2c.html
@jazzchipc closing this issue as it seems to be resolved, feel free to re-open it if you encounter any problem with this topic.
It is very hard to find the right operation (PPO2). In my case I use a continuous action space, and the pi-net is supposed to give me a mean and std; all I need is a Gaussian draw and clipping. However, the clipping and drawing are done outside of the TF framework. I actually fail to get the right nodes of the graph, and even at the parts of the code where a sess.run(.) gets invoked, I only find variables with a rich set of dependencies.
@jazzchipc: How did you choose to fetch output/strided_slice? Why is this the node I'm looking for? In my case it's yielding an INT32 tensor, which is clearly not intended (continuous action space).
@Antalagor This was quite a while ago, but I remember going through the PPO2 baseline code using a debugger to see what tensors were used to get the action for my environment (which was discrete, not continuous) and seeing that it got the action from output/strided_slice. I'm afraid I can't help much more than that, sorry.
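For reference, one way to look for candidate node names without a debugger is to list the operations in the loaded model's graph; which one corresponds to the action depends on the algorithm and policy (e.g. output/strided_slice here, output/add in the continuous case below). A rough sketch, with a hypothetical model path:

from stable_baselines import PPO2

model = PPO2.load("ppo2_model.pkl")  # hypothetical path

# Print every operation name in the model's graph; the observation placeholder
# (e.g. input/Ob) and the action output node show up in this list.
for op in model.graph.get_operations():
    print(op.name)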
Thx for the answer, debugging turned out to be a great idea. I use FeedForwardPolicy. Line 576 in stable_baselines (2.8.0) common/policies.py invokes the sess.run that yields the action values. The op is named output/add in my case. (Clipping to action-space bounds is handled with numpy code outside the TF world, so this still needs to be done afterwards in the JVM.)
I validated by feeding a constant observation tensor multiple times through both models (original Python model and imported Java representation) and comparing the resulting action distributions.
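For completeness, the post-processing that stays outside the exported graph is just a clip to the action-space bounds, e.g. (shown in numpy with made-up values; the same one-liner is needed on the JVM side, with bounds taken from env.action_space):

import numpy as np

raw_action = np.array([1.7, -3.2], dtype=np.float32)  # hypothetical graph output
low = np.array([-1.0, -1.0], dtype=np.float32)        # env.action_space.low
high = np.array([1.0, 1.0], dtype=np.float32)         # env.action_space.high

clipped_action = np.clip(raw_action, low, high)
print(clipped_action)  # [ 1. -1.]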
Hi, I wonder how I should change this function to make it work for SAC instead of PPO?
tf.saved_model.simple_save(model.sess, checkpoint_name, inputs={"obs": model.act_model.obs_ph}, outputs={"action": model.action_ph})
@bektaskemal
You need to find the right input and output tensors for this, and a good place to look for them is the predict function and the associated policies. Looks like you need something like:

tf.saved_model.simple_save(model.sess, checkpoint_name,
                           inputs={"obs": model.policy_tf.obs_ph},
                           outputs={"action": model.policy_tf.policy})
Mind you, I have not tested this.
Thanks, but I get

AttributeError: 'PPO1' object has no attribute 'policy_tf'

and indeed, it only seems to have policy, policy_pi and policy_kwargs. I tried

inputs_dict = {
    "obs": model.policy.obs_ph
}
outputs_dict = {
    "action": model.policy.action_ph
}

and

inputs_dict = {
    "obs": model.policy.obs_ph
}
outputs_dict = {
    "action": model.policy.policy
}

but in both cases get

AttributeError: 'property' object has no attribute 'dtype'
@crobarcro
You have to find the right variables for these, which vary from algorithm to algorithm. Also, I suggest you use PPO2 instead of PPO1: the algorithm is the same, PPO1 just uses (older, not-so-maintained) MPI to run the code, while PPO2 is better maintained.
I am using a PPO2 agent to train on a custom environment. I use the save function to store everything in a .pkl in the callback function, similar to the example from the Colab notebook.

What I would like to do is extract from the .pkl file only what is necessary to take an observation and return an action. I would use this data in a Java program to get the action that I need without having to use Python, something like a function float[] GetAction(float[] observation). I do not need to train the agent. I just need its "final" state and everything needed to take an observation array and create the action array.

I believe the best way to do this would be using TensorFlow's API, more specifically the saved_model.simple_save function, documented here. With this, I would be able to load the model in Java using the Java API for TensorFlow. However, I do not know what I should use as inputs and outputs for this function. I have tried to better understand PPO2's code, but I have limited knowledge of these TensorFlow methods and cannot figure it out.

If someone could point me in the right direction, I would appreciate it.

Thanks for your help and awesome work on this repo ;)