hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

[question] How to save a trained PPO2 agent to use in a Java program? #329

Closed josealeixopc closed 5 years ago

josealeixopc commented 5 years ago

I am using a PPO2 agent to train on a custom environment. I use the save function to store everything in a .pkl in the callback function, similar to the example from the Colab notebook.

import numpy as np

from stable_baselines import logger
from stable_baselines.results_plotter import load_results, ts2xy


def callback(_locals, _globals):
    """
    Callback called at each step (for DQN and others) or after n steps (see ACER or PPO2)
    :param _locals: (dict)
    :param _globals: (dict)
    """
    global n_steps, best_mean_reward, saving_interval, pickle_dir

    # Print stats every X calls
    if (n_steps + 1) % saving_interval == 0:
        # Evaluate policy training performance
        x, y = ts2xy(load_results(log_dir), 'timesteps')
        if len(x) > 0:
            mean_reward = np.mean(y[-100:])
            logger.info("{} timesteps".format(x[-1]))
            logger.info(
                "Best mean reward: {:.2f} - Last mean reward per episode: {:.2f}".format(best_mean_reward, mean_reward))

            # New best model, you could save the agent here
            if mean_reward > best_mean_reward:
                best_mean_reward = mean_reward
                # Example for saving best model
                logger.info("Saving new best model")
                _locals['self'].save(pickle_dir + 'ppo2_best_model.pkl')
    n_steps += 1
    # Returning False will stop training early
    return True

What I would like to do is extract from the .pkl file only what is necessary to take an observation and return an action. I would use this data in a Java program to get the action I need without having to use Python, something like a function float[] GetAction(float[] observation). I do not need to train the agent; I just need its "final" state and everything needed to take an observation array and create the action array.

I believe the best way to do this would be using TensorFlow's API, more specifically the saved_model.simple_save function, documented here. With this, I would be able to load the model in Java using TensorFlow's Java API. However, I do not know what I should use as inputs and outputs for this function. I have tried to better understand PPO2's code, but I have limited knowledge of these TensorFlow methods and cannot figure it out.

If someone could point me in the right direction I would appreciate it.

Thanks for your help and awesome work on this repo ;)

araffin commented 5 years ago

Hello, if you want to use the final agent, then you only need to save and load the weights of the actor.

The policy network used by PPO2 is defined here

Related to that issue: https://github.com/hill-a/stable-baselines/issues/312 and https://github.com/hill-a/stable-baselines/issues/223

josealeixopc commented 5 years ago

Hello @araffin, thank you for answering!

I understand the policy parameters can be obtained with params = self.sess.run(self.params); however, I don't know what operations must be applied to them in order to get an action from an observation. I'm guessing this comes from the TensorFlow graph.

Is there any way of saving these operations using the TensorFlow API or do I need to do it by hand?

araffin commented 5 years ago

You just need to re-create the neural network in Java; you don't necessarily need TF. It is just matrix multiplication followed by a non-linearity, and params contains the weights of the neural net (it is several matrices). To get an action, you do a forward pass through that network. You can take a look at ONNX if you want to automate that. (I don't know much about TF export other than that.)
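
For reference, a minimal numpy sketch of such a forward pass. It assumes the default MlpPolicy (two hidden layers of 64 tanh units) and the variable names I would expect for it; check model.get_parameters().keys() (available in recent stable-baselines versions) for the actual names and shapes in your model:

import numpy as np
from stable_baselines import PPO2

model = PPO2.load("ppo2_best_model.pkl")
params = model.get_parameters()  # OrderedDict: variable name -> numpy array

def get_action(obs):
    # Forward pass through the actor only (the "pi" layers);
    # the "vf" layers belong to the critic and are not needed at inference time.
    h = np.asarray(obs, dtype=np.float32)
    for layer in ("model/pi_fc0", "model/pi_fc1"):
        h = np.tanh(h @ params[layer + "/w:0"] + params[layer + "/b:0"])
    # Last linear layer: action mean for continuous spaces, logits for discrete ones
    return h @ params["model/pi/w:0"] + params["model/pi/b:0"]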

josealeixopc commented 5 years ago

Hello again, sorry for the late answer.

I was able to actually get the model working using the TensorFlow API. I'll post the code here later to check with you guys whether it is a viable option.

I am also trying to implement the neural network with the weights from params; however, the first matrix has an input size different from my observation size. Shouldn't they be the same so that the multiplication is possible? Or am I getting the wrong params?

EDIT: OK, I figured out that the params of PPO are the weights for two neural networks, and that the size is different probably because the model is using binary (one-hot) input nodes, therefore it has an input node for each possible value. I am still trying to understand the rest.

araffin commented 5 years ago

I was able to actually get the model working using the TensorFlow API. I'll post the code here later to check with you guys whether it is a viable option.

Sounds good =)

params of PPO are the weights for two neural networks

Exactly, you have the weights of the actor (what you want) and the critic (only needed for training).

therefore it has an input node for each possible value.

For all the transformations happening to the input, I recommend taking a look at https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/common/input.py
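
For instance, a Discrete(n) observation is one-hot encoded before it reaches the first layer, which is why the first weight matrix can be wider than the raw observation. A rough sketch of that transform (not the library code itself):

import numpy as np

def one_hot(obs, n):
    # Mirrors what observation_input does for Discrete(n) spaces (sketch only)
    encoded = np.zeros(n, dtype=np.float32)
    encoded[obs] = 1.0
    return encoded

one_hot(2, 5)  # -> array([0., 0., 1., 0., 0.], dtype=float32)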

josealeixopc commented 5 years ago

Thank you for answering @araffin!

The code I'm using to save the model is the following:

### Update: This code has some issues. Check the comments below.
import os
import shutil

import tensorflow as tf

from stable_baselines import PPO2
from stable_baselines.common import tf_util


def generate_checkpoint_from_model(model_path, checkpoint_name):
    model = PPO2.load(model_path)

    with model.graph.as_default():
        sess = tf_util.make_session(graph=model.graph)
        tf.global_variables_initializer().run(session=sess)

        if os.path.exists(checkpoint_name):
            shutil.rmtree(checkpoint_name)

        tf.saved_model.simple_save(sess, checkpoint_name, inputs={"obs": model.act_model.obs_ph},
                                   outputs={"action": model.action_ph})

This saves a .pb file along with the variables files, which can in turn be loaded in Java using the TensorFlow API:

SavedModelBundle b = SavedModelBundle.load(checkpointDir, "serve");

Then, I just feed the observation as the input/Ob tensor and fetch output/strided_slice:

Tensor result = sess.runner()
                .feed("input/Ob", inputTensor)
                .fetch("output/strided_slice").run().get(0);

However, one thing I'm noticing is that every time I load the model with model = PPO2.load(model_path), even though I haven't done any training, the params change. I noticed it has something to do with the setup_model() function. Why do the weights change? Is there a way of reloading a saved model without changing the weights?

araffin commented 5 years ago

Thanks for sharing the code.

Why do the weights change? Is there a way of reloading a saved model without changing the weights?

What do you mean by "the weights change"?

josealeixopc commented 5 years ago

What do you mean by "the weights change"?

If I run the following code twice, the arrays that represent the weights inside params have different values for each run. Also, the arrays that represent the biases have every value set to 0.

model = PPO2.load(model_path)

with model.graph.as_default():
    sess = tf_util.make_session(graph=model.graph)
    tf.global_variables_initializer().run(session=sess)
    params = sess.run(model.params)

araffin commented 5 years ago

Why don't you use model.sess? (I would have to think about what could go wrong with another session.)

josealeixopc commented 5 years ago

You are correct! Thanks again! :D I didn't know about the sess attribute. Creating a new session and running tf.global_variables_initializer() was re-initializing all the variables, which is why the weights changed on every run (and the biases were all zero). I have changed my code to the following, and the weights are now consistent in each run (both on the Python side and on the Java side):

def generate_checkpoint_from_model(model_path, checkpoint_name):
    model = PPO2.load(model_path)

    with model.graph.as_default():
        if os.path.exists(checkpoint_name):
            shutil.rmtree(checkpoint_name)

        tf.saved_model.simple_save(model.sess, checkpoint_name, inputs={"obs": model.act_model.obs_ph},
                                   outputs={"action": model.action_ph})

I'll be testing whether the results are the same in Java and Python, but since the weights are, I'm expecting the results to be as well.
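
For the record, a rough sketch of how that check could look on the Python side alone, assuming a Box observation space; the checkpoint directory name is a placeholder, and the tensor names are the ones used above:

import numpy as np
import tensorflow as tf
from stable_baselines import PPO2

model = PPO2.load("ppo2_best_model.pkl")
obs = np.zeros((1,) + model.observation_space.shape, dtype=np.float32)

# Load the exported SavedModel in a fresh graph and run the same observation
with tf.Session(graph=tf.Graph()) as sess:
    tf.saved_model.loader.load(sess, ["serve"], "ppo2_checkpoint")
    exported_action = sess.run("output/strided_slice:0",
                               feed_dict={"input/Ob:0": obs})

original_action, _ = model.predict(obs, deterministic=True)
print(exported_action, original_action)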

jarlva commented 5 years ago

Hello, I'm a newbie trying to learn stable-baselines. I could not find in the documentation how to run and render a trained agent (using the saved weights). Would it be possible to add a section to the documentation with a simple example (like the CartPole example)?

Update: Never mind, I found an example in: https://stable-baselines.readthedocs.io/en/master/modules/a2c.html

araffin commented 5 years ago

@jazzchipc closing this issue as it seems to be resolved, feel free to re-open it if you encounter any problem with this topic.

Antalagor commented 4 years ago

It is very hard to find the right operation (PPO2). In my case I use a continuous action space, and the pi-net is supposed to give me a mean and std; all I need is a Gaussian sample and clipping. However, the clipping and sampling are done outside of the TF framework. I actually fail to get the right nodes of the graph, and even at the parts of the code where a sess.run(...) gets invoked, I only find variables with a rich set of dependencies.

@jazzchipc: How did you choose to fetch output/strided_slice? Why is this the node I'm looking for? In my case it's yielding an INT32 tensor, which is clearly not intended (continuous action space).

josealeixopc commented 4 years ago

@Antalagor This was quite a while ago, but I remember going through the PPO2 baseline code with a debugger to see which tensors were used to get the action for my environment (which was discrete, not continuous), and seeing that it got the action from output/strided_slice. I'm afraid I can't help much more than that, sorry.

Antalagor commented 4 years ago

Thanks for the answer, debugging turned out to be a great idea. I use FeedForwardPolicy. Line 576 of stable_baselines (2.8.0) common/policies.py invokes the sess.run that yields the action values. The op is named output/add in my case. (Clipping to the action-space bounds is handled with numpy code outside the TF world, so this still needs to be done afterwards in the JVM.)

I validated by feeding a constant observation tensor multiple times through both models (the original Python model and the imported Java representation) and comparing the resulting action distributions.
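
The clipping mentioned above is just a clamp to the action-space bounds, similar to what predict() does on the Python side; a small sketch of what has to be reproduced in the JVM (shown here in Python, with the model path as a placeholder):

import numpy as np
from stable_baselines import PPO2

model = PPO2.load("ppo2_best_model.pkl")
# raw_action stands in for the output of the exported network (e.g. output/add)
raw_action = np.zeros(model.action_space.shape, dtype=np.float32)
clipped_action = np.clip(raw_action, model.action_space.low, model.action_space.high)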

bektaskemal commented 4 years ago

Hi, I wonder how I should change this function to make it work for SAC instead of PPO?

tf.saved_model.simple_save(model.sess, checkpoint_name,
                           inputs={"obs": model.act_model.obs_ph},
                           outputs={"action": model.action_ph})

Miffyli commented 4 years ago

@bektaskemal

You need to find the right input and output tensors for this, and a good place to look for them is the predict function and the associated policies. Looks like you need something like:

tf.saved_model.simple_save(model.sess, checkpoint_name, 
    inputs={"obs": model.policy_tf.obs_ph},
    outputs={"action": model.policy_tf.policy}
)

Mind you, I have not tested this.

crobarcro commented 4 years ago

Thanks, but I get

AttributeError: 'PPO1' object has no attribute 'policy_tf'

and indeed, it only seems to have policy, policy_pi and policy_kwargs. I tried

inputs_dict = {
                "obs": model.policy.obs_ph
              }

outputs_dict = {
    "action": model.policy.action_ph
}

and

inputs_dict = {
                "obs": model.policy.obs_ph
              }

outputs_dict = {
    "action": model.policy.policy
}

but in both cases get

AttributeError: 'property' object has no attribute 'dtype'

Miffyli commented 4 years ago

@crobarcro

You have to find the right variables for these, which vary from algorithm to algorithm. Also, I suggest you use PPO2 instead of PPO1: the algorithm is the same, but PPO1 uses the (older, not-so-well-maintained) MPI backend to run the code, while PPO2 is more actively maintained.
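
One way to hunt for the right tensors is to list the operations in the loaded model's graph and see what the predict() path feeds and fetches; a small sketch (the model path is a placeholder):

from stable_baselines import PPO2

model = PPO2.load("ppo2_model.pkl")

# Print every operation name in the graph; the input placeholder and the
# action output op (e.g. "input/Ob", "output/...") should show up here.
for op in model.graph.get_operations():
    print(op.name)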