hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

Possible numerical instability of gradient calculation in PPO2 (?) #340

Closed jkuball closed 5 years ago

jkuball commented 5 years ago

First of all, I'm not really sure whether this is a problem on my side or a bug on your side, but I've been trying to debug this for several days now and I really don't know what to do anymore.

Bug description

The bug I'm facing is easily described: I get NaN values while training an MlpPolicy with PPO2 on a custom environment I'm writing for my master's thesis.

The stacktrace is the following:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had NaN values
         [[{{node loss/VerifyFinite/CheckNumerics}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 90, in <module>
    model.learn(config["ppo"]["num_timesteps"])
  File "/home/jkuball/Git/stable-baselines/stable_baselines/ppo2/ppo2.py", line 307, in learn
    update=timestep))
  File "/home/jkuball/Git/stable-baselines/stable_baselines/ppo2/ppo2.py", line 261, in _train_step
    [self.pg_loss, self.vf_loss, self.entropy, self.approxkl, self.clipfrac, self._train], td_map)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had NaN values
         [[node loss/VerifyFinite/CheckNumerics (defined at /home/jkuball/Git/stable-baselines/stable_baselines/ppo2/ppo2.py:175) ]]

Caused by op 'loss/VerifyFinite/CheckNumerics', defined at:
  File "train.py", line 81, in <module>
    ent_coef=config["ppo"]["entropy_coefficient"],
  File "/home/jkuball/Git/stable-baselines/stable_baselines/ppo2/ppo2.py", line 93, in __init__
    self.setup_model()
  File "/home/jkuball/Git/stable-baselines/stable_baselines/ppo2/ppo2.py", line 175, in setup_model
    grads, _grad_norm = tf.clip_by_global_norm(grads, self.max_grad_norm)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/clip_ops.py", line 271, in clip_by_global_norm
    "Found Inf or NaN global norm.")
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/numerics.py", line 44, in verify_tensor_all_finite
    return verify_tensor_all_finite_v2(t, msg, name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/numerics.py", line 62, in verify_tensor_all_finite_v2
    verify_input = array_ops.check_numerics(x, message=message)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 919, in check_numerics
    "CheckNumerics", tensor=tensor, message=message, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Found Inf or NaN global norm. : Tensor had NaN values
         [[node loss/VerifyFinite/CheckNumerics (defined at /home/jkuball/Git/stable-baselines/stable_baselines/ppo2/ppo2.py:175) ]]

It looks like the NaNs are occurring in this call of tf.gradients. For further debugging, I added some assertions:

diff --git a/stable_baselines/ppo2/ppo2.py b/stable_baselines/ppo2/ppo2.py
index eb009ce..0af1e9e 100644
--- a/stable_baselines/ppo2/ppo2.py
+++ b/stable_baselines/ppo2/ppo2.py
@@ -170,7 +170,14 @@ class PPO2(ActorCriticRLModel):
                         if self.full_tensorboard_log:
                             for var in self.params:
                                 tf.summary.histogram(var.name, var)
+
+                    loss = tf.debugging.assert_all_finite(loss, msg="rip loss")
+
                     grads = tf.gradients(loss, self.params)
+
+                    grads = [ tf.debugging.assert_all_finite(grad, msg=f"rip grad{i}") if grad is not None else None
+                              for i, grad in enumerate(grads) ]
+
                     if self.max_grad_norm is not None:
                         grads, _grad_norm = tf.clip_by_global_norm(grads, self.max_grad_norm)
                     grads = list(zip(grads, self.params))

With those assertions added, I'm fairly confident that the tf.gradients call is the problem and that the NaNs aren't propagated from the loss variable, since the gradient at index 14 is the one that raises the error.

Googling leads me to the assumption that this has to do with numerical instability in the gradient calculation, so I thought it might help to add an epsilon on top of the loss variable.

+                    eps = tf.constant(1e-7)
+                    loss = tf.add(loss, eps)

Sadly, this doesn't help and the error persists. I'm not really sure what to do next, and it doesn't help that every test run takes multiple hours to verify.
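
One thing that should at least tell me which parameter the failing gradient belongs to is a quick lookup like the following sketch (not something from stable-baselines itself; `model` stands for the PPO2 instance from my training script):

# sketch: list PPO2's trainable variables so the failing gradient index
# (14 in the assertion message) can be matched to a concrete parameter
for i, var in enumerate(model.params):
    print(i, var.name, var.shape)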

Code example

I can't provide a minimal code example, and the problem occurs only after one to three hours of training on my machine, but I'll happily test anything anyone suggests. I'm grateful for every comment; I really have to fix this.

System Info

I don't think this is a hardware or installation problem, but I'll add the system info:

araffin commented 5 years ago

Hello,

I installed via pip install -e .

What is your version? master? (v2.5.1?)

Could you provide the hyperparameters used, and did you see any other metric having too-high values in TensorBoard? Did you have the same problem with other algorithms? (A2C for instance)

Unfortunately, this is surely due to either your environment or the chosen hyper-parameters.
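
For reference, a minimal sketch of how to get those metrics into TensorBoard (the log directory is a placeholder and `env` stands for your vec env):

from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy

# log losses / entropy / explained variance so spikes are visible before the NaN;
# full_tensorboard_log additionally records parameter histograms
model = PPO2(MlpPolicy, env, verbose=1,
             tensorboard_log="./ppo2_tensorboard/",
             full_tensorboard_log=True)
model.learn(total_timesteps=1000000)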

jkuball commented 5 years ago

Hi! Thanks for the fast answer.

What is your version? master? (v2.5.1?)

I was on bea2eed5b45cce875a762c55f342b1ba4e08fd3a and have now updated to the latest version.

Did you have the same problem with other algorithms? (A2C for instance)

I did not test this. I can, but I really need to use PPO2.

Could you provide the hyperparameters used [...]

Mostly I'm using the default values, but I chose the batch-related settings relatively arbitrarily. I have a vec env with 256 agents, a trajectory horizon of 128, and nminibatches of 2048.
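
Roughly, in code (a paraphrase of my setup, not the literal script; `make_env` and `config` are placeholders for my own code):

from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import SubprocVecEnv, VecNormalize

# make_env(rank) returns a callable that builds one env instance
env = SubprocVecEnv([make_env(rank) for rank in range(256)])  # 256 parallel envs
env = VecNormalize(env)                                       # normalizes rewards

model = PPO2(MlpPolicy, env,
             n_steps=128,        # trajectory horizon per env
             nminibatches=2048,
             ent_coef=config["ppo"]["entropy_coefficient"])

If I understand nminibatches correctly, that means the 256 * 128 = 32768 samples collected per update get split into 2048 minibatches of 16 samples each.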

[...] and did you see any other metric having too-high values in TensorBoard?

I don't think so; unfortunately, I don't have the data to review right now. I'm currently generating a new set.

Unfortunately, this is surely due to either your environment or the chosen hyper-parameters.

That might very well be the case; I'm really inexperienced in the field of reinforcement learning, so there is a good chance I chose bad hyperparameters. (It does help to hear that you think this is the problem and not something in the PPO implementation itself.)

My environment has very sparse rewards, so I wrote a vec env curiosity wrapper using Keras. I'm also using this project's VecNormalize wrapper to normalize the rewards.

And now that I'm writing this, I have a suspicion. I'm normalizing the sum of the intrinsic and extrinsic rewards, and the extrinsic reward is mostly 0. Could it be that the normalized reward explodes when the environment is "solved", because the extrinsic reward is suddenly much higher? I set it to 100, which is probably really far from the moving average in the VecNormalize wrapper. Could a reward that is too high cause this issue?
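
For context, this is roughly how I wrap the rewards (a sketch with the default arguments spelled out; `venv` stands for my curiosity-wrapped vec env):

from stable_baselines.common.vec_env import VecNormalize

# if I read the code correctly, the normalized reward is clipped to
# [-clip_reward, clip_reward] after being divided by the running return std
env = VecNormalize(venv, norm_obs=True, norm_reward=True,
                   clip_obs=10., clip_reward=10.)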

araffin commented 5 years ago

but I really need to use PPO2.

I don't see an obvious reason not to test A2C (which is really close to PPO) or another algorithm (depending on what action space you are using), unless you have made custom changes to it.

Could it be that a reward that is too high causes this issue?

It is hard to tell without looking at the logged data and knowing the problem. But this is more technical support, and we don't do that, mainly because we want to focus on stable-baselines itself (and we already have little time for that), not on a particular custom environment.

So, I would suggest you investigate further to track down where the NaN comes from, and if you find clues that it may come from the algorithm, then please come back to us so we can fix it ;)
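
For example, a cheap sanity check (sketch only, not part of stable-baselines) is to fail fast whenever the environment emits a non-finite observation or reward:

import numpy as np
from stable_baselines.common.vec_env import VecEnvWrapper

class VecAssertFinite(VecEnvWrapper):
    """Illustrative wrapper: raise as soon as the env returns NaN/inf."""

    def reset(self):
        obs = self.venv.reset()
        assert np.all(np.isfinite(obs)), "non-finite observation at reset"
        return obs

    def step_wait(self):
        obs, rewards, dones, infos = self.venv.step_wait()
        assert np.all(np.isfinite(obs)), "non-finite observation"
        assert np.all(np.isfinite(rewards)), "non-finite reward"
        return obs, rewards, dones, infos

# usage: env = VecAssertFinite(env), then train as usual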

jkuball commented 5 years ago

I don't see an obvious reason not to test A2C (which is really close to PPO) or another algorithm

You're probably right. There are some reasons in my case, but that's another story.

But this is more technical support, and we don't do that [...]

Yeah, I wasn't sure whether this really was a bug or not, but it looks like it's an oversight on my side. That's good to know; now I just have to keep debugging. Thank you for your time!

jkuball commented 5 years ago

For everyone who stumbles upon this issue via Google: in my case it looks like I had an entropy coefficient that was way too high.

The fact that badly chosen hyperparameters can result in NaNs inside the gradient calculation really threw me off. I'm closing this now, thanks for the pointer!

araffin commented 5 years ago

Good to know, what was the magnitude of the entropy coefficient? (usually it is around 0.01 or smaller for PPO/A2C)

jkuball commented 5 years ago

While aimlessly testing parameters, it ended up set to 0.5, which is arguably way too high. I have now used this blogpost to familiarize myself with the usual parameter ranges. Maybe it's good to add something like "usually between x and y" to the documentation for all parameters? (On the other hand, this is not the duty of stable-baselines, but it might be helpful for beginners.)
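
In code terms, the fix was roughly this (other arguments and setup omitted; `env` is my vec env as before):

from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy

# before: ent_coef=0.5 let the entropy bonus dominate the loss and the
# gradients eventually blew up; 0.01 is in the usual range for PPO2/A2C
model = PPO2(MlpPolicy, env, ent_coef=0.01)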

araffin commented 5 years ago

Maybe it's good to add something like "usually between x and y" to the documentation for all parameters?

I would rather recommend looking at:

  1. hyperparameters from the paper
  2. tuned hyperparameters present in the rl zoo

rather than having a pre-defined range.

Also, you should first try automatic hyperparameter tuning (available in the rl zoo), which saves a lot of effort compared to tuning by hand ;).

this is not the duty of stable-baselines

I agree that this is not the duty of SB. And if you change the default hyperparams, you should know what you are doing.