openai / baselines

OpenAI Baselines: high-quality implementations of reinforcement learning algorithms

NaN values in acktr #127

Closed: lukashermann closed this issue 6 years ago

lukashermann commented 7 years ago

Hi everyone, I'm trying to use continuous ACKTR to learn to reach a target with a MuJoCo simulation of the Jaco arm. I use exactly the same hyperparameters as for the Reacher env, and ACKTR definitely learns something meaningful: the reward goes up, and I can also see it when I render the frames.

The problem is that after some 2000-3000 iterations, the algorithm starts to produce NaN values.

The log at the time when it starts to happen looks as follows:


Iteration 3025
kl just right!

| EVAfter   | 0.984      |
| EVBefore  | 0.976      |
| EpLenMean | 200        |
| EpRewMean | -8.5       |
| EpRewSEM  | 0.82       |
| KL        | 0.00148061 |

Iteration 3026 
kl too low

| EVAfter   | 0.984       |
| EVBefore  | 0.98        |
| EpLenMean | 200         |
| EpRewMean | -7.31       |
| EpRewSEM  | 0.613       |
| KL        | 0.000913428 |

Iteration 3027
kl just right!

| EVAfter   | 0.98     |
| EVBefore  | 0.976    |
| EpLenMean | 200      |
| EpRewMean | -8.92    |
| EpRewSEM  | 0.937    |
| KL        | nan      |

Then, of course, the NaNs start to spread and everything becomes NaN. Does anyone have an idea what could cause such behaviour and how to prevent it?
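One way to catch the first bad batch, rather than watching the NaNs spread, is to assert that the rollout arrays are finite before each update. A minimal sketch, assuming the rollouts are NumPy arrays; the variable names in the commented usage are illustrative, not the exact names in acktr_cont.py:

    import numpy as np

    def assert_finite(name, arr):
        # Fail fast on the iteration where a NaN or inf first appears,
        # instead of letting it propagate through later updates.
        arr = np.asarray(arr, dtype=np.float64)
        if not np.all(np.isfinite(arr)):
            raise FloatingPointError("%s contains NaN or inf" % name)

    # Example usage inside the training loop, before the policy and
    # value-function updates (names are illustrative):
    # assert_finite("observations", ob_no)
    # assert_finite("advantages", adv_n)
    # assert_finite("returns", ret_n)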

Breakend commented 7 years ago

I've experienced this as well. I suggest swapping out those elu activations for relu or tanh. It's a bit hacky, but an alternative might be to lower the learning rate. That said, my problem was NaNs in the value function, not in the KL, so yours might be in the policy.

Breakend commented 7 years ago

nvm, that didn't fix it for me either...

valldabo2 commented 7 years ago

+1

I also get NaNs in the value function. I'm using the discrete version (acktr_disc), though.

mklissa commented 7 years ago

I get the same phenomenon straight from iteration 0 on Ant-v1 with different random seeds:

********** Iteration 0 ************
kl too low
----------------------------------
| EVAfter           | nan        |
| EVBefore          | 0.000411   |
| EpLenMean         | 146        |
| EpRewMean         | -151       |
| EpRewSEM          | 58.4       |
| KL                | 4.1343e-07 |
| Timesteps so far  | 3366       |
| Timesteps/sec     | 298        |
----------------------------------
********** Iteration 1 ************
kl just right!
--------------------------------
| EVAfter           | nan      |
| EVBefore          | nan      |
| EpLenMean         | 82.4     |
| EpRewMean         | -83.8    |
| EpRewSEM          | 33.2     |
| KL                | nan      |
| Timesteps so far  | 5921     |
| Timesteps/sec     | 323      |
--------------------------------

mansimov commented 7 years ago

Thanks @lukashermann, @Breakend, @valldabo2, @mklissa for reporting the issue!

Since @valldabo2 reported the NaN issue in Atari experiments, and I just replicated the Atari experiments without trouble in this thread https://github.com/openai/baselines/issues/130, maybe there is some system- or tensorflow-related issue. I ran those experiments with anaconda3 (Python 3.5.2, Anaconda 4.2.0 (64-bit), GCC 4.4.7) and TensorFlow 1.3. Also, how many CPU cores does your machine have?

Btw, the code I ran (line by line the same as on my machine) is available here: https://github.com/emansim/baselines-mansimov (exactly the same as openai/baselines, with some NYU-related stuff that you can ignore). Can you try rerunning it?

Also, @lukashermann mentioned to me that he ran https://github.com/emansim/acktr and didn't have NaN issues, although the code should be more or less identical.

lukashermann commented 7 years ago

@mklissa @valldabo2 I also experienced NaN values in the value function from the start when training on a custom-built environment similar to Reacher-v1, but with the Jaco arm. When I scaled the environment reward down (rew*0.1), it didn't happen anymore, so you can try that and see if it helps. The KL NaNs are a different kind of problem, though.
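If you want to try that reward scaling without editing the environment itself, a small gym.RewardWrapper is enough. A minimal sketch, using the 0.1 factor mentioned above; the ScaleReward name and the Reacher-v1 usage line are only illustrative:

    import gym

    class ScaleReward(gym.RewardWrapper):
        # Multiply every reward by a constant factor so the value-function
        # targets stay in a smaller range.
        def __init__(self, env, scale=0.1):
            super(ScaleReward, self).__init__(env)
            self.scale = scale

        def reward(self, reward):
            return self.scale * reward

    # Usage (illustrative):
    # env = ScaleReward(gym.make("Reacher-v1"), scale=0.1)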

valldabo2 commented 7 years ago

@emansim FYI: I am using my own environment, not Atari, with anaconda3, Python 3.5, on osx64. My rewards are -1, 0 and 1. I am using the baselines CnnPolicy and getting NaN values, but I have debugged and confirmed that the observations fed to the CnnPolicy contain real values.

@lukashermann Okay cool, I will try that. Do you think there is an expected value range for the CnnPolicy's observations? Or is it just a bug? It works with A2C and the CnnPolicy.

UPDATE: I tried rescaling the reward from -1, 0, 1 to -0.1, 0, 0.1 and now it works. @lukashermann

Breakend commented 7 years ago

Hey, for some more info: I had basically an identical setup to @emansim and had problems with the RoboschoolHumanoid-v1 and RoboschoolHumanoidFlagrun-v1 environments. This was after ~1000 iterations, I think. It may be an issue of reward scale causing gradient instability (explosion?), or it could be related to the gradient projections. Maybe the better approach here would be not to use KFAC updates on the value function approximator, but just plain Adam? Not sure if anyone's tried this.

mansimov commented 7 years ago

@lukashermann @valldabo2 I don't recommend scaling down the reward function, since it implies that you are changing the definition of the RL problem, but if it works well for your problem, that's great.

@Breakend We did experiments where we used Adam for the value function instead of KFAC, and it turned out not to work as well as KFAC. See figure 5 a), b) in the paper.

Regarding the NaNs in the value function, I suggest you try the following:

  1. If the NaNs in the value function appear at the very beginning, the cause can be the first 50 cold SGD iterations, before the optimizer switches to the Kronecker-factored trust region. I suggest you set the max_grad_norm variable to 1.0 (or play with this hyperparameter) here https://github.com/openai/baselines/blob/master/baselines/acktr/value_functions.py#L25, which will clip the gradients for SGD during those first 50 iterations (a rough sketch of that kind of clipping follows this list).

  2. Reduce the learning rate and the cold learning rate in the value function https://github.com/openai/baselines/blob/master/baselines/acktr/value_functions.py#L22, or try annealing the learning rate down.

  3. Change the initialization of the last layer in the value function https://github.com/openai/baselines/blob/master/baselines/acktr/value_functions.py#L16 from U.normc_initializer(1.0) to None.
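For reference, the max_grad_norm clipping mentioned in point 1 amounts to something like the following in plain TensorFlow 1.x terms. This is only a sketch of the idea, not the actual KfacOptimizer / value_functions.py code; clipped_sgd_update and its arguments are illustrative names:

    import tensorflow as tf

    def clipped_sgd_update(loss, params, learning_rate, max_grad_norm=1.0):
        # Clip the global gradient norm before applying a plain SGD step,
        # which is roughly what max_grad_norm does for the cold iterations.
        grads = tf.gradients(loss, params)
        clipped_grads, _ = tf.clip_by_global_norm(grads, max_grad_norm)
        opt = tf.train.GradientDescentOptimizer(learning_rate)
        return opt.apply_gradients(list(zip(clipped_grads, params)))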

I am getting the NaN issue in the value function too with @lukashermann's experiments, after ~3000-4000 iterations (which is a bit weird), and I am investigating it. Thanks!

mansimov commented 6 years ago

Ok, I found a small detail in the stepsize adjustment that wasn't in the baselines code and that fixes the NaN issue in @lukashermann's Jaco environment and in the Roboschool humanoid environments, @Breakend.

Change lines 121-129 in https://github.com/openai/baselines/blob/master/baselines/acktr/acktr_cont.py to

        min_stepsize = np.float32(1e-8)
        max_stepsize = np.float32(1e0)
        # Adjust stepsize
        kl = policy.compute_kl(ob_no, oldac_dist)
        if kl > desired_kl * 2:
            logger.log("kl too high")
            U.eval(tf.assign(stepsize, tf.maximum(min_stepsize, stepsize / 1.5)))
        elif kl < desired_kl / 2:
            logger.log("kl too low")
            U.eval(tf.assign(stepsize, tf.minimum(max_stepsize, stepsize * 1.5)))
        else:
            logger.log("kl just right!")

I will create a pull request with this fix and other misc small tweaks soon. Thanks for your patience!

lukashermann commented 6 years ago

@emansim I tried it, and the NaNs no longer occur, so it seems this fixed the problem :) Thank you!

jirenu commented 6 years ago

I was still getting NaNs for a custom environment; this was remedied by scaling the reward down as suggested.