leggedrobotics / legged_gym

Isaac Gym Environments for Legged Robots

Random ValueError #16

Open Cdfghglz opened 2 years ago

Cdfghglz commented 2 years ago

Hi,

I have been training a custom robot based on the a1 example, and I repeatedly get the following error a random number of seconds into training:

Traceback (most recent call last):
  File "legged_gym/scripts/train.py", line 47, in <module>
    train(args)
  File "legged_gym/scripts/train.py", line 43, in train
    ppo_runner.learn(num_learning_iterations=train_cfg.runner.max_iterations, init_at_random_ep_len=True)
  File "/home/pr_admin/repos/rsl_rl/rsl_rl/runners/on_policy_runner.py", line 107, in learn
    actions = self.alg.act(obs, critic_obs)
  File "/home/pr_admin/repos/rsl_rl/rsl_rl/algorithms/ppo.py", line 94, in act
    self.transition.actions = self.actor_critic.act(obs).detach()
  File "/home/pr_admin/repos/rsl_rl/rsl_rl/modules/actor_critic.py", line 124, in act
    self.update_distribution(observations)
  File "/home/pr_admin/repos/rsl_rl/rsl_rl/modules/actor_critic.py", line 121, in update_distribution
    self.distribution = Normal(mean, mean*0. + self.std)
  File "/home/pr_admin/.local/lib/python3.6/site-packages/torch/distributions/normal.py", line 50, in __init__
    super(Normal, self).__init__(batch_shape, validate_args=validate_args)
  File "/home/pr_admin/.local/lib/python3.6/site-packages/torch/distributions/distribution.py", line 56, in __init__
    f"Expected parameter {param} "
ValueError: Expected parameter loc (Tensor of shape (4096, 12)) of distribution Normal(loc: torch.Size([4096, 12]), scale: torch.Size([4096, 12])) to satisfy the constraint Real(), but found invalid values:
tensor([[-0.3947, -0.7582,  0.0545,  ..., -0.0636, -0.6433, -0.7390],
        [ 0.8675,  2.4356,  0.0706,  ...,  1.3473,  0.9501,  0.3461],
        [ 0.1058, -2.1669,  0.2811,  ...,  0.1533, -0.2502,  0.6426],
        ...,
        [ 0.3339, -0.1643, -0.0863,  ...,  0.4542,  0.7566, -1.9923],
        [-0.5428, -1.2139, -0.6498,  ...,  0.0080,  1.8390,  0.1338],
        [-0.3889, -0.3290,  0.1571,  ..., -0.0942, -1.7548,  0.1372]],
       device='cuda:0')

Any idea where it could come from? Memory is fine, and it does not happen with anymal or a1.

Thanks!

Cdfghglz commented 2 years ago

I found out that some of the observation means are NaN: [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]. Any hints why this could happen?
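
A minimal sketch of how one could locate the offending environments, assuming obs is the (num_envs, num_obs) tensor returned by env.get_observations() (the random tensor below is only a stand-in for it):

import torch

# Minimal sketch: locate environments whose observations contain NaN/Inf.
# 'obs' is a stand-in for the (num_envs, num_obs) tensor that
# env.get_observations() returns in legged_gym.
obs = torch.randn(4096, 48)
obs[7, :] = float('nan')  # simulate one broken environment

bad_envs = (~torch.isfinite(obs)).any(dim=1).nonzero(as_tuple=False).flatten()
if bad_envs.numel() > 0:
    print("non-finite observations in envs:", bad_envs.tolist())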

EricVoll commented 2 years ago

I had something similar once: at some point some rotations became singular, which "injected" NaN values into the system, and from then on everything was NaN.
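
A rough way to check for that, as a sketch: flag base quaternions that are NaN or nearly zero-length, assuming the usual Isaac Gym root-state layout with the quaternion in columns 3:7 (the random tensor below is only a stand-in):

import torch

# Rough sketch: flag quaternions that are NaN or nearly zero-length; a
# degenerate quaternion cannot be normalized and produces NaNs downstream.
# root_states stands in for the simulator's root-state tensor (quat in 3:7).
root_states = torch.randn(4096, 13)
quat = root_states[:, 3:7]
norms = quat.norm(dim=1)
bad = (~torch.isfinite(norms)) | (norms < 1e-6)
if bad.any():
    print("degenerate base quaternions in envs:",
          bad.nonzero(as_tuple=False).flatten().tolist())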

ShangqunYu commented 2 years ago

Having the identical issue.

Traceback (most recent call last):
  File "legged_gym/scripts/train.py", line 47, in <module>
    train(args)
  File "legged_gym/scripts/train.py", line 43, in train
    ppo_runner.learn(num_learning_iterations=train_cfg.runner.max_iterations, init_at_random_ep_len=True)
  File "/home/simon/Downloads/rsl_rl/rsl_rl/runners/on_policy_runner.py", line 132, in learn
    mean_value_loss, mean_surrogate_loss = self.alg.update()
  File "/home/simon/Downloads/rsl_rl/rsl_rl/algorithms/ppo.py", line 131, in update
    self.actor_critic.act(obs_batch, masks=masks_batch, hidden_states=hid_states_batch[0])
  File "/home/simon/Downloads/rsl_rl/rsl_rl/modules/actor_critic.py", line 124, in act
    self.update_distribution(observations)
  File "/home/simon/Downloads/rsl_rl/rsl_rl/modules/actor_critic.py", line 121, in update_distribution
    self.distribution = Normal(mean, mean*0. + self.std)
  File "/home/simon/miniconda3/envs/rlgpu/lib/python3.8/site-packages/torch/distributions/normal.py", line 50, in __init__
    super(Normal, self).__init__(batch_shape, validate_args=validate_args)
  File "/home/simon/miniconda3/envs/rlgpu/lib/python3.8/site-packages/torch/distributions/distribution.py", line 55, in __init__
    raise ValueError(
ValueError: Expected parameter loc (Tensor of shape (6144, 18)) of distribution Normal(loc: torch.Size([6144, 18]), scale: torch.Size([6144, 18])) to satisfy the constraint Real(), but found invalid values:
tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]], device='cuda:0',
       grad_fn=<AddmmBackward0>)

vlapdecab commented 2 years ago

I am having the same issue; have you found a way to solve it?

Traceback (most recent call last):
  File "train.py", line 47, in <module>
    train(args)
  File "train.py", line 43, in train
    ppo_runner.learn(num_learning_iterations=train_cfg.runner.max_iterations, init_at_random_ep_len=True)
  File "/home/robin-lab/rsl_rl/rsl_rl/runners/on_policy_runner.py", line 107, in learn
    actions = self.alg.act(obs, critic_obs)
  File "/home/robin-lab/rsl_rl/rsl_rl/algorithms/ppo.py", line 94, in act
    self.transition.actions = self.actor_critic.act(obs).detach()
  File "/home/robin-lab/rsl_rl/rsl_rl/modules/actor_critic.py", line 125, in act
    self.update_distribution(observations)
  File "/home/robin-lab/rsl_rl/rsl_rl/modules/actor_critic.py", line 122, in update_distribution
    self.distribution = Normal(mean, mean*0. + self.std)
  File "/home/robin-lab/anaconda3/envs/isaac/lib/python3.8/site-packages/torch/distributions/normal.py", line 50, in __init__
    super(Normal, self).__init__(batch_shape, validate_args=validate_args)
  File "/home/robin-lab/anaconda3/envs/isaac/lib/python3.8/site-packages/torch/distributions/distribution.py", line 56, in __init__
    raise ValueError(
ValueError: Expected parameter loc (Tensor of shape (4096, 10)) of distribution Normal(loc: torch.Size([4096, 10]), scale: torch.Size([4096, 10])) to satisfy the constraint Real(), but found invalid values:
tensor([[ 0.0094,  0.0462,  0.1332,  ..., -0.0147, -0.0388, -0.1170],
        [-0.0048,  0.0742,  0.1166,  ...,  0.0325, -0.0363,  0.0715],
        [ 0.0712,  0.0424,  0.1967,  ..., -0.0338, -0.0136, -0.0345],
        ...,
        [-0.0758,  0.1650,  0.0851,  ..., -0.0418,  0.0612,  0.1154],
        [-0.0100,  0.0359,  0.1483,  ..., -0.1286,  0.0016,  0.0814],
        [ 0.0160, -0.0444,  0.2055,  ...,  0.0365, -0.0442, -0.1798]],
       device='cuda:0')

This is always caused by an agent whose observation values are all NaN: [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]. Any help would be much appreciated! Thanks!
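
As a stopgap while debugging, one could also force-reset any environment whose observations went non-finite before they poison the whole batch. A sketch only, assuming the environment exposes reset_idx(env_ids) and get_observations() the way LeggedRobot does (check your own env class):

import torch

# Sketch of a stopgap inside the rollout loop, not a fix for the root cause:
# reset environments whose observations contain NaN/Inf before the policy
# sees them. env.reset_idx / env.get_observations are assumed LeggedRobot-style.
def sanitize_observations(env, obs):
    bad_envs = (~torch.isfinite(obs)).any(dim=1).nonzero(as_tuple=False).flatten()
    if bad_envs.numel() > 0:
        print("resetting envs with non-finite obs:", bad_envs.tolist())
        env.reset_idx(bad_envs)
        obs = env.get_observations()
    return obs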

vlapdecab commented 2 years ago

For those of you still struggling with the NaN values, this is my solution.

I tested training the model with a single agent (instead of the default 4096) to see what was happening to my biped. After a few iterations it fell to the ground, but the episode did not terminate, and the robot basically kept trying to push itself back onto its feet with huge motor actions. This made the biped jump very high and induced the NaN values.

To find out why the episode did not terminate when the robot was on the ground, I ran the test_env.py file and realised that I had not specified all of the body parts whose contact should terminate the episode. I added them, and I also reduced max_angular_velocity and max_linear_velocity to 10. (instead of 1000.). Basically, all the changes are in the class asset of the your_robot_config.py file:

class asset( LeggedRobotCfg.asset ):
        file = '{LEGGED_GYM_ROOT_DIR}/resources/robots/your_robot/urdf/your_robot.urdf'
        name = "your_robot"
        foot_name = 'your_foot'
        terminate_after_contacts_on = ['torso','right_arm','etc'] # make sure to specify all the parts terminating the sim
        flip_visual_attachments = False
        self_collisions = 1 # 1 to disable, 0 to enable...bitwise filter

        max_angular_velocity = 10. # by default set to 1000. in legged_robot_config.py
        max_linear_velocity = 10.

Hope this helps!
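
One more knob that may help with the huge actions described above (the attribute names below follow the stock legged_robot_config.py, but double-check your own copy): the normalization section of the config clips observations and actions, and tightening clip_actions limits how hard a fallen robot can thrash. A sketch:

class normalization( LeggedRobotCfg.normalization ):
    clip_observations = 100.
    clip_actions = 10.  # tighter than the stock value to limit the huge actions after a fall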

kc-ustc commented 8 months ago

I also met this problem. I found that in XXXrunner.py the 'learn' function calls 'get_observation()' in legged_robot.py, but the returned obs may contain NaN in some envs (not all of them), which is strange! I set num_envs = 10 and the problem seemed to be solved, but training still breaks at around iteration 6000.
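
To narrow down where the NaNs first appear, one option is to check the buffers right after the physics step rather than only the final observation. A rough sketch, with the buffer names (dof_pos, dof_vel, root_states) assumed to follow the usual LeggedRobot convention:

import torch

# Rough sketch: call this right after env.step() to see which buffer goes
# non-finite first (joint state, root state, or the assembled observation).
# The buffer names are assumed LeggedRobot-style; adapt them to your env.
def report_non_finite(env, obs):
    for name, tensor in [("obs", obs),
                         ("dof_pos", env.dof_pos),
                         ("dof_vel", env.dof_vel),
                         ("root_states", env.root_states)]:
        bad = (~torch.isfinite(tensor)).any(dim=1).nonzero(as_tuple=False).flatten()
        if bad.numel() > 0:
            print(f"{name}: non-finite values in envs {bad.tolist()}")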

HaronW commented 4 months ago

I solved the same issue by checking my URDF file and tuning the settings in the <limit ... /> element inside <joint ...>.
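
For anyone who wants to do the same check programmatically, a small sketch that lists each joint's <limit> attributes from the URDF so missing or extreme effort/velocity limits stand out (the file path is a placeholder):

import xml.etree.ElementTree as ET

# Small sketch: print the <limit> attributes of every joint in a URDF so that
# missing or extreme effort/velocity limits are easy to spot.
# The path below is a placeholder; point it at your own URDF.
tree = ET.parse("resources/robots/your_robot/urdf/your_robot.urdf")
for joint in tree.getroot().iter("joint"):
    limit = joint.find("limit")
    attrs = limit.attrib if limit is not None else {}
    print(joint.get("name"), joint.get("type"), attrs)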