Hi,
I am having difficulties using PPO from Stable Baselines 3 on my custom environment.
First, I have checked my environment using check_env(env) and there are no problems reported by it.
I also used env = VecCheckNan(env, raise_exception=True) when moving from one environment to multiple environments.
However, the problem occurs consistently whether I use one environment or multiple environments.
The agent starts training and collecting samples, but as soon as the rollout starts (the batch is sent to the GPU) I get the error shown below.
What makes it even stranger is that if I use A2C, it runs without problems.
However, I need to use PPO as it performed very well on a very similar problem.
I should also mention that my action space and observation space are huge (size = 9000), but I ran a very similar problem (size = 3000) without any issues and PPO was able to solve it.
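For context, here is a minimal sketch of the setup described above (the environment class, space sizes, and hyperparameter values are placeholders for illustration, not my actual code):

from stable_baselines3 import PPO
from stable_baselines3.common.env_checker import check_env
from stable_baselines3.common.vec_env import DummyVecEnv, VecCheckNan

# MyCustomEnv stands in for my real environment, which has large
# continuous Box action and observation spaces (size ~9000).
env = MyCustomEnv()
check_env(env)  # reports no problems

# Vectorized setup, with NaN/inf checking on observations and rewards
vec_env = DummyVecEnv([lambda: MyCustomEnv()])
vec_env = VecCheckNan(vec_env, raise_exception=True)

model = PPO("MlpPolicy", vec_env, verbose=1)

RL_total_time_steps = 1_000_000  # placeholder value
log_name = "ppo_custom_env"      # placeholder value
train = True
if train:
    model.learn(total_timesteps=RL_total_time_steps, tb_log_name=log_name)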
FutureWarning: Non-finite norm encountered in torch.nn.utils.clip_grad_norm_; continuing anyway. Note that the default behavior will change in a future release to error out if a non-finite total norm is encountered. At that point, setting error_if_nonfinite=false will be required to retain the old behavior.
th.nn.utils.clip_grad_norm_(self.policy.parameters(), self.max_grad_norm)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_2761030/1396436599.py in <module>
1 # Train the model
2 if (train):
----> 3 model.learn(total_timesteps=RL_total_time_steps,tb_log_name=log_name)
~/.local/lib/python3.8/site-packages/stable_baselines3/ppo/ppo.py in learn(self, total_timesteps, callback, log_interval, eval_env, eval_freq, n_eval_episodes, tb_log_name, eval_log_path, reset_num_timesteps)
299 ) -> "PPO":
300
--> 301 return super(PPO, self).learn(
302 total_timesteps=total_timesteps,
303 callback=callback,
~/.local/lib/python3.8/site-packages/stable_baselines3/common/on_policy_algorithm.py in learn(self, total_timesteps, callback, log_interval, eval_env, eval_freq, n_eval_episodes, tb_log_name, eval_log_path, reset_num_timesteps)
255 self.logger.dump(step=self.num_timesteps)
256
--> 257 self.train()
258
259 callback.on_training_end()
~/.local/lib/python3.8/site-packages/stable_baselines3/ppo/ppo.py in train(self)
199 self.policy.reset_noise(self.batch_size)
200
--> 201 values, log_prob, entropy = self.policy.evaluate_actions(rollout_data.observations, actions)
202 values = values.flatten()
203 # Normalize advantage
~/.local/lib/python3.8/site-packages/stable_baselines3/common/policies.py in evaluate_actions(self, obs, actions)
660 """
661 latent_pi, latent_vf, latent_sde = self._get_latent(obs)
--> 662 distribution = self._get_action_dist_from_latent(latent_pi, latent_sde)
663 log_prob = distribution.log_prob(actions)
664 values = self.value_net(latent_vf)
~/.local/lib/python3.8/site-packages/stable_baselines3/common/policies.py in _get_action_dist_from_latent(self, latent_pi, latent_sde)
622
623 if isinstance(self.action_dist, DiagGaussianDistribution):
--> 624 return self.action_dist.proba_distribution(mean_actions, self.log_std)
625 elif isinstance(self.action_dist, CategoricalDistribution):
626 # Here mean_actions are the logits before the softmax
~/.local/lib/python3.8/site-packages/stable_baselines3/common/distributions.py in proba_distribution(self, mean_actions, log_std)
150 """
151 action_std = th.ones_like(mean_actions) * log_std.exp()
--> 152 self.distribution = Normal(mean_actions, action_std)
153 return self
154
/usr/lib/python3/dist-packages/torch/distributions/normal.py in __init__(self, loc, scale, validate_args)
48 else:
49 batch_shape = self.loc.size()
---> 50 super(Normal, self).__init__(batch_shape, validate_args=validate_args)
51
52 def expand(self, batch_shape, _instance=None):
/usr/lib/python3/dist-packages/torch/distributions/distribution.py in __init__(self, batch_shape, event_shape, validate_args)
51 continue # skip checking lazily-constructed args
52 if not constraint.check(getattr(self, param)).all():
---> 53 raise ValueError("The parameter {} has invalid values".format(param))
54 super(Distribution, self).__init__()
55
ValueError: The parameter loc has invalid values
System Info
Tensorflow version = 2.6.0
Keras version = 2.6.0
Stable Baselines 3 version = 1.2.0
OpenAI Gym version = 0.19.0
Torch version = 1.9.1
Python version = 3.8.10 (default, Sep 28 2021, 16:10:42)
[GCC 9.3.0]
NVIDIA GeForce RTX 3090