hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

PPO ValueError: The parameter loc has invalid values #1143

Closed. olyanos closed this issue 2 years ago

olyanos commented 2 years ago

Hi, I am having difficulties using PPO from Stable Baselines 3 on my custom environment. First, I checked my environment with check_env(env) and it reported no problems. I also used env = VecCheckNan(env, raise_exception=True) when moving from one environment to multiple environments. However, the problem occurs consistently whether I use one environment or multiple environments. The agent starts training and collecting samples, but as soon as the rollout phase starts (the batch is sent to the GPU) I get the error shown below. What makes it even stranger is that if I use A2C, it runs without problems. However, I need to use PPO, as it performed very well on a very similar problem. I should also mention that my action space and observation space are huge (size = 9000), but I ran a very similar problem (size = 3000) without any issues and PPO was able to solve it.
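
For reference, here is a minimal sketch of the checks described above (MyCustomEnv and n_envs stand in for my actual environment class and number of parallel environments):

from stable_baselines3.common.env_checker import check_env
from stable_baselines3.common.vec_env import DummyVecEnv, VecCheckNan

# Single-environment sanity check: reports no problems
check_env(MyCustomEnv())

# Vectorized setup with the NaN/Inf guard, raising as soon as an invalid value appears
venv = VecCheckNan(DummyVecEnv([lambda: MyCustomEnv() for _ in range(n_envs)]), raise_exception=True)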

 FutureWarning: Non-finite norm encountered in torch.nn.utils.clip_grad_norm_; continuing anyway. Note that the default behavior will change in a future release to error out if a non-finite total norm is encountered. At that point, setting error_if_nonfinite=false will be required to retain the old behavior.
  th.nn.utils.clip_grad_norm_(self.policy.parameters(), self.max_grad_norm)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_2761030/1396436599.py in <module>
      1 # Train the model
      2 if (train):
----> 3     model.learn(total_timesteps=RL_total_time_steps,tb_log_name=log_name)

~/.local/lib/python3.8/site-packages/stable_baselines3/ppo/ppo.py in learn(self, total_timesteps, callback, log_interval, eval_env, eval_freq, n_eval_episodes, tb_log_name, eval_log_path, reset_num_timesteps)
    299     ) -> "PPO":
    300 
--> 301         return super(PPO, self).learn(
    302             total_timesteps=total_timesteps,
    303             callback=callback,

~/.local/lib/python3.8/site-packages/stable_baselines3/common/on_policy_algorithm.py in learn(self, total_timesteps, callback, log_interval, eval_env, eval_freq, n_eval_episodes, tb_log_name, eval_log_path, reset_num_timesteps)
    255                 self.logger.dump(step=self.num_timesteps)
    256 
--> 257             self.train()
    258 
    259         callback.on_training_end()

~/.local/lib/python3.8/site-packages/stable_baselines3/ppo/ppo.py in train(self)
    199                     self.policy.reset_noise(self.batch_size)
    200 
--> 201                 values, log_prob, entropy = self.policy.evaluate_actions(rollout_data.observations, actions)
    202                 values = values.flatten()
    203                 # Normalize advantage

~/.local/lib/python3.8/site-packages/stable_baselines3/common/policies.py in evaluate_actions(self, obs, actions)
    660         """
    661         latent_pi, latent_vf, latent_sde = self._get_latent(obs)
--> 662         distribution = self._get_action_dist_from_latent(latent_pi, latent_sde)
    663         log_prob = distribution.log_prob(actions)
    664         values = self.value_net(latent_vf)

~/.local/lib/python3.8/site-packages/stable_baselines3/common/policies.py in _get_action_dist_from_latent(self, latent_pi, latent_sde)
    622 
    623         if isinstance(self.action_dist, DiagGaussianDistribution):
--> 624             return self.action_dist.proba_distribution(mean_actions, self.log_std)
    625         elif isinstance(self.action_dist, CategoricalDistribution):
    626             # Here mean_actions are the logits before the softmax

~/.local/lib/python3.8/site-packages/stable_baselines3/common/distributions.py in proba_distribution(self, mean_actions, log_std)
    150         """
    151         action_std = th.ones_like(mean_actions) * log_std.exp()
--> 152         self.distribution = Normal(mean_actions, action_std)
    153         return self
    154 

/usr/lib/python3/dist-packages/torch/distributions/normal.py in __init__(self, loc, scale, validate_args)
     48         else:
     49             batch_shape = self.loc.size()
---> 50         super(Normal, self).__init__(batch_shape, validate_args=validate_args)
     51 
     52     def expand(self, batch_shape, _instance=None):

/usr/lib/python3/dist-packages/torch/distributions/distribution.py in __init__(self, batch_shape, event_shape, validate_args)
     51                     continue  # skip checking lazily-constructed args
     52                 if not constraint.check(getattr(self, param)).all():
---> 53                     raise ValueError("The parameter {} has invalid values".format(param))
     54         super(Distribution, self).__init__()
     55 

ValueError: The parameter loc has invalid values
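
For reference, the validation that raises this error rejects NaN entries in loc, so the same ValueError can be reproduced in isolation (a minimal sketch using only PyTorch; the values are made up):

import torch as th
from torch.distributions import Normal

# A single NaN in the policy's mean actions is enough to trip the validation
mean_actions = th.tensor([0.0, float("nan"), 1.0])
action_std = th.ones_like(mean_actions)
Normal(mean_actions, action_std, validate_args=True)  # ValueError: The parameter loc has invalid values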

Code

import gym
from stable_baselines3.common.env_checker import check_env
# RL_Algorithm is PPO here (the same script runs fine with A2C)

env = gym.make(...)
check_env(env)
policy_kwargs = dict(net_arch=[dict(pi=[128, 128], vf=[128, 128])])
model = RL_Algorithm("MlpPolicy", env, verbose=1, gamma=RL_gamma, tensorboard_log="./logs/",
                     policy_kwargs=policy_kwargs, n_steps=n_steps, batch_size=batch_size)
model.learn(total_timesteps=RL_total_time_steps, tb_log_name=log_name)

System Info
Tensorflow version = 2.6.0
Keras version = 2.6.0
Stable-Baselines3 version = 1.2.0
OpenAI Gym version = 0.19.0
Torch version = 1.9.1
Python version = 3.8.10 (default, Sep 28 2021, 16:10:42) [GCC 9.3.0]
GPU = NVIDIA GeForce RTX 3090

araffin commented 2 years ago

Duplicate of https://github.com/DLR-RM/rl-baselines3-zoo/issues/156