mrwangyou / DBRL

A Gym Dogfighting Simulation Benchmark for Reinforcement Learning Research

a bug while training dogfight algorithm #1

Closed fj2250109 closed 1 year ago

fj2250109 commented 1 year ago

I followed the steps in the Tutorial to start my training. Initially everything seemed to be alright, but later the training process terminated and raised an error:

Traceback (most recent call last):

File "XXX\anaconda\envs\JSB_DF_GYM\share\JSBSim\DBRL-main\src\models\sac_jsbsim.py", line 49, in <module> 
    model.learn(total_timesteps=10000000, log_interval=1) 
File "XXX\anaconda\envs\jsb_df_gym\lib\site-packages\stable_baselines3\sac\sac.py", line 309, in learn
    return super().learn( 
File "XXX\anaconda\envs\jsb_df_gym\lib\site-packages\stable_baselines3\common\off_policy_algorithm.py", line 356, in learn
    rollout = self.collect_rollouts(  
File "XXX\anaconda\envs\jsb_df_gym\lib\site-packages\stable_baselines3\common\off_policy_algorithm.py", line 586, in collect_rollouts 
    actions, buffer_actions = self._sample_action(learning_starts, action_noise, env.num_envs)
File "XXX\anaconda\envs\jsb_df_gym\lib\site-packages\stable_baselines3\common\off_policy_algorithm.py", line 417, in _sample_action
    unscaled_action, _ = self.predict(self._last_obs, deterministic=False)
File "XXX\anaconda\envs\jsb_df_gym\lib\site-packages\stable_baselines3\common\base_class.py", line 632, in predict
    return self.policy.predict(observation, state, episode_start, deterministic)  
File "XXX\anaconda\envs\jsb_df_gym\lib\site-packages\stable_baselines3\common\policies.py", line 336, in predict
    actions = self._predict(observation, deterministic=deterministic) 
File "XXX\anaconda\envs\jsb_df_gym\lib\site-packages\stable_baselines3\sac\policies.py", line 356, in _predict
    return self.actor(observation, deterministic) 
File "XXX\anaconda\envs\jsb_df_gym\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs) 
File "XXX\anaconda\envs\jsb_df_gym\lib\site-packages\stable_baselines3\sac\policies.py", line 177, in forward
    return self.action_dist.actions_from_params(mean_actions, log_std, deterministic=deterministic, **kwargs) 
File "XXX\anaconda\envs\jsb_df_gym\lib\site-packages\stable_baselines3\common\distributions.py", line 179, in actions_from_params
    self.proba_distribution(mean_actions, log_std)
File "XXX\anaconda\envs\jsb_df_gym\lib\site-packages\stable_baselines3\common\distributions.py", line 211, in proba_distribution
    super().proba_distribution(mean_actions, log_std) 
File "XXX\anaconda\envs\jsb_df_gym\lib\site-packages\stable_baselines3\common\distributions.py", line 153, in proba_distribution
    self.distribution = Normal(mean_actions, action_std)  
File "XXX\anaconda\envs\jsb_df_gym\lib\site-packages\torch\distributions\normal.py", line 54, in __init__
    super(Normal, self).__init__(batch_shape, validate_args=validate_args)
File "XXX\anaconda\envs\jsb_df_gym\lib\site-packages\torch\distributions\distribution.py", line 55, in __init__
    raise ValueError(   
ValueError: Expected parameter loc (Tensor of shape (1, 4)) of distribution Normal(loc: torch.Size([1, 4]), scale: torch.Size([1, 4])) to satisfy the constraint Real(), but found invalid values:  
tensor([[nan, nan, nan, nan]], device='cuda:0')

This problem recurs consistently in every training session. What is the likely cause of this error, and how can I work around it?
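(A general note on errors like this: the `ValueError` above means the policy network produced NaN action means, which usually traces back to a non-finite observation or reward entering training earlier. One way to localize the first bad value is to validate every array before it reaches the model; stable-baselines3 also ships a `VecCheckNan` vectorized-env wrapper for exactly this. A minimal standalone sketch, with a hypothetical helper name `check_finite`:)

```python
import numpy as np

def check_finite(name, arr):
    """Raise immediately when a non-finite value appears, naming its source
    and location, so the failure points at the env step that produced it."""
    arr = np.asarray(arr, dtype=np.float64)
    if not np.all(np.isfinite(arr)):
        bad = np.argwhere(~np.isfinite(arr))
        raise ValueError(f"non-finite value in {name} at indices {bad.tolist()}")
    return arr

# Example: an observation containing an inf is caught at the source,
# instead of surfacing later as NaN action means inside the SAC actor.
obs = np.array([0.5, np.inf, -0.2, 1.0])
try:
    check_finite("observation", obs)
except ValueError as e:
    print(e)
```

Calling a check like this on each observation and reward inside the env's `step()` turns a delayed, opaque NaN crash in the policy into an immediate error at the step that produced the bad value.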

mrwangyou commented 1 year ago

Hi @fj2250109

Thank you for your interest in our work. I am currently testing the JSBSim environment and no error has occurred on my side, so I can't reproduce the problem to fix it. Could you please describe the training run in which the bug was raised in more detail? For example, how long does it take from the start of training until the error appears? A short screen recording would also help a lot.

mrwangyou commented 1 year ago

@fj2250109 I think this error may be caused by the battlefield restrictions. I have just updated the code to normalize the observation space and revise the termination condition. I trained the new version of SAC for about 20 hours (~6,000,000 timesteps) and no error occurred. Could you update the code and give it a try? Please let me know if you hit the same bug with the new version. Thank you for your feedback.
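(For readers hitting the same issue: the normalization fix described above can be sketched as follows. The bounds and the four-element observation layout here are illustrative only, not the actual DBRL observation space; the real bounds would come from the battlefield limits.)

```python
import numpy as np

# Illustrative bounds for a hypothetical 4-element observation
# (x position, y position, altitude, speed); not the actual DBRL values.
OBS_LOW = np.array([-10000.0, -10000.0, 0.0, -340.0])
OBS_HIGH = np.array([10000.0, 10000.0, 15000.0, 340.0])

def normalize_obs(obs):
    """Clip the raw observation to the battlefield bounds, then map it
    into [-1, 1], so an aircraft leaving the arena cannot feed huge
    magnitudes into the network and drive the action means to NaN."""
    obs = np.clip(np.asarray(obs, dtype=np.float64), OBS_LOW, OBS_HIGH)
    return 2.0 * (obs - OBS_LOW) / (OBS_HIGH - OBS_LOW) - 1.0
```

The clip step matters as much as the rescale: without it, an out-of-bounds state still produces values outside [-1, 1], and a single extreme observation can destabilize SAC's actor network.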