hanruihua / rl_rvo_nav

The source code of the [RA-L] paper "Reinforcement Learned Distributed Multi-Robot Navigation with Reciprocal Velocity Obstacle Shaped Rewards"

The success rate is 0.00% after the second stage #6

Closed sundyCoder closed 2 years ago

sundyCoder commented 2 years ago

Hi,

I followed the experiment in the README.md, and the first stage works normally. However, after training in the circle scenario with 10 robots (python train_process_s2.py), the success rate is 0.00%. The experimental log is shown below:

```
....
time cost in one epoch 11.53249478340149 estimated remain time 0.009610412319501242 hours current epoch 1998
The reward in this epoch:
min  [-81.33, -94.36, -91.05, -72.8, -132.44, -130.41, -156.77, -158.99, -124.27, -80.02]
mean [-40.36, -56.27, -42.8, -50.67, -109.06, -70.3, -99.54, -83.21, -55.97, -49.13]
max  [-10.39, -35.73, -0.65, -26.63, -84.51, -13.68, -42.53, -23.88, -0.79, -28.84]
Early stopping at step 0 due to reaching max kl.   (repeated 9 times)
time cost in one epoch 11.119003295898438 estimated remain time 0.00617722405327691 hours current epoch 1999
The reward in this epoch:
min  [-70.02, -77.01, -80.48, -62.08, -86.05, -113.78, -55.8, -77.18, -93.87, -111.82]
mean [-50.08, -50.29, -50.64, -44.62, -43.91, -60.43, -39.81, -40.26, -59.32, -59.31]
max  [-31.91, -34.94, -24.68, -0.93, -0.76, -25.75, -17.74, -19.18, -37.56, -29.54]
Early stopping at step 0 due to reaching max kl.   (repeated 9 times)
time cost in one epoch 11.041501760482788 estimated remain time 0.00306708382235633 hours current epoch 2000
The reward in this epoch:
min  [-67.57, -85.75, -105.89, -89.92, -103.05, -179.82, -113.64, -159.73, -124.8, -111.59]
mean [-48.79, -51.79, -55.82, -50.35, -54.2, -71.38, -42.5, -52.12, -67.09, -56.98]
max  [-30.68, -27.21, -35.23, -26.13, -17.7, -0.68, -0.56, -0.53, -29.92, -29.35]
Policy Test Start !
Early stopping at step 0 due to reaching max kl.   (repeated 9 times)
time cost in one epoch 32.75892734527588 estimated remain time 0.0 hours
policy_name: r10_0_2000
successful rate: 0.00%
average EpLen: 0  std length 0
average speed: 0.96  std speed 0.05
```
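For reference, the repeated "Early stopping at step 0 due to reaching max kl." lines look like the standard KL-based early-stopping check used in Spinning Up-style PPO updates. Here is a minimal sketch of that check (the function and argument names are illustrative, not this repo's exact code); hitting it at step 0 means the policy never actually gets updated in that epoch:

```python
def update_policy(compute_loss_pi, pi_optimizer, data,
                  train_pi_iters=80, target_kl=0.01):
    """Spinning Up-style PPO policy update with KL-based early stopping.

    compute_loss_pi(data) is assumed to return (loss, info), where
    info['kl'] is the mean approximate KL between the data-collecting
    policy and the current policy.
    """
    for i in range(train_pi_iters):
        pi_optimizer.zero_grad()
        loss_pi, pi_info = compute_loss_pi(data)
        if pi_info['kl'] > 1.5 * target_kl:
            # Stopping at i == 0 means the current policy already diverges
            # too far from the policy that collected the data, before any
            # gradient step is taken, so no update happens this epoch.
            print(f'Early stopping at step {i} due to reaching max kl.')
            break
        loss_pi.backward()
        pi_optimizer.step()
```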

hanruihua commented 2 years ago

Hi, 2000 epochs is the maximum training length for the policy. However, the best result may occur around epoch 1200~1500. The success rate of the model is tested and recorded every 50 epochs by default, so you can select the best intermediate model as the final result.
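For example, a rough way to pick the best intermediate checkpoint is to scan the training log for the recorded test results and take the policy name with the highest success rate. The log file name and the line format below are assumptions based on the output you posted, so adjust them to your setup:

```python
import re
from pathlib import Path

# Sketch: find the checkpoint with the highest recorded test success rate.
# 'train_s2.log' is a placeholder for wherever your console output was saved.
log_text = Path('train_s2.log').read_text()

# Lines of the form: "policy_name: r10_0_2000 successful rate: 0.00%"
pattern = re.compile(r'policy_name:\s*(\S+)\s+successful rate:\s*([\d.]+)%')
results = [(name, float(rate)) for name, rate in pattern.findall(log_text)]

best_name, best_rate = max(results, key=lambda item: item[1])
print(f'Best checkpoint: {best_name} ({best_rate:.2f}% success)')
```

Then evaluate that saved policy (e.g. something like r10_0_1250 if epoch 1250 scored best) instead of the final r10_0_2000 model.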

In addition, the reward_parameter argument also has a significant influence on the training results. You can try different values to improve the performance.
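To illustrate what such weights control (the names and values below are placeholders for illustration only, not the repo's actual reward_parameter tuple; please check the argparse defaults in the training scripts for the real ones):

```python
# Placeholder illustration of an RVO-shaped reward with tunable weights;
# these names and numbers are NOT the repo's actual reward_parameter values.
def shaped_reward(reached_goal, collided, dist_decrease, rvo_penalty,
                  w_arrive=15.0, w_collision=20.0, w_progress=0.3, w_rvo=1.0):
    """Combine sparse terminal terms with dense shaping terms.

    dist_decrease: how much the distance to the goal shrank this step.
    rvo_penalty:   penalty for violating the reciprocal velocity obstacle
                   constraint (e.g. based on the expected collision time).
    """
    reward = w_progress * dist_decrease - w_rvo * rvo_penalty
    if reached_goal:
        reward += w_arrive
    if collided:
        reward -= w_collision
    return reward
```

What matters for training is the relative magnitude of these terms: for example, if the collision and RVO penalties dominate everything else, the policies can learn to stand still rather than reach their goals.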