mit-acl / gym-collision-avoidance

MIT License

The PPOCADRLPolicy could not load the mfe_network? #16

Closed — zhangcj13 closed this issue 1 year ago

zhangcj13 commented 1 year ago

Hello, I want to use the PPOCADRLPolicy, but the baselines package does not include mfe_network. Is there another baselines package, or where can I find the training and test code? Thanks.

from gym_collision_avoidance.envs.policies.PPOCADRLPolicy import PPOCADRLPolicy
  File "/home/cjzhang/workspace/robot/YaoGuang/rl_collision_avoidance_syou/gym_collision_avoidance/envs/policies/PPOCADRLPolicy.py", line 6, in <module>
    from baselines.ppo2.mfe_network import mfe_network
ModuleNotFoundError: No module named 'baselines.ppo2.mfe_network'

mfe7 commented 1 year ago

We experimented internally with PPO but never released it, so that policy is just a remnant and is not supported.
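Since PPOCADRLPolicy depends on an unreleased baselines fork, one practical workaround is to guard its import so the rest of the package still loads and stick to the supported policies. The sketch below is not the maintainers' fix, just a minimal pattern; it assumes the module paths match your checkout (e.g. that GA3CCADRLPolicy lives under gym_collision_avoidance.envs.policies).

```python
# Minimal sketch (assumption: module paths match your checkout of the repo).
# PPOCADRLPolicy imports baselines.ppo2.mfe_network, which was never released,
# so guard that import and fall back to the supported GA3C-CADRL policy.
try:
    from gym_collision_avoidance.envs.policies.PPOCADRLPolicy import PPOCADRLPolicy
except ModuleNotFoundError:
    PPOCADRLPolicy = None  # unreleased dependency; skip this policy

from gym_collision_avoidance.envs.policies.GA3CCADRLPolicy import GA3CCADRLPolicy

# Hypothetical policy registry for your own experiment script
policy_classes = {"GA3C_CADRL": GA3CCADRLPolicy}
if PPOCADRLPolicy is not None:
    policy_classes["PPO_CADRL"] = PPOCADRLPolicy
```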

zhangcj13 commented 1 year ago

@mfe7 Thanks. I converted the GA3C code's TensorFlow model to PyTorch and then trained it with PPO. The RScore in GA3C reaches about 0.8, but the results over 500 test cases with [2, 3, 4] agents, shown in the figure, are not so good. Could you give some advice, or are there any tricks to improve the performance?

[figure: evaluation results for 2, 3, and 4 agents over 500 test cases]

[Time: 302609] [Episode: 1355433 Score: 1.0000] [RScore: 0.8417 RPPS: 862] [PPS: 719 TPS: 6] [NT: 2 NP: 2 NA: 32]
[Time: 302610] [Episode: 1355434 Score: 1.0000] [RScore: 0.8417 RPPS: 861] [PPS: 719 TPS: 6] [NT: 2 NP: 2 NA: 32]
[Time: 302610] [Episode: 1355435 Score: 0.8730] [RScore: 0.8415 RPPS: 859] [PPS: 719 TPS: 6] [NT: 2 NP: 2 NA: 32]
[Time: 302610] [Episode: 1355436 Score: 1.0000] [RScore: 0.8416 RPPS: 858] [PPS: 719 TPS: 6] [NT: 2 NP: 2 NA: 32]
[Time: 302610] [Episode: 1355437 Score: 1.0000] [RScore: 0.8416 RPPS: 858] [PPS: 719 TPS: 6] [NT: 2 NP: 2 NA: 32]
[Time: 302610] [Episode: 1355438 Score: 0.5833] [RScore: 0.8419 RPPS: 859] [PPS: 719 TPS: 6] [NT: 2 NP: 2 NA: 32]
[Time: 302610] [Episode: 1355439 Score: 1.0000] [RScore: 0.8423 RPPS: 858] [PPS: 719 TPS: 6] [NT: 2 NP: 2 NA: 32]

mfe7 commented 1 year ago

It's hard for me to know why PPO would give a worse policy than GA3C. If I were doing this again today, I would probably use a proper multi-agent RL algorithm for training. The way this code is set up, all the agents' experiences get thrown into the same batch, and since the reward function is written for a single agent without considering its effect on others, it's not clear that this is a good way to capture the desired joint behaviors.
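To make the concern concrete, here is a schematic sketch of the experience-pooling pattern being described. It is not the repo's actual training loop; the env/policy signatures (reset returning a list of per-agent observations, step taking a list of actions) are assumptions for illustration. The point is that every agent's transition lands in one shared batch, labeled only with that agent's own reward, so the update never sees the joint outcome explicitly.

```python
# Illustrative sketch only, not the repo's code: per-agent experiences
# pooled into a single batch for a single-agent policy-gradient update.
import random
from dataclasses import dataclass

@dataclass
class Transition:
    obs: list      # per-agent observation
    action: int    # per-agent action
    reward: float  # per-agent reward (ignores effect on other agents)

def collect_batch(env, policy, num_agents, horizon):
    batch = []
    obs = env.reset()                        # assumed: list of per-agent observations
    for _ in range(horizon):
        actions = [policy(o) for o in obs]   # same policy queried independently per agent
        next_obs, rewards, done, _ = env.step(actions)
        for i in range(num_agents):
            # all agents' experiences go into the same batch,
            # each labeled only with its own reward
            batch.append(Transition(obs[i], actions[i], rewards[i]))
        obs = env.reset() if done else next_obs
    random.shuffle(batch)                    # agent identity is lost before the update
    return batch
```

A multi-agent algorithm would instead condition the update on the joint state or joint reward rather than shuffling per-agent transitions together.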