ikostrikov / pytorch-a3c

PyTorch implementation of Asynchronous Advantage Actor Critic (A3C) from "Asynchronous Methods for Deep Reinforcement Learning".
MIT License

I cannot train with your recent pytorch-a3c #36

Closed: aizawatkm closed this issue 7 years ago

aizawatkm commented 7 years ago

I am very interested in your pytorch-a3c because of its compactness and very simple structure. I tried to follow your excellent work, but I have not been able to run it successfully after struggling for more than a month. I would be very happy if you could give me any helpful information.

Following your readme, I installed the most recent PyTorch and cloned your pytorch-a3c (three times).

(1) "PongDeterministic-v4" was refused with next message. (anaconda-2.4.0) user-no-MacBook-Pro:iko3 user$ OMP_NUM_THREADS=1 python main.py --env-name "PongDeterministic-v4" --num-processes 16 [2017-10-07 19:57:18,923] Making new env: PongDeterministic-v4 Traceback (most recent call last):   File "main.py", line 53, in     env = create_atari_env(args.env_name)   File "/Users/user/iko3/envs.py", line 9, in create_atari_env     env = gym.make(env_id)   File "/Users/user/gym/gym/envs/registration.py", line 126, in make     return registry.make(id)   File "/Users/user/gym/gym/envs/registration.py", line 90, in make     spec = self.spec(id)   File "/Users/user/gym/gym/envs/registration.py", line 110, in spec     raise error.DeprecatedEnv('Env {} not found (valid versions include {})'.format(id, matching_envs)) gym.error.DeprecatedEnv: Env PongDeterministic-v4 not found (valid versions include ['PongDeterministic-v3', 'PongDeterministic-v0'])

(2) So I replaced "PongDeterministic-v4" with "PongDeterministic-v3". The program started, but the score stayed at -21 for more than a day.

(3) I printed out prob and action in test.py. prob changes only in its last digit, and action varies from 0 to 5. By the way, I cannot understand the length of prob; I thought Pong has 3 actions (left, stay, right). The output is below; the last line repeated for more than a day with the same score of -21. My machine is a Mac running 10.12.6 (Sierra) with a Core i7.

```
(Variable containing:
 0.1666  0.1667  0.1666  0.1666  0.1667  0.1667
[torch.FloatTensor of size 1x6]
, array([[4]]))
(Variable containing:
 0.1666  0.1667  0.1666  0.1666  0.1667  0.1667
[torch.FloatTensor of size 1x6]
, array([[5]]))
(Variable containing:
 0.1666  0.1667  0.1666  0.1666  0.1667  0.1667
[torch.FloatTensor of size 1x6]
, array([[5]]))
(Variable containing:
 0.1667  0.1667  0.1667  0.1666  0.1667  0.1667
[torch.FloatTensor of size 1x6]
, array([[1]]))
Time 00h 01m 08s, episode reward -21.0, episode length 764
```
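For reference, the length of prob is expected: gym's Atari Pong exposes six discrete actions (NOOP and FIRE variants included), not three. This is easy to confirm, assuming a gym version with the Atari environments installed:

```python
import gym

env = gym.make("PongDeterministic-v4")
print(env.action_space)
# Discrete(6)
print(env.unwrapped.get_action_meanings())
# ['NOOP', 'FIRE', 'RIGHT', 'LEFT', 'RIGHTFIRE', 'LEFTFIRE']
```

A near-uniform distribution over all six actions, as in the output above, is exactly what an untrained or stuck policy looks like.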

ikostrikov commented 7 years ago

gym.error.DeprecatedEnv: Env PongDeterministic-v4 not found (valid versions include ['PongDeterministic-v3', 'PongDeterministic-v0'])

You are using an old version of gym. Use PongDeterministic-v3, or update your gym.
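If upgrading gym is inconvenient, a version-tolerant fallback also works. A minimal sketch (the make_pong helper below is illustrative, not part of this repo):

```python
import gym
from gym import error

def make_pong():
    # Newer gym releases register the v4 ids; older ones raise
    # DeprecatedEnv (a subclass of error.Error) and only offer v0/v3.
    try:
        return gym.make("PongDeterministic-v4")
    except error.Error:
        return gym.make("PongDeterministic-v3")
```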

aizawatkm commented 7 years ago

Thank you very much for your prompt reply.

Following your comment, I updated gym and could run PongDeterministic-v4 with no error. But the reward stays at -21 with no change for more than 3 hours, and the prob values printed by test.py keep the same pattern. I also dumped the state from test.py and displayed it: the paddle and ball are moving.

If you have any ideas or checkpoints to improve this, please let me know. Thank you.
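For that kind of frame-dumping check, something like the sketch below works. It assumes the repo's create_atari_env wrapper, which returns preprocessed observations shaped [1, H, W]; the filenames are arbitrary:

```python
import matplotlib.pyplot as plt
from envs import create_atari_env  # wrapper from this repo

env = create_atari_env("PongDeterministic-v4")
state = env.reset()
for i in range(4):
    state, reward, done, _ = env.step(env.action_space.sample())
    # state is a [1, H, W] float array; drop the channel axis to save it.
    plt.imsave("frame_%d.png" % i, state[0], cmap="gray")
```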

ikostrikov commented 7 years ago

I changed one thing. Let me know if it works now.

jakezhaojb commented 7 years ago

Ilya, I have the same issue. Could you indicate the versions of the libraries?

ikostrikov commented 7 years ago

@jakezhaojb I've just tested with the latest versions. What problem do you have specifically?

ikostrikov commented 7 years ago

@aizawatkm it might not work because you are trying to run more processes than your CPU physically has cores. Try reducing the --num-processes parameter.
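A quick way to check how many workers the machine can keep busy:

```python
import multiprocessing

# cpu_count() reports logical CPUs; A3C workers are CPU-bound, so
# oversubscribing the cores mostly adds contention, not throughput.
print(multiprocessing.cpu_count())
```

Then launch with a matching value, e.g. `OMP_NUM_THREADS=1 python main.py --env-name "PongDeterministic-v4" --num-processes 4`.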

aizawatkm commented 7 years ago

Thank you very much for your kind comments.

I tried following your comments, and the results are as follows.

(1) Cloned the new version of pytorch-a3c. The output did not change: the prob values all stay in the 0.1666-0.1667 range for several hours.

(2) --num-processes 4: changed --num-processes from 16 to 4 in the launch command. The output is the same as in (1).

Thank you again for your assistance.

ikostrikov commented 7 years ago

@aizawatkm @jakezhaojb Do you use Python 2? It looks like it doesn't work with Python 2 for me either.
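One way to surface this failure mode early, as a hedged sketch (not something the repo ships), is a version guard at the top of main.py:

```python
import sys

# Fail fast on Python 2 rather than training silently with a broken policy.
if sys.version_info[0] < 3:
    raise RuntimeError("pytorch-a3c appears to require Python 3; "
                       "detected Python %s" % sys.version.split()[0])
```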

aizawatkm commented 7 years ago

Thank you very much for your comment. I'm sorry to say I was using Python 2.7.13.
I should have noticed it earlier from your GitHub comments. Trying the `future`-style compatibility imports had no effect, so I decided to use Python 3 (my first try with it) and finally succeeded in running your pytorch-a3c. However, the reward oscillates between -2 and -21 and training does not progress. Because my Mac (Core i7) has 4 cores, I used 4 for --num-processes.

```
(anaconda3-2.4.1) user-no-MacBook-Pro:iko4 user$ OMP_NUM_THREADS=1 python main.py --env-name "PongDeterministic-v4" --num-processes 4
[2017-10-12 20:15:14,179] Making new env: PongDeterministic-v4
[2017-10-12 20:15:14,421] Making new env: PongDeterministic-v4
[2017-10-12 20:15:14,423] Making new env: PongDeterministic-v4
[2017-10-12 20:15:14,425] Making new env: PongDeterministic-v4
[2017-10-12 20:15:14,428] Making new env: PongDeterministic-v4
[2017-10-12 20:15:14,442] Making new env: PongDeterministic-v4
Time 00h 00m 07s, episode reward -21.0, episode length 764
Time 00h 01m 08s, episode reward -2.0, episode length 100
Time 00h 02m 09s, episode reward -2.0, episode length 104
Time 00h 03m 10s, episode reward -2.0, episode length 100
Time 00h 04m 11s, episode reward -2.0, episode length 105
Time 00h 05m 19s, episode reward -21.0, episode length 825
Time 00h 06m 20s, episode reward -2.0, episode length 101
Time 00h 07m 21s, episode reward -2.0, episode length 111
Time 00h 08m 28s, episode reward -21.0, episode length 764
Time 00h 09m 35s, episode reward -21.0, episode length 764
Time 00h 10m 42s, episode reward -21.0, episode length 764
Time 00h 11m 49s, episode reward -21.0, episode length 764
Time 00h 12m 56s, episode reward -21.0, episode length 764
Time 00h 14m 08s, episode reward -21.0, episode length 1324
```

ikostrikov commented 7 years ago

With 4 cores it just takes much longer to train. The episode length is increasing, and for Pong that also means the agent is getting better.

aizawatkm commented 7 years ago

Your comment was exactly right.

I got a good result after a longer training time, as shown below. Issue resolved. Training is much faster than with a TensorFlow A3C implementation. Nice and great work. Thank you again for your very kind and patient assistance.

```
Time 02h 06m 39s, episode reward 20.0, episode length 1799
Time 02h 07m 54s, episode reward 21.0, episode length 1703
Time 02h 09m 08s, episode reward 21.0, episode length 1698
Time 02h 10m 23s, episode reward 21.0, episode length 1698
Time 02h 11m 38s, episode reward 20.0, episode length 1722
Time 02h 12m 53s, episode reward 20.0, episode length 1745
Time 02h 14m 08s, episode reward 20.0, episode length 1701
Time 02h 15m 28s, episode reward 21.0, episode length 1698
Time 02h 16m 45s, episode reward 21.0, episode length 1698
Time 02h 18m 03s, episode reward 21.0, episode length 1698
Time 02h 19m 22s, episode reward 21.0, episode length 1698
```

ikostrikov commented 7 years ago

I'm glad to help!

By the way, in my experience a2c is much easier to work with: https://github.com/ikostrikov/pytorch-a2c-ppo-acktr

Moreover, that repo supports not only Atari but also continuous-control environments: MuJoCo and PyBullet.