Get Steps 0 @ 0.0 SPS. Loss inf. Stats

jerrodparker20 / adaptive-transformers-in-rl

Adaptive Attention Span for Reinforcement Learning


Get Steps 0 @ 0.0 SPS. Loss inf. Stats #13

Closed by ghost 3 years ago

ghost commented 4 years ago

Hello, I tried your command (with num_actors reduced to 16 on my machine):

python train.py --total_steps 10000000 --learning_rate 0.0004 --unroll_length 239 --num_buffers 40 --n_layer 3 --d_inner 1024 --xpid row82 --chunk_size 80 --action_repeat 1 --num_actors 32 --num_learner_threads 1 --sleep_length 5 --atari True

But I got: Steps 0 @ 0.0 SPS. Loss inf. Stats

shaktikshri commented 4 years ago

Hi, you may find this answer useful.

ghost commented 4 years ago

My computer cannot create 32 actors, so I had to use 16. Also, I am using Windows 10, so I made these changes in your train.py at line 731:

    ctx = mp.get_context("spawn")  # was mp.get_context("fork")
    free_queue = ctx.Queue()  # was ctx.SimpleQueue()
    full_queue = ctx.Queue()  # was ctx.SimpleQueue()

That might be the reason for my errors.
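For readers comparing the two start methods, here is a minimal sketch of choosing one portably. The helper names `pick_start_method` and `make_queues` are hypothetical and do not come from train.py; only `free_queue` and `full_queue` mirror the names quoted above. It assumes nothing about the rest of the training script.

```python
import multiprocessing as mp


def pick_start_method(preferred="fork"):
    """Return `preferred` if this platform supports it, otherwise "spawn"."""
    # Windows does not offer "fork"; "spawn" is available on every platform.
    available = mp.get_all_start_methods()
    return preferred if preferred in available else "spawn"


def make_queues(method):
    """Create a context for `method` plus the two queues (hypothetical helper)."""
    ctx = mp.get_context(method)
    free_queue = ctx.Queue()
    full_queue = ctx.Queue()
    return ctx, free_queue, full_queue
```

Note that with "spawn" each child process re-imports the main module, so process creation has to sit under an `if __name__ == "__main__":` guard, and everything passed to a child must be picklable.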

ghost commented 4 years ago

Could you provide a simpler example such as CartPole or Pendulum? Thanks!

vitusya commented 4 years ago

I have the same issue

shaktikshri commented 4 years ago

@cubicgate Yes, you might be running into errors when the actors enqueue or when the learner dequeues. Remember that even with num_actors=1 there are three things running: one actor, one learner, and the main thread. Exceptions raised outside the main thread don't stall the main thread here, so you may want to add prints at appropriate places to check whether any of the actors or learners is hitting an exception. The stats won't be shown until one of your buffers has been filled with a trajectory and the learner has consumed it, so it may be worth waiting for some time (depending on your execution platform) before expecting any output.
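As an alternative to scattering prints, one possible sketch is to wrap each worker's target so that a failure is reported back through a queue. The names `report_errors` and `drain_errors` are hypothetical and are not part of this repository; the wrapper works with any queue object that offers `put` and `get_nowait`.

```python
import queue
import traceback


def report_errors(target, error_queue, *args):
    """Run target(*args); on failure, push the traceback text and re-raise."""
    try:
        target(*args)
    except Exception:
        # Send the formatted traceback to whoever is watching the queue.
        error_queue.put(traceback.format_exc())
        raise


def drain_errors(error_queue):
    """Return every traceback currently waiting in error_queue, without blocking."""
    errors = []
    while True:
        try:
            errors.append(error_queue.get_nowait())
        except queue.Empty:
            return errors
```

In a setup like the one described here, a worker could be started with `target=report_errors` and `args=(worker_fn, error_queue, ...)`, where `worker_fn` stands in for the real actor function and `error_queue` comes from the same multiprocessing context. The main loop could then call `drain_errors(error_queue)` next to its periodic stats print.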

vitusya commented 4 years ago

ctx = mp.get_context("spawn") # ctx = mp.get_context("fork")

resolved my problem. I'm on Ubuntu.

vitusya commented 4 years ago

I'm not sure you will be able to run Gym Atari or DMLab on Windows without issues.

kimbring2 commented 3 years ago

I had the same problem with Python 3.6. After switching to Python 3.7, the issue disappeared. It seems that TorchBeast only works with Python 3.7.
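If the interpreter version is the suspect, a quick check at the top of a script can rule it out. This is only a sketch: the helper name `python_matches` is hypothetical, and the 3.7 default reflects the observation above rather than a documented requirement.

```python
import sys


def python_matches(required=(3, 7)):
    """Return True if the running interpreter's major.minor equals `required`."""
    return tuple(sys.version_info[:2]) == tuple(required)


if not python_matches():
    # Warn rather than exit, since the 3.7 requirement is only an assumption.
    print("Warning: running Python %d.%d, expected 3.7" % tuple(sys.version_info[:2]))
```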