Closed KaixiangLin closed 7 years ago
I've seen this too, but can't reproduce it reliably.
If you restart that one worker, it should join the cluster and start working.
The interface for initializing variables changed a lot between tf 0.11 and 0.12. What's the exact version you're using? (Output of pip show tensorflow)
Another strange thing: this line
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:197] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:12223, 1 -> 127.0.0.1:12224, 2 -> 127.0.0.1:12225, 3 -> 127.0.0.1:12226, 4 -> 127.0.0.1:12227, 5 -> 127.0.0.1:12228, 6 -> 127.0.0.1:12229, 7 -> 127.0.0.1:12230, 8 -> 127.0.0.1:12231, 9 -> 127.0.0.1:12232, 10 -> 127.0.0.1:12233, 11 -> localhost:12234, 12 -> 127.0.0.1:12235, 13 -> 127.0.0.1:12236, 14 -> 127.0.0.1:12237, 15 -> 127.0.0.1:12238, 16 -> 127.0.0.1:12239, 17 -> 127.0.0.1:12240, 18 -> 127.0.0.1:12241, 19 -> 127.0.0.1:12242, 20 -> 127.0.0.1:12243, 21 -> 127.0.0.1:12244, 22 -> 127.0.0.1:12245, 23 -> 127.0.0.1:12246, 24 -> 127.0.0.1:12247, 25 -> 127.0.0.1:12248, 26 -> 127.0.0.1:12249, 27 -> 127.0.0.1:12250, 28 -> 127.0.0.1:12251, 29 -> 127.0.0.1:12252, 30 -> 127.0.0.1:12253, 31 -> 127.0.0.1:12254}
looks like you have 32 workers, but the train command only asks for 12.
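For anyone wanting to check the worker count from their own log, here is a minimal sketch that counts the tasks listed in a GrpcChannelCache line like the one above. The function name parse_channel_cache and the shortened sample line are made up for illustration; the regular expression only assumes the "for job NAME -> {index -> host:port, ...}" layout visible in the log.

```python
import re


def parse_channel_cache(line):
    """Return (job_name, {task_index: address}) from a GrpcChannelCache log line.

    Hypothetical helper: returns None if the line does not match the layout
    "Initialize GrpcChannelCache for job NAME -> {0 -> host:port, ...}".
    """
    match = re.search(r"GrpcChannelCache for job (\S+) -> \{(.*)\}", line)
    if match is None:
        return None
    job, body = match.groups()
    tasks = {}
    for entry in body.split(","):
        if not entry.strip():
            continue  # tolerate an empty task list
        index, address = entry.split("->")
        tasks[int(index)] = address.strip()
    return job, tasks


# Shortened sample line (three tasks instead of the 32 in the real log).
line = ("Initialize GrpcChannelCache for job worker -> "
        "{0 -> 127.0.0.1:12223, 1 -> 127.0.0.1:12224, 2 -> localhost:12225}")
job, tasks = parse_channel_cache(line)
print(job, len(tasks))
```

Comparing len(tasks) with the --num-workers value passed to train.py shows quickly whether the cluster spec and the command line agree.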
I use 0.11. Sorry for the typo, it should be 32. I have run with 12, 16, and 32 workers and I cannot reproduce this reliably either: in some runs all threads work well, in others some do not.
Name: tensorflow
Version: 0.11.0
Summary: TensorFlow helps the tensors flow
Home-page: http://tensorflow.org/
Author: Google Inc.
Author-email: opensource@google.com
License: Apache 2.0
Location: /home/linkaixi/anaconda2/envs/tensorflow11/lib/python2.7/site-packages
Requires: protobuf, numpy, mock, wheel, six
Any idea how to fix this? Shouldn't the non-chief workers wait to run until the chief worker finishes the initialization? According to their documentation, which I quote here:
In the other tasks sv.managed_session() waits for the Model to have been initialized before returning a session to the training code. The non-chief tasks depend on the chief task for initializing the model.
Thanks a lot!
It should definitely work. I don't understand why it sometimes doesn't.
One possible cause: make sure that the parameter server process isn't still running from the last run. Due to something inside TensorFlow, ^C, kill <pid>, or kill -TERM <pid> don't kill the parameter server. You have to run kill -9 <pid> to kill it.
That sounds like your environment is getting into a bad state somehow. It'd be useful to see the logs from /tmp/universe-*.log, as well as the docker logs from the environment and stdout/stderr from your run. The logfiles generally have all the data to explain why something weird is happening.
On Sun, Jan 8, 2017 at 10:21 AM, Yigit notifications@github.com wrote:
This is something I suffer from a lot. I am using a 16-CPU EC2 instance and running 8 workers on it. Eventually my rewards go to zero, and when I connect with VNC I see that the worker takes no actions. After a while, I get this exact error. I don't have any idea what may cause it.
This:
FailedPreconditionError: Attempting to use uninitialized value local/value/b
[[Node: local/value/b/read = Identity[T=DT_FLOAT, _class=["loc:@local/value/b"], _device="/job:worker/replica:0/task:11/cpu:0"](local/value/b)]]
and also this:
InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [6] rhs shape= [8]
[[Node: global/action/b/Adam/Assign = Assign[T=DT_FLOAT, _class=["loc:@global/action/b"], use_locking=true, validate_shape=true, _device="/job:ps/replica:0/task:0/cpu:0"](global/action/b/Adam, zeros_22)]]
can be caused by leaving a parameter server running from a previous experiment. The parameter server is one of the processes started by train.py, and has --job-name ps in its arguments.
For some reason, TensorFlow in parameter server mode traps signals, so kill <pid> and tmux kill-session don't kill it. You have to use kill -9 <pid>.
I'll figure out a better solution, but for now try ps auxww | grep python and kill -9 any ps jobs, which look like this:
tlb 75227 0.0 0.8 557340092 127644 ?? S 12:54PM 0:04.89 /Users/tlb/anaconda3/envs/tf12/bin/python worker.py --log-dir /tmp/neonrace3 --env-id flashgames.NeonRace-v0 --num-workers 4 --job-name ps
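As a rough sketch of that manual cleanup, the snippet below picks out PIDs from ps auxww output whose command line contains worker.py and --job-name ps, then sends SIGKILL (the equivalent of kill -9). The function names find_ps_job_pids and kill_stale_ps_jobs are made up for illustration, and the code assumes the PID is the second column, as in the sample line above. It is Unix-only and kills without asking, so check the list of PIDs before calling the second function.

```python
import os
import signal
import subprocess


def find_ps_job_pids(ps_output):
    """Return PIDs of processes that look like 'worker.py ... --job-name ps'.

    Assumes `ps auxww` layout, where the second column is the PID.
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) < 2 or not fields[1].isdigit():
            continue  # skips the header line and anything malformed
        if "worker.py" not in fields or "--job-name" not in fields:
            continue
        flag_index = fields.index("--job-name")
        if flag_index + 1 < len(fields) and fields[flag_index + 1] == "ps":
            pids.append(int(fields[1]))
    return pids


def kill_stale_ps_jobs():
    """Run `ps auxww` and SIGKILL every stale parameter server found."""
    output = subprocess.check_output(["ps", "auxww"]).decode("utf-8", "replace")
    for pid in find_ps_job_pids(output):
        os.kill(pid, signal.SIGKILL)  # same effect as `kill -9 <pid>`
```

Calling kill_stale_ps_jobs() before starting a new run would clear leftover parameter servers. It does not touch worker jobs, whose --job-name value is not ps.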
I think the cause may not be just a leftover ps server. I kill all ps processes before I run, and I check the output of ps -ef before each run. It looks like this:
$ ps -ef |grep python | grep linkaixi
linkaixi 13735 27603 0 10:40 pts/20 00:00:00 grep --color=auto python
linkaixi 29392 4232 0 Jan07 ? 00:00:05 /usr/bin/python3 /usr/bin/update-manager --no-update --no-focus-on-map
The universe log for a dead thread looks like this:
[2017-01-09 10:53:47,810] Making new env: PongDeterministic-v3
[2017-01-09 10:53:50,394] Events directory: /home/linkaixi/20170109_10-53/train_15
[2017-01-09 10:53:50,456] Starting session. If this hangs, we're mostly likely waiting to connect to the parameter server. One common cause is that the parameter server DNS name isn't resolving yet, or is misspecified.
[2017-01-09 10:53:51,872] Resetting environment
[2017-01-09 10:53:51,874] Starting training at step=0
and like this for a normal one:
[2017-01-09 10:53:47,882] Making new env: PongDeterministic-v3
[2017-01-09 10:53:50,201] Events directory: /home/linkaixi/20170109_10-53/train_3
[2017-01-09 10:53:50,331] Starting session. If this hangs, we're mostly likely waiting to connect to the parameter server. One common cause is that the parameter server DNS name isn't resolving yet, or is misspecified.
[2017-01-09 10:53:52,046] Starting training at step=0
[2017-01-09 10:53:52,048] Resetting environment
[2017-01-09 10:54:03,259] Episode terminating: episode_reward=-20.0 episode_length=903
[2017-01-09 10:54:03,287] Resetting environment
[2017-01-09 10:54:15,069] Episode terminating: episode_reward=-21.0 episode_length=916
[2017-01-09 10:54:15,123] Resetting environment
[2017-01-09 10:54:26,735] Episode terminating: episode_reward=-21.0 episode_length=908
[2017-01-09 10:54:26,816] Resetting environment
[2017-01-09 10:54:40,226] Episode terminating: episode_reward=-19.0 episode_length=1043
[2017-01-09 10:54:40,277] Resetting environment
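With 32 workers it is tedious to read every log by hand. The following is only a heuristic sketch based on the two logs above: a worker is flagged as stalled if its log contains "Starting training at step=" but never an "Episode terminating:" line. The function name looks_stalled is made up, and a worker that is merely slow to finish its first episode would be flagged too.

```python
def looks_stalled(log_text):
    """Heuristic: training started, but no episode ever terminated."""
    started = "Starting training at step=" in log_text
    finished_episode = "Episode terminating:" in log_text
    return started and not finished_episode


# Excerpts modeled on the dead and normal logs shown above.
dead_log = """\
[2017-01-09 10:53:51,872] Resetting environment
[2017-01-09 10:53:51,874] Starting training at step=0
"""
normal_log = """\
[2017-01-09 10:53:52,046] Starting training at step=0
[2017-01-09 10:53:52,048] Resetting environment
[2017-01-09 10:54:03,259] Episode terminating: episode_reward=-20.0 episode_length=903
"""

print(looks_stalled(dead_log), looks_stalled(normal_log))
```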
@gdb I have logs of a similar failure in /mnt/kube-efs/universe-perfmon/usa-flashgames.NeonRace-v0-20170109-173626.
This was from running python train.py --num-workers 8 --env-id flashgames.NeonRace-v0 --log-dir /mnt/kube/universe-perfmon/usa-flashgames.NeonRace-v0-20170109-173626 -m child inside a container on tlb-0.devbox.sci.
2/8 workers died:
a3c.w-0.out and tmpcopy/universe-5214.log
a3c.w-3.out and tmpcopy/universe-5221.log
In worker 0, it seems like the remote never started episode 22. The remote reports:
universe-voecMj-0 [2017-01-09 19:00:00,166] [INFO:root] [EnvStatus] Changing env_state: resetting (env_id=flashgames.NeonRace-v0) -> running (env_id=flashgames.NeonRace-v0) (episode_id: 22->22, fps=60)
universe-voecMj-0 [nginx] 2017/01/09 19:00:00 [info] 62#62: *374 client timed out (110: Connection timed out) while waiting for request, client: 127.0.0.1, server: 0.0.0.0:80
universe-voecMj-0 [nginx] 2017/01/09 19:00:00 [info] 62#62: *375 client timed out (110: Connection timed out) while waiting for request, client: 127.0.0.1, server: 0.0.0.0:80
universe-voecMj-0 [nginx] 2017/01/09 19:00:00 [info] 62#62: *376 client timed out (110: Connection timed out) while waiting for request, client: 127.0.0.1, server: 0.0.0.0:80
But the last state change the agent saw was:
[2017-01-09 18:58:57,558] [0:localhost:5900] RewardBuffer: Creating new RewardState for episode_id=22
[2017-01-09 18:58:57,579] [0:localhost:5900] Received v0.env.describe: env_id=flashgames.NeonRace-v0 env_state=resetting episode_id=22
[2017-01-09 18:58:57,580] [0:localhost:5900] RewardBuffer changing env_state: running (env_id=flashgames.NeonRace-v0) -> resetting (env_id=flashgames.NeonRace-v0) (episode_id: 21->22, fps=60, masked=False, current_episode_id=21)
[2017-01-09 18:58:57,818] [0:localhost:5900] RewardState: popping reward 0.0 from episode_id 21
[2017-01-09 18:58:57,818] [0:localhost:5900] RewardBuffer advancing: has data for next episode: 21->22
[2017-01-09 18:58:57,818] [0:localhost:5900] Episode ended: episode_id=21->22 env_state=resetting
after which it continued getting VNC updates but no rewarder updates.
looks like each time the uninitialized variables are local variables. In a3c.py, why do we need two separate tf.device blocks?
worker_device = "/job:worker/task:{}/cpu:0".format(task)
with tf.device(tf.train.replica_device_setter(1, worker_device=worker_device)):
    with tf.variable_scope("global"):
        self.network = LSTMPolicy(env.observation_space.shape, env.action_space.n)
        self.global_step = tf.get_variable("global_step", [], tf.int32, initializer=tf.zeros_initializer,
                                           trainable=False)
with tf.device(worker_device):
    with tf.variable_scope("local"):
        self.local_network = pi = LSTMPolicy(env.observation_space.shape, env.action_space.n)
        pi.global_step = self.global_step
I think deleting the second one may help work around this bug, since we copy all variables from ps to workers.
worker_device = "/job:worker/task:{}/cpu:0".format(task)
with tf.device(tf.train.replica_device_setter(1, worker_device=worker_device)):
with tf.variable_scope("global"):
self.network = LSTMPolicy(env.observation_space.shape, env.action_space.n)
self.global_step = tf.get_variable("global_step", [], tf.int32, initializer=tf.zeros_initializer,
trainable=False)
with tf.variable_scope("local"):
self.local_network = pi = LSTMPolicy(env.observation_space.shape, env.action_space.n)
pi.global_step = self.global_step
This is based on the TensorFlow documentation. Please let me know if I have misunderstood it. @tlbtlbtlb Thanks.
Each client builds a similar graph containing the parameters (pinned to /job:ps as before using tf.train.replica_device_setter() to map them deterministically to the same tasks); and a single copy of the compute-intensive part of the model, pinned to the local task in /job:worker
@rafaljozefowicz @yaroslavvb Opinions?
We want two tf.device blocks because we want rollouts to be done inside the worker processes and not inside the PS process. That could be fine with a handful of workers, but when running with 8 or 16 you'll see that the whole system is slower than it has to be, especially if the workers run on separate machines.
I just tried 8 workers on one machine with a 6-core CPU. Performance looks fine, but it took longer to converge. It should be solvable in 30 minutes, but it took around 50 minutes without the second device.
"looks like each time the uninitialized variables are local variables" <-- aha, that's the important bit
Note that in worker.py, Supervisor is initialized with the following ready_op
variables_to_save = [v for v in tf.all_variables() if not v.name.startswith("local")]
Supervisor(...., ready_op=tf.report_uninitialized_variables(variables_to_save))
So the supervisor has init_fn, which should initialize local variables, and ready_op, which doesn't check whether local variables are initialized. So I suspect your training loop starts before the init_fn completes. You can check whether there's an "Initializing all parameters." message on your console, which would confirm that initialization started (but not that it finished).
@rafaljozefowicz any reason not to use the default ready_op (tf.report_uninitialized_variables()) in the Supervisor init?
I can reproduce the failure reliably by adding a delay to the trainer main loop in worker.py
while not sv.should_stop() and (not num_global_steps or global_step < num_global_steps):
    time.sleep(10)
    trainer.process(sess)
What's happening here is a race between the Runner thread, which runs policy.act and hence needs local variables to be initialized, and the Main thread, which runs process, which copies global variables to local variables.
Notice the relevant code
trainer.start(sess, summary_writer)
global_step = sess.run(trainer.global_step)
logger.info("Starting training at step=%d", global_step)
while not sv.should_stop() and (not num_global_steps or global_step < num_global_steps):
    trainer.process(sess)
The first line launches a new Python thread, which starts running the policy from local variables. The last line runs the training operation, which copies global variables to local variables. Most of the time you are lucky, and the trainer.process line gets evaluated before Python is able to spin up the new thread. Until that line runs, local variables are not initialized.
The role of init_fn is to initialize local variables explicitly, but that function only gets called when checkpoints are not present (session_manager.py:257).
I think the way to fix it is to run a sync op before launching Runner threads, i.e. before trainer.start in worker.py:
sess.run(trainer.sync)
trainer.start(sess, summary_writer)
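The ordering problem can be illustrated without TensorFlow. The toy below is not the starter agent's code: ToyTrainer and run are hypothetical names, local weights are "uninitialized" while they are None, and joining the runner thread before calling process stands in for the time.sleep(10) delay used above to force the race. It only shows that syncing before starting the runner removes the window in which the runner can read uninitialized local state.

```python
import threading


class ToyTrainer:
    """Hypothetical stand-in for the A3C trainer, with no TensorFlow involved."""

    def __init__(self):
        self.global_weights = [0.1, 0.2]
        self.local_weights = None  # "uninitialized" until the first sync
        self.errors = []
        self._runner = None

    def sync(self):
        # Copy global -> local, like the sync op in the real trainer.
        self.local_weights = list(self.global_weights)

    def _rollout(self):
        # The runner thread needs local weights to act.
        if self.local_weights is None:
            self.errors.append("Attempting to use uninitialized value local")

    def start(self):
        self._runner = threading.Thread(target=self._rollout)
        self._runner.start()

    def process(self):
        # The first training step also syncs, but possibly too late.
        self.sync()

    def wait_for_runner(self):
        self._runner.join()


def run(sync_first):
    trainer = ToyTrainer()
    if sync_first:
        trainer.sync()         # the proposed fix: sync before start
    trainer.start()
    trainer.wait_for_runner()  # stands in for the delayed main loop
    trainer.process()
    return trainer.errors


print(run(sync_first=False))
print(run(sync_first=True))
```

Without the initial sync the runner records the uninitialized-value error. With it, the error list stays empty regardless of how long the main loop is delayed.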
Some threads die, and which ones varies from run to run. With the same code, sometimes everything runs fine and sometimes several threads break. The command I use is
python train.py --num-workers 32 --env-id PongDeterministic-v3 --log-dir /tmp/pong
I found two situations in which a thread dies.
The first is
FailedPreconditionError: Attempting to use uninitialized value local/l2/W
The second traces back to Queue.Empty, and this also causes Attempting to use uninitialized value
The only thing I changed was to replace the tmux interface with nohup in train.py, and sometimes all threads run well. Code attached: train.py.zip
Is this expected, or am I missing something? Any suggestions for fixing this? Thanks a lot!