openai / universe-starter-agent

A starter agent that can solve a number of universe environments.
MIT License

some threads die occasionally #44

Closed KaixiangLin closed 7 years ago

KaixiangLin commented 7 years ago

Some threads die, and it varies from run to run. With the same code, sometimes all threads run fine and sometimes several threads break. The command I use is python train.py --num-workers 32 --env-id PongDeterministic-v3 --log-dir /tmp/pong

I found two situations in which a thread dies.

The first is FailedPreconditionError: Attempting to use uninitialized value local/l2/W. The second is a traceback ending in Queue.Empty, which is also accompanied by an "Attempting to use uninitialized value" error:

I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
[2017-01-04 20:42:26,465] Writing logs to file: /tmp/universe-31816.log
E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:153] retrieving CUDA diagnostic information for host: illidan-gpu-3
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: illidan-gpu-3
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: 375.26.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:356] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module  375.26  Thu Dec  8 18:36:43 PST 2016
GCC version:  gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)
"""
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: 375.26.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:293] kernel version seems to match DSO: 375.26.0
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:197] Initialize GrpcChannelCache for job ps -> {0 -> 127.0.0.1:12222}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:197] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:12223, 1 -> 127.0.0.1:12224, 2 -> 127.0.0.1:12225, 3 -> 127.0.0.1:12226, 4 -> 127.0.0.1:12227, 5 -> 127.0.0.1:12228, 6 -> 127.0.0.1:12229, 7 -> 127.0.0.1:12230, 8 -> 127.0.0.1:12231, 9 -> 127.0.0.1:12232, 10 -> 127.0.0.1:12233, 11 -> localhost:12234, 12 -> 127.0.0.1:12235, 13 -> 127.0.0.1:12236, 14 -> 127.0.0.1:12237, 15 -> 127.0.0.1:12238, 16 -> 127.0.0.1:12239, 17 -> 127.0.0.1:12240, 18 -> 127.0.0.1:12241, 19 -> 127.0.0.1:12242, 20 -> 127.0.0.1:12243, 21 -> 127.0.0.1:12244, 22 -> 127.0.0.1:12245, 23 -> 127.0.0.1:12246, 24 -> 127.0.0.1:12247, 25 -> 127.0.0.1:12248, 26 -> 127.0.0.1:12249, 27 -> 127.0.0.1:12250, 28 -> 127.0.0.1:12251, 29 -> 127.0.0.1:12252, 30 -> 127.0.0.1:12253, 31 -> 127.0.0.1:12254}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:206] Started server with target: grpc://localhost:12234
[2017-01-04 20:42:26,535] Making new env: PongDeterministic-v3
[2017-01-04 20:42:31,470] Events directory: /tmp/pong/train_11
[2017-01-04 20:42:31,606] Starting session. If this hangs, we're mostly likely waiting to connect to the parameter server. One common cause is that the parameter server DNS name isn't resolving yet, or is misspecified.
I tensorflow/core/distributed_runtime/master_session.cc:928] Start master session a05a01fed67b6549 with config:
device_filters: "/job:ps"
device_filters: "/job:worker/task:11/cpu:0"

I tensorflow/core/distributed_runtime/master_session.cc:928] Start master session eba1a87afe6c8f45 with config:
device_filters: "/job:ps"
device_filters: "/job:worker/task:11/cpu:0"

[2017-01-04 20:43:06,143] Resetting environment
[2017-01-04 20:43:06,150] Starting training at step=35960
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/linkaixi/anaconda2/envs/tensorflow11/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/home/linkaixi/Dropbox/code/a3c.py", line 91, in run
    self._run()
  File "/home/linkaixi/Dropbox/code/a3c.py", line 100, in _run
    self.queue.put(next(rollout_provider), timeout=600.0)
  File "/home/linkaixi/Dropbox/code/a3c.py", line 121, in env_runner
    fetched = policy.act(last_state, *last_features)
  File "/home/linkaixi/Dropbox/code/model.py", line 83, in act
    {self.x: [ob], self.state_in[0]: c, self.state_in[1]: h})
  File "/home/linkaixi/anaconda2/envs/tensorflow11/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 717, in run
    run_metadata_ptr)
  File "/home/linkaixi/anaconda2/envs/tensorflow11/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 915, in _run
    feed_dict_string, options, run_metadata)
  File "/home/linkaixi/anaconda2/envs/tensorflow11/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 965, in _do_run
    target_list, options, run_metadata)
  File "/home/linkaixi/anaconda2/envs/tensorflow11/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 985, in _do_call
    raise type(e)(node_def, op, message)
FailedPreconditionError: Attempting to use uninitialized value local/value/b
         [[Node: local/value/b/read = Identity[T=DT_FLOAT, _class=["loc:@local/value/b"], _device="/job:worker/replica:0/task:11/cpu:0"](local/value/b)]]

Traceback (most recent call last):
  File "worker.py", line 122, in <module>
    server.join()
  File "/home/linkaixi/anaconda2/envs/tensorflow11/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv[:1] + flags_passthrough))
  File "worker.py", line 114, in main
    if args.job_name == "worker":
  File "worker.py", line 61, in run
    trainer.process(sess)
  File "/home/linkaixi/Dropbox/code/a3c.py", line 257, in process
    rollout = self.pull_batch_from_queue()
  File "/home/linkaixi/Dropbox/code/a3c.py", line 241, in pull_batch_from_queue
    rollout = self.runner.queue.get(timeout=600.0)
  File "/home/linkaixi/anaconda2/envs/tensorflow11/lib/python2.7/Queue.py", line 176, in get
    raise Empty
Queue.Empty

The only thing I changed was to replace the tmux interface with nohup in train.py; sometimes all threads run fine with this setup. Code attached: train.py.zip

Is this expected, or am I missing something? Any suggestions on how to fix this? Thanks a lot!

tlbtlbtlb commented 7 years ago

I've seen this too, but can't reproduce it reliably.

If you restart that one worker, it should join the cluster and start working.

The interface for initializing variables changed a lot between tf 0.11 and 0.12. What's the exact version you're using? (Output of pip show tensorflow)

Another strange thing: this line

I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:197] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:12223, 1 -> 127.0.0.1:12224, 2 -> 127.0.0.1:12225, 3 -> 127.0.0.1:12226, 4 -> 127.0.0.1:12227, 5 -> 127.0.0.1:12228, 6 -> 127.0.0.1:12229, 7 -> 127.0.0.1:12230, 8 -> 127.0.0.1:12231, 9 -> 127.0.0.1:12232, 10 -> 127.0.0.1:12233, 11 -> localhost:12234, 12 -> 127.0.0.1:12235, 13 -> 127.0.0.1:12236, 14 -> 127.0.0.1:12237, 15 -> 127.0.0.1:12238, 16 -> 127.0.0.1:12239, 17 -> 127.0.0.1:12240, 18 -> 127.0.0.1:12241, 19 -> 127.0.0.1:12242, 20 -> 127.0.0.1:12243, 21 -> 127.0.0.1:12244, 22 -> 127.0.0.1:12245, 23 -> 127.0.0.1:12246, 24 -> 127.0.0.1:12247, 25 -> 127.0.0.1:12248, 26 -> 127.0.0.1:12249, 27 -> 127.0.0.1:12250, 28 -> 127.0.0.1:12251, 29 -> 127.0.0.1:12252, 30 -> 127.0.0.1:12253, 31 -> 127.0.0.1:12254}

looks like you have 32 workers, but the train command only asks for 12.

KaixiangLin commented 7 years ago

I use 0.11. Sorry for the typo, it should be 32. I have run with 12, 16, and 32 workers and I cannot reproduce this reliably either: in some runs all threads work well, in others they don't.

Name: tensorflow
Version: 0.11.0
Summary: TensorFlow helps the tensors flow
Home-page: http://tensorflow.org/
Author: Google Inc.
Author-email: opensource@google.com
License: Apache 2.0
Location: /home/linkaixi/anaconda2/envs/tensorflow11/lib/python2.7/site-packages
Requires: protobuf, numpy, mock, wheel, six
KaixiangLin commented 7 years ago

Any idea how to fix this? Shouldn't the non-chief workers wait until the chief worker finishes initialization before executing? According to their documentation, which I quote here:

In the other tasks sv.managed_session() waits for the Model to have been initialized before returning a session to the training code. The non-chief tasks depend on the chief task for initializing the model.
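
To make sure I understand the pattern, here is a minimal sketch of what I expect based on the docs (TF 0.11-style APIs; task_index and the toy global_step/train_op below are placeholders, not this repo's code):

    import tensorflow as tf

    # Sketch of the chief/non-chief Supervisor pattern from the docs.
    # Only the chief runs init_op; non-chief tasks block inside managed_session
    # until the ready_op reports that the checked variables are initialized.
    task_index = 0
    is_chief = (task_index == 0)

    global_step = tf.get_variable("global_step", [], tf.int32,
                                  initializer=tf.zeros_initializer, trainable=False)
    train_op = tf.assign_add(global_step, 1)

    sv = tf.train.Supervisor(is_chief=is_chief,
                             logdir="/tmp/supervisor_sketch",
                             init_op=tf.initialize_all_variables(),
                             ready_op=tf.report_uninitialized_variables(),
                             global_step=global_step)
    with sv.managed_session("") as sess:
        for _ in range(3):
            sess.run(train_op)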

Thanks a lot!

tlbtlbtlb commented 7 years ago

It should definitely work. I don't understand why it sometimes doesn't.

One possible cause: make sure that the parameter server process isn't still running from the last run. Due to something inside TensorFlow, ^C, kill <pid>, and kill -TERM <pid> don't kill the parameter server. You have to run kill -9 <pid> to kill it.

gdb commented 7 years ago

That sounds like your environment is getting into a bad state somehow. It'd be useful to see the logs from /tmp/universe-*.log, as well as the docker logs from the environment and stdout/stderr from your run. The logfiles generally have all the data to explain why something weird is happening.

On Sun, Jan 8, 2017 at 10:21 AM, Yigit notifications@github.com wrote:

This is something I suffer from a lot. I am using a 16-CPU EC2 instance and running 8 workers on it. Eventually my reward drops to zero, and when I connect with VNC I see that the worker takes no actions. After a while, I get this exact error. I have no idea what might be causing it.

tlbtlbtlb commented 7 years ago

This:

FailedPreconditionError: Attempting to use uninitialized value local/value/b
         [[Node: local/value/b/read = Identity[T=DT_FLOAT, _class=["loc:@local/value/b"], _device="/job:worker/replica:0/task:11/cpu:0"](local/value/b)]]

and also this:

InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [6] rhs shape= [8]
     [[Node: global/action/b/Adam/Assign = Assign[T=DT_FLOAT, _class=["loc:@global/action/b"], use_locking=true, validate_shape=true, _device="/job:ps/replica:0/task:0/cpu:0"](global/action/b/Adam, zeros_22)]]

can be caused by leaving a parameter server running from a previous experiment. The parameter server is one of the processes started by train.py, and has --job-name ps in its arguments.

For some reason, TensorFlow in parameter-server mode traps signals, so kill <pid> and tmux kill-session don't kill it. You have to use kill -9 <pid>.

I'll figure out a better solution, but for now try ps auxww | grep python and kill -9 any ps jobs, which look like this:

tlb              75227   0.0  0.8 557340092 127644   ??  S    12:54PM   0:04.89 /Users/tlb/anaconda3/envs/tf12/bin/python worker.py --log-dir /tmp/neonrace3 --env-id flashgames.NeonRace-v0 --num-workers 4 --job-name ps
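
If it's easier to script, here is a hypothetical cleanup snippet (not part of this repo) that does the same thing:

    # SIGKILL any leftover parameter-server processes from a previous run,
    # since SIGTERM isn't enough to stop them.
    import os
    import signal
    import subprocess

    try:
        pids = subprocess.check_output(["pgrep", "-f", "job-name ps"]).split()
    except subprocess.CalledProcessError:  # pgrep exits non-zero if nothing matches
        pids = []
    for pid in pids:
        os.kill(int(pid), signal.SIGKILL)  # same as kill -9 <pid>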
KaixiangLin commented 7 years ago

I think the reason may not just be that a previous ps server is still running. I kill all ps processes before I run. Before running, I check the output of ps -ef as follows:

$ ps -ef |grep python | grep linkaixi
linkaixi 13735 27603  0 10:40 pts/20   00:00:00 grep --color=auto python
linkaixi 29392  4232  0 Jan07 ?        00:00:05 /usr/bin/python3 /usr/bin/update-manager --no-update --no-focus-on-map

The universe log looks like this for a dead thread:

[2017-01-09 10:53:47,810] Making new env: PongDeterministic-v3
[2017-01-09 10:53:50,394] Events directory: /home/linkaixi/20170109_10-53/train_15
[2017-01-09 10:53:50,456] Starting session. If this hangs, we're mostly likely waiting to connect to the parameter server. One common cause is that the parameter server DNS name isn't resolving yet, or is misspecified.
[2017-01-09 10:53:51,872] Resetting environment
[2017-01-09 10:53:51,874] Starting training at step=0

and like this for a normal one:

[2017-01-09 10:53:47,882] Making new env: PongDeterministic-v3
[2017-01-09 10:53:50,201] Events directory: /home/linkaixi/20170109_10-53/train_3
[2017-01-09 10:53:50,331] Starting session. If this hangs, we're mostly likely waiting to connect to the parameter server. One common cause is that the parameter server DNS name isn't resolving yet, or is misspecified.
[2017-01-09 10:53:52,046] Starting training at step=0
[2017-01-09 10:53:52,048] Resetting environment
[2017-01-09 10:54:03,259] Episode terminating: episode_reward=-20.0 episode_length=903
[2017-01-09 10:54:03,287] Resetting environment
[2017-01-09 10:54:15,069] Episode terminating: episode_reward=-21.0 episode_length=916
[2017-01-09 10:54:15,123] Resetting environment
[2017-01-09 10:54:26,735] Episode terminating: episode_reward=-21.0 episode_length=908
[2017-01-09 10:54:26,816] Resetting environment
[2017-01-09 10:54:40,226] Episode terminating: episode_reward=-19.0 episode_length=1043
[2017-01-09 10:54:40,277] Resetting environment
tlbtlbtlb commented 7 years ago

@gdb I have logs of a similar failure in /mnt/kube-efs/universe-perfmon/usa-flashgames.NeonRace-v0-20170109-173626.

This was from running python train.py --num-workers 8 --env-id flashgames.NeonRace-v0 --log-dir /mnt/kube/universe-perfmon/usa-flashgames.NeonRace-v0-20170109-173626 -m child inside a container on tlb-0.devbox.sci.

2/8 workers died:

In worker 0, it seems like the remote never started episode 22. The remote reports:

universe-voecMj-0 [2017-01-09 19:00:00,166] [INFO:root] [EnvStatus] Changing env_state: resetting (env_id=flashgames.NeonRace-v0) -> running (env_id=flashgames.NeonRace-v0) (episode_id: 22->22, fps=60)
universe-voecMj-0 [nginx] 2017/01/09 19:00:00 [info] 62#62: *374 client timed out (110: Connection timed out) while waiting for request, client: 127.0.0.1, server: 0.0.0.0:80
universe-voecMj-0 [nginx] 2017/01/09 19:00:00 [info] 62#62: *375 client timed out (110: Connection timed out) while waiting for request, client: 127.0.0.1, server: 0.0.0.0:80
universe-voecMj-0 [nginx] 2017/01/09 19:00:00 [info] 62#62: *376 client timed out (110: Connection timed out) while waiting for request, client: 127.0.0.1, server: 0.0.0.0:80

But the last state change the agent saw was:

[2017-01-09 18:58:57,558] [0:localhost:5900] RewardBuffer: Creating new RewardState for episode_id=22
[2017-01-09 18:58:57,579] [0:localhost:5900] Received v0.env.describe: env_id=flashgames.NeonRace-v0 env_state=resetting episode_id=22
[2017-01-09 18:58:57,580] [0:localhost:5900] RewardBuffer changing env_state: running (env_id=flashgames.NeonRace-v0) -> resetting (env_id=flashgames.NeonRace-v0) (episode_id: 21->22, fps=60, masked=False, current_episode_id=21)
[2017-01-09 18:58:57,818] [0:localhost:5900] RewardState: popping reward 0.0 from episode_id 21
[2017-01-09 18:58:57,818] [0:localhost:5900] RewardBuffer advancing: has data for next episode: 21->22
[2017-01-09 18:58:57,818] [0:localhost:5900] Episode ended: episode_id=21->22 env_state=resetting

after which it continued getting VNC updates but no rewarder updates.

KaixiangLin commented 7 years ago

Looks like each time the uninitialized variables are local variables. In a3c.py, why do we need two separate tf.device blocks?

        worker_device = "/job:worker/task:{}/cpu:0".format(task)
        with tf.device(tf.train.replica_device_setter(1, worker_device=worker_device)):
            with tf.variable_scope("global"):
                self.network = LSTMPolicy(env.observation_space.shape, env.action_space.n)
                self.global_step = tf.get_variable("global_step", [], tf.int32, initializer=tf.zeros_initializer,
                                                   trainable=False)

        with tf.device(worker_device):
            with tf.variable_scope("local"):
                self.local_network = pi = LSTMPolicy(env.observation_space.shape, env.action_space.n)
                pi.global_step = self.global_step

I think deleting the second one may help work around this bug, since we copy all variables from the ps to the workers:

        worker_device = "/job:worker/task:{}/cpu:0".format(task)
        with tf.device(tf.train.replica_device_setter(1, worker_device=worker_device)):
            with tf.variable_scope("global"):
                self.network = LSTMPolicy(env.observation_space.shape, env.action_space.n)
                self.global_step = tf.get_variable("global_step", [], tf.int32, initializer=tf.zeros_initializer,
                                                   trainable=False)
            with tf.variable_scope("local"):
                self.local_network = pi = LSTMPolicy(env.observation_space.shape, env.action_space.n)
                pi.global_step = self.global_step

This is based on the TensorFlow documentation, which I quote below. Please let me know if I have understood it wrong. @tlbtlbtlb Thanks.

Each client builds a similar graph containing the parameters (pinned to /job:ps as before using tf.train.replica_device_setter() to map them deterministically to the same tasks); and a single copy of the compute-intensive part of the model, pinned to the local task in /job:worker

tlbtlbtlb commented 7 years ago

@rafaljozefowicz @yaroslavvb Opinions?

rafaljozefowicz commented 7 years ago

We want two tf.device blocks because we want rollouts to run inside the worker processes and not inside the PS process. That could be fine with a handful of workers, but when running with 8 or 16 you'll see that the whole system is slower than it has to be, especially if the workers run on separate machines.
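
Roughly, the placement difference looks like this (just a sketch to show the behavior, not the repo's actual code):

    import tensorflow as tf

    # Variables created under replica_device_setter get pinned to /job:ps, while
    # everything under the plain worker device stays on that worker, so rollout
    # ops like policy.act don't round-trip to the PS process on every step.
    worker_device = "/job:worker/task:0/cpu:0"

    with tf.device(tf.train.replica_device_setter(1, worker_device=worker_device)):
        global_w = tf.get_variable("global_w", [4])   # placed on /job:ps/task:0

    with tf.device(worker_device):
        local_w = tf.get_variable("local_w", [4])     # placed on the worker

    print(global_w.device)
    print(local_w.device)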

KaixiangLin commented 7 years ago

I just tried 8 workers on one machine with a 6-core CPU. Performance looks fine, but it took longer to converge: it should be solvable in about 30 minutes, but it took around 50 minutes without the second device block. 8workersnolocaldevice

yaroslavvb commented 7 years ago

"looks like each time the uninitialized variables are local variables" <-- aha, that's the important bit

Note that in worker.py, the Supervisor is initialized with the following ready_op:

variables_to_save = [v for v in tf.all_variables() if not v.name.startswith("local")]
Supervisor(...., ready_op=tf.report_uninitialized_variables(variables_to_save))

So the Supervisor has an init_fn which should initialize the local variables, and a ready_op which doesn't check whether the local variables are initialized. I suspect your training loop starts before the init_fn completes. You can check whether there's an "Initializing all parameters." message on your console, which would confirm that initialization started (but not that it finished).

@rafaljozefowicz any reason not to use the default ready_op (tf.report_uninitialized_variables()) in the Supervisor init?

yaroslavvb commented 7 years ago

I can reproduce the failure reliably by adding a delay to the trainer main loop in worker.py:

        while not sv.should_stop() and (not num_global_steps or global_step < num_global_steps):
            time.sleep(10)
            trainer.process(sess)

What's happening here is a race between the Runner thread, which runs policy.act and hence needs the local variables to be initialized, and the main thread, which runs process, which copies the global variables to the local variables.

Notice the relevant code:

        trainer.start(sess, summary_writer)
        global_step = sess.run(trainer.global_step)
        logger.info("Starting training at step=%d", global_step)
        while not sv.should_stop() and (not num_global_steps or global_step < num_global_steps):
            trainer.process(sess)

The first line launches a new Python thread which starts running the policy from the local variables. The last line runs the training operation, which copies the global variables to the local variables. Most of the time you are lucky and the trainer.process line gets evaluated before Python is able to spin up the new thread; until that line runs, the local variables are not initialized.

The role of init_fn is to initialize the local variables explicitly, but that function only gets called when checkpoints are not present (session_manager.py:257).

I think the way to fix it is to run the sync op before launching the Runner threads, i.e. before trainer.start in worker.py:

    sess.run(trainer.sync)
    trainer.start(sess, summary_writer)
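
To illustrate the idea with a toy example (the names below are illustrative, not taken from a3c.py): a single grouped assign from the global variables into the local ones means that one sess.run(sync) also leaves the local copies initialized before any other thread reads them.

    import tensorflow as tf

    # Toy version of the sync idea: copy "global" variables into "local" ones
    # with one grouped assign, so running the sync op once also initializes
    # the local copies.
    with tf.variable_scope("global"):
        g_w = tf.get_variable("w", [4], initializer=tf.constant_initializer(1.0))
    with tf.variable_scope("local"):
        l_w = tf.get_variable("w", [4], initializer=tf.constant_initializer(0.0))

    sync = tf.group(*[local.assign(glob) for local, glob in zip([l_w], [g_w])])

    with tf.Session() as sess:
        sess.run(tf.initialize_all_variables())
        sess.run(sync)        # local now mirrors global
        print(sess.run(l_w))  # prints four 1.0s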