openai / universe-starter-agent

A starter agent that can solve a number of universe environments.
MIT License

Some workers die occasionally and not all workers contribute to A3C #87

Closed wuhy08 closed 6 years ago

wuhy08 commented 7 years ago

I set up a training process on AWS with an m4.10xlarge instance. Then I run sudo python train.py --num-workers 20 --env-id flashgames.NeonRace-v0 --log-dir /tmp/neonrace --sudo, where --sudo gives root access to the commands run in tmux.

I found that some workers still die occasionally. An example log file from /tmp/universe-xxx.log is attached here, and the screen output just before the worker dies is:

universe-NskPIh-0 | [2017-04-16 01:31:49,799] [INFO:universe.pyprofile] [pyprofile] period=5.02s timers={"reward.parsing.score": {"mean": "494.54us", "calls": 23, "std": "341.04us"}, "rewarder.frame": {"mean": "17.31ms", "calls": 301, "std": "973.30us"}, "rewarder.sleep": {"mean": "15.26ms", "calls": 301, "std": "1.51ms"}, "vnc_env.VNCEnv.vnc_session.step": {"mean": "380.68us", "calls": 301, "std": "1.04ms"}, "score.crop_cache.get.OCRScorerV0": {"mean": "133.93us", "calls": 23, "std": "33.23us"}, "score.crop_cache.get.MatchImage": {"mean": "72.03us", "calls": 23, "std": "10.34us"}, "reward.parsing.gameover": {"mean": "162.37us", "calls": 23, "std": "40.89us"}, "rewarder.compute_reward": {"mean": "670.39us", "calls": 301, "std": "1.10ms"}} counters={"agent_conn.reward": {"mean": 0.0, "calls": 1, "std": 0}, "reward.vnc.updates.n": {"mean": 0.07973421926910297, "calls": 301, "std": 0.283351246909447}, "score.crop_cache.hit.MatchImage": {"mean": 1.0, "calls": 23, "std": 0.0}, "score.crop_cache.hit.OCRScorerV0": {"mean": 1.0, "calls": 23, "std": 0.0}} gauges={"reward_parser.score.last_score": {"std": 0.0, "calls": 23, "value": 41.0, "mean": 41.0}} (export_time=228.88us)
universe-NskPIh-0 | [2017-04-16 01:31:49,799] [INFO:universe.rewarder.remote] [Rewarder] Over past 5.02s, sent 1 reward messages to agent: reward=0 reward_min=0 reward_max=0 done=False info={'rewarder.vnc.updates.n': 0, 'rewarder.profile': '<1712 bytes>', 'rewarder.vnc.updates.pixels': 0, 'rewarder.vnc.updates.bytes': 0}
universe-NskPIh-0 | [2017-04-16 01:31:50,101] [INFO:universe.wrappers.logger] Stats for the past 5.02s: vnc_updates_ps=5.0 n=1 reaction_time=None observation_lag=None action_lag=None reward_ps=0.0 reward_total=0.0 vnc_bytes_ps[total]=6175.9 vnc_pixels_ps[total]=45721.9 reward_lag=None rewarder_message_lag=None fps=60.00
universe-NskPIh-0 | [2017-04-16 01:31:54,801] [INFO:universe.pyprofile] [pyprofile] period=5.00s timers={"reward.parsing.score": {"mean": "421.68us", "calls": 23, "std": "83.92us"}, "rewarder.frame": {"mean": "17.28ms", "calls": 300, "std": "816.12us"}, "rewarder.sleep": {"mean": "15.37ms", "calls": 300, "std": "1.18ms"}, "vnc_env.VNCEnv.vnc_session.step": {"mean": "284.72us", "calls": 300, "std": "700.91us"}, "score.crop_cache.get.OCRScorerV0": {"mean": "124.74us", "calls": 23, "std": "28.87us"}, "score.crop_cache.get.MatchImage": {"mean": "81.79us", "calls": 23, "std": "24.96us"}, "reward.parsing.gameover": {"mean": "165.96us", "calls": 23, "std": "27.61us"}, "rewarder.compute_reward": {"mean": "581.98us", "calls": 300, "std": "766.78us"}} counters={"agent_conn.reward": {"mean": 0.0, "calls": 1, "std": 0}, "reward.vnc.updates.n": {"mean": 0.07999999999999996, "calls": 300, "std": 0.2837870981155701}, "score.crop_cache.hit.MatchImage": {"mean": 1.0, "calls": 23, "std": 0.0}, "score.crop_cache.hit.OCRScorerV0": {"mean": 1.0, "calls": 23, "std": 0.0}} gauges={"reward_parser.score.last_score": {"std": 0.0, "calls": 23, "value": 41.0, "mean": 41.0}} (export_time=262.74us)
universe-NskPIh-0 | [2017-04-16 01:31:54,802] [INFO:universe.rewarder.remote] [Rewarder] Over past 5.00s, sent 1 reward messages to agent: reward=0 reward_min=0 reward_max=0 done=False info={'rewarder.vnc.updates.n': 0, 'rewarder.profile': '<1720 bytes>', 'rewarder.vnc.updates.pixels': 0, 'rewarder.vnc.updates.bytes': 0}
universe-NskPIh-0 | [2017-04-16 01:31:55,116] [INFO:universe.wrappers.logger] Stats for the past 5.01s: vnc_updates_ps=4.8 n=1 reaction_time=None observation_lag=None action_lag=None reward_ps=0.0 reward_total=0.0 vnc_bytes_ps[total]=5948.8 vnc_pixels_ps[total]=43981.9 reward_lag=None rewarder_message_lag=None fps=60.03
Traceback (most recent call last):
  File "worker.py", line 152, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 43, in run
    sys.exit(main(sys.argv[:1] + flags_passthrough))
  File "worker.py", line 144, in main
    run(args, server)
  File "worker.py", line 81, in run
    trainer.process(sess)
  File "/home/ubuntu/universe-starter-agent/a3c.py", line 270, in process
    rollout = self.pull_batch_from_queue()
  File "/home/ubuntu/universe-starter-agent/a3c.py", line 254, in pull_batch_from_queue
    rollout = self.runner.queue.get(timeout=600.0)
  File "/usr/lib/python2.7/Queue.py", line 176, in get
    raise Empty
Queue.Empty
[2017-04-16 01:31:56,444] Killing and removing container: id=3ea089f2b8b38009a090d9d5c9de2942f0c311d10e688642294e9adf4a75bb37

It looks like Queue.Empty is thrown, just as in other reported issues.
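A possible mitigation would be to retry the blocking get a few times before giving up, so that a temporary stall in the runner thread does not immediately kill the worker. This is only a sketch of the idea (the function name, retry count, and logging are mine, not code from the repo):

# Sketch only, not code from this repo: retry the blocking get a few times
# before giving up, so a temporary stall in the runner thread does not
# immediately kill the worker.
import Queue  # Python 2, as in the traceback above; use `import queue` on Python 3

def get_rollout_with_retries(runner_queue, timeout=600.0, max_retries=3):
    for attempt in range(max_retries):
        try:
            return runner_queue.get(timeout=timeout)
        except Queue.Empty:
            print("runner queue empty after %.0fs (attempt %d of %d)"
                  % (timeout, attempt + 1, max_retries))
    raise Queue.Empty  # still nothing after all retries; give up as before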

Another issue is that not all workers seem to contribute to the A3C training.

As we can see in the figure below, everything seems normal.

[Screenshot: TensorBoard plots, 2017-04-15 18:58]

However, if we toggle off the processes that are training normally, we find that 7 workers are not plotting anything on TensorBoard. Is this a sign that they are not contributing?

[Screenshot: TensorBoard plots with the normal workers toggled off, 2017-04-15 19:02]

TensorFlow==0.12.1, OS==Ubuntu 14.04

tlbtlbtlb commented 7 years ago

This is usually a symptom of running too many processes for the number of cores on your machine, so that the logic to detect end-of-game states can't keep up. Each worker requires 2-3 cores. If you want to run 20 workers, you'll need to spread them across at least 3 m4.10xlarges.
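As a rough rule of thumb you can size --num-workers from the physical core count. This is only a sketch: the 3-cores-per-worker figure is the conservative end of the 2-3 range above, and the hyperthreading assumption (2 logical CPUs per physical core) is mine, not an official formula.

# Rough sizing sketch: budget 3 physical cores per worker and assume
# hyperthreading, i.e. 2 logical CPUs per physical core.
import multiprocessing

CORES_PER_WORKER = 3
logical_cpus = multiprocessing.cpu_count()
physical_cores = logical_cpus // 2  # m4.10xlarge: 40 vCPUs -> 20 physical cores

num_workers = max(1, physical_cores // CORES_PER_WORKER)
print("suggested --num-workers: %d" % num_workers)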

crci commented 7 years ago

To provide one more data point, I have 12 agents on a c4.8xlarge and they also keep crashing. This is a summary of the crashes:

Window: w-8 Worker: 8 at 2017-04-16 22:16
Window: w-0 Worker: 0 at 2017-04-16 22:27
Window: w-5 Worker: 5 at 2017-04-16 22:56
Window: w-5 Worker: 5 at 2017-04-16 23:07
Window: w-2 Worker: 2 at 2017-04-17 01:16
Window: w-11 Worker: 1 at 2017-04-17 02:57
Window: w-11 Worker: 1 at 2017-04-17 02:58
Window: w-9 Worker: 9 at 2017-04-17 03:12
Window: w-1 Worker: 1 at 2017-04-17 10:19
Window: w-11 Worker: 1 at 2017-04-17 10:39

Reaction time is ~120 ms.

tlbtlbtlb commented 7 years ago

A c4.8xlarge has 18 cores (we count real cores, not virtual threads), so 12 workers is too many; 6-8 workers is optimal. If top reports more than about 80% CPU usage, it won't work reliably.
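A quick way to check this before adding workers (just a sketch; it uses the third-party psutil package, which is not part of universe-starter-agent, and the 80% threshold is the figure quoted above):

# Sketch of a pre-flight CPU check; psutil is a third-party package.
import psutil

usage = psutil.cpu_percent(interval=5.0)  # overall CPU %, averaged over 5 seconds
if usage > 80.0:
    print("CPU at %.0f%% - workers are likely to stall; reduce --num-workers" % usage)
else:
    print("CPU at %.0f%% - should be OK" % usage)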

wuhy08 commented 7 years ago

Hi, thank you for the comments. I reduced the number of workers and found that it works better. My experience is that the workers tend to quit a lot at the beginning, but once everything is set up they run smoothly without interruption.

tlbtlbtlb commented 7 years ago

I recommend not starting with more workers than the machine can handle, because the agent won't be learning the right thing. It will learn the optimal actions for delayed observations, which may be executed for longer than 1/fps. Then, when agents die, the remaining ones will have more CPU time and will effectively be playing a different game. And all the results will be horribly irreproducible.

If this were biology instead of machine learning, it'd be like putting all your experimental mice in one tiny cage with not enough food. Your result won't measure the thing you were trying to measure.