This is usually a symptom of running too many processes for the number of cores on your machine, so that the logic to detect end-of-game states can't keep up. Each worker requires 2-3 cores. If you want to run 20 workers, you'll need to spread them across at least 3 m4.10xlarges.
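As a rough illustration of that arithmetic (a sketch only, assuming ~20 physical cores per m4.10xlarge and the 2-3 cores per worker mentioned above; the helper is hypothetical, not part of the repo):

```python
import math

def instances_needed(num_workers, cores_per_worker=3, cores_per_instance=20):
    """Back-of-the-envelope estimate of how many instances a given worker count needs."""
    return math.ceil(num_workers * cores_per_worker / cores_per_instance)

# 20 workers at 2-3 cores each on ~20-core m4.10xlarges:
print(instances_needed(20, cores_per_worker=2))  # 2
print(instances_needed(20, cores_per_worker=3))  # 3 -> "at least 3" instances
```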
To provide one more data point, I have 12 agents on a c4.8xlarge and they also keep crashing. This is a summary of the crashes:
Window: w-8 Worker: 8 at 2017-04-16 22:16
Window: w-0 Worker: 0 at 2017-04-16 22:27
Window: w-5 Worker: 5 at 2017-04-16 22:56
Window: w-5 Worker: 5 at 2017-04-16 23:07
Window: w-2 Worker: 2 at 2017-04-17 01:16
Window: w-11 Worker: 1 at 2017-04-17 02:57
Window: w-11 Worker: 1 at 2017-04-17 02:58
Window: w-9 Worker: 9 at 2017-04-17 03:12
Window: w-1 Worker: 1 at 2017-04-17 10:19
Window: w-11 Worker: 1 at 2017-04-17 10:39
Reaction time is ~120ms
A `c4.8xlarge` has 18 cores (we count real cores, not virtual threads), so 12 workers is too many. 6-8 workers is optimal. If `top` reports more than about 80% CPU usage, it won't work reliably.
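A quick way to sanity-check this before launching workers (a sketch, not part of the repo; it assumes the third-party `psutil` package is installed):

```python
import psutil

physical_cores = psutil.cpu_count(logical=False)  # real cores, not hyperthreads
cpu_usage = psutil.cpu_percent(interval=5)        # average utilization over 5 seconds

print("physical cores: %d, current CPU usage: %.1f%%" % (physical_cores, cpu_usage))

# Heuristic from this thread: budget ~2-3 cores per worker and keep total CPU below ~80%.
suggested_workers = max(1, physical_cores // 3)
print("suggested upper bound on workers: %d" % suggested_workers)
if cpu_usage > 80:
    print("CPU is already above 80% -- reduce the number of workers")
```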
Hi, thank you for the comments. I reduced the number of workers and it works better now. My experience is that at the beginning the workers quit a lot, but once everything is set up they run seamlessly without interruption.
I recommend not starting with more workers than the machine can handle, because the agent won't be learning the right thing. It will learn the optimal actions for delayed observations, which may end up being executed for longer than 1/fps. Then, when agents die, the remaining ones will have more CPU time and will be playing a different game. And all the results will be horribly irreproducible.
If this were biology instead of machine learning, it'd be like putting all your experimental mice in one tiny cage with not enough food. Your result won't measure the thing you were trying to measure.
I set up a training process on AWS with an `m4.10xlarge` instance. Then I run `sudo python train.py --num-workers 20 --env-id flashgames.NeonRace-v0 --log-dir /tmp/neonrace --sudo`, where `--sudo` gives root access to the commands run in tmux. I found that some workers still die occasionally. An example log file found in `/tmp/universe-xxx.log` is attached here, along with the screen output printed just before a worker dies. It looks like `Queue.Empty` is thrown, just like in other issues.

Another issue is that it seems not all workers contribute to the A3C training.
As we can see in the figure below, everything seems normal. However, if we toggle off the normal training processes, we find there are 7 workers not plotting to TensorBoard. Is this a sign that they are not contributing?
Tensorflow==0.12.1 OS==Ubuntu 14.04
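To locate the `Queue.Empty` tracebacks mentioned above across the per-worker logs, something like the following sketch can help (the `/tmp/universe-*.log` glob pattern and the exact traceback text are assumptions based on the paths quoted in this post, not a documented interface):

```python
import glob

# Scan the universe log files for Queue.Empty occurrences (path pattern assumed).
for path in sorted(glob.glob("/tmp/universe-*.log")):
    with open(path) as f:
        hits = [i + 1 for i, line in enumerate(f) if "Queue.Empty" in line]
    if hits:
        print("%s: Queue.Empty at lines %s" % (path, hits))
```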