I haven't seen it tried. I would expect it to increase learning speed, because Adam's momentum and variance estimates would be averaged across all the learners and therefore less noisy.
A systematic comparison of the relative performance with local and global optimizer state would make a good paper.
Hi @tlbtlbtlb, sorry I didn't make my question clear...
I mean, the original A3C paper reported that shared RMSProp (i.e., the running-average term g is shared across all threads) is more "robust" than the per-thread version (i.e., each thread keeps its own g term); see "Section 7. Optimization Details". This shared version has indeed been tried by a couple of other A3C implementations.
My question is: how do I implement shared RMSProp in pure distributed TensorFlow? (I'm not familiar with TF, so this might be a naive question...) Other implementations often rely on Python's threading or multiprocessing libraries.
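For reference, here is a plain-Python sketch of the update described in "Section 7. Optimization Details" (the function name and arguments are mine, not the paper's pseudocode): g is the running average of squared gradients, and in the shared variant both theta and g live in shared memory and are updated by all threads without locking, whereas the per-thread variant keeps a separate g per thread.

def shared_rmsprop_step(theta, g, dtheta, eta, alpha, eps):
    # g <- alpha * g + (1 - alpha) * dtheta^2   (g is the shared statistic)
    g = alpha * g + (1.0 - alpha) * dtheta ** 2
    # theta <- theta - eta * dtheta / sqrt(g + eps)
    theta = theta - eta * dtheta / (g + eps) ** 0.5
    return theta, g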
Shared RMSProp with good real-time performance will be an interesting project. We encourage you to fork the code and experiment.
I'd be happy to! Actually, I have a private A3C implementation in Torch 7 that works well with shared RMSProp.
But I'm not sure how to do shared RMSProp in distributed TF (I'm a newbie and there are too many abstractions to learn...). Can I just create an Optimizer before this line:
https://github.com/openai/universe-starter-agent/blob/master/worker.py#L27
then pass it to the A3C class constructor and remove the per-worker optimizer construction here?
https://github.com/openai/universe-starter-agent/blob/master/a3c.py#L241
I worry this won't work because it crosses devices. I also found the TF docs saying "...and the parameter update operations in a tf.train.Optimizer must run on the same device as the variable. Incompatible device placement directives will be ignored when creating these operations." https://www.tensorflow.org/programmers_guide/variables
I'd really appreciate it if you could provide some hints, thanks!
Keep in mind that everything in worker.py is also run N times, once for each worker, so creating something there doesn't make it shared. Rather, the variables created under the replica_device_setter device at https://github.com/openai/universe-starter-agent/blob/master/a3c.py#L173 get shared. See https://www.tensorflow.org/api_docs/python/tf/train/replica_device_setter.
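For illustration, a minimal sketch of that pattern (the task index here is a placeholder): each worker runs this same construction code, but the variable it creates lands on /job:ps, and because every worker uses the same name, they all resolve to one shared variable on the parameter server.

import tensorflow as tf

task_index = 0  # placeholder; each worker would use its own task index
worker_device = "/job:worker/task:{}/cpu:0".format(task_index)

with tf.device(tf.train.replica_device_setter(1, worker_device=worker_device)):
    with tf.variable_scope("global"):
        # Placed on /job:ps/task:0 and shared by every worker that builds "global/step".
        global_step = tf.get_variable("step", [], tf.int32,
                                      initializer=tf.constant_initializer(0, dtype=tf.int32),
                                      trainable=False)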
It probably won't work to create the AdamOptimizer within that block. No computation happens on the parameter server, so each worker would have to make several copies back and forth to run the Adam algorithm. And there's no locking, so you might get mixed updates.
You probably need to use a Queue, where most workers queue the pre-Adam gradient update, and one worker dequeues them and pumps them through the AdamOptimizer.
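A rough sketch of that queue idea (untested; the variables below are illustrative stand-ins, not this repo's code): every worker enqueues its raw gradients, and only one "chief" worker owns the AdamOptimizer and applies them, so the Adam state is updated by a single process.

import tensorflow as tf

num_workers = 4                                    # illustrative
is_chief = True                                    # e.g. task index == 0
global_vars = [tf.get_variable("w", [8, 8]),       # stand-ins for the shared network weights
               tf.get_variable("b", [8])]
grads = [tf.ones_like(v) for v in global_vars]     # stand-in for the worker's gradients

# Place the queue on the parameter server so every worker can reach it;
# shared_name makes the queue built by different workers refer to one object.
with tf.device("/job:ps/task:0"):
    grad_queue = tf.FIFOQueue(capacity=2 * num_workers,
                              dtypes=[g.dtype for g in grads],
                              shapes=[g.get_shape() for g in grads],
                              shared_name="grad_queue")

enqueue_op = grad_queue.enqueue(grads)             # run by every worker after computing grads

if is_chief:
    # Only the chief builds and runs the optimizer, so m/v are updated one gradient at a time.
    opt = tf.train.AdamOptimizer(1e-4)
    dequeued = grad_queue.dequeue()                # list of tensors, one per variable
    train_op = opt.apply_gradients(list(zip(dequeued, global_vars)))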
Will think about it. Thanks for your explanations @tlbtlbtlb :)
When I add log_device_placement=True to the ConfigProto at https://github.com/openai/universe-starter-agent/blob/master/worker.py#L49, then run the train script
python train.py --mode child --num-workers 4 --env-id PongDeterministic-v3 --log-dir /tmp/pong
and check for device placements of the ops added by Adam in one of the worker processes
grep Adam /tmp/pong/a3c.w-2.out
then I get the following placements (excerpt):
Adam/epsilon: (Const): /job:worker/replica:0/task:2/cpu:0
Adam/beta2: (Const): /job:worker/replica:0/task:2/cpu:0
Adam/beta1: (Const): /job:worker/replica:0/task:2/cpu:0
Adam/learning_rate: (Const): /job:worker/replica:0/task:2/cpu:0
global/value/b/Adam_1: (Variable): /job:ps/replica:0/task:0/cpu:0
global/value/b/Adam_1/read: (Identity): /job:ps/replica:0/task:0/cpu:0
global/value/b/Adam_1/Assign: (Assign): /job:ps/replica:0/task:0/cpu:0
global/value/b/Adam: (Variable): /job:ps/replica:0/task:0/cpu:0
global/value/b/Adam/read: (Identity): /job:ps/replica:0/task:0/cpu:0
global/value/b/Adam/Assign: (Assign): /job:ps/replica:0/task:0/cpu:0
global/value/w/Adam_1: (Variable): /job:ps/replica:0/task:0/cpu:0
global/value/w/Adam_1/read: (Identity): /job:ps/replica:0/task:0/cpu:0
global/value/w/Adam_1/Assign: (Assign): /job:ps/replica:0/task:0/cpu:0
global/value/w/Adam: (Variable): /job:ps/replica:0/task:0/cpu:0
global/value/w/Adam/read: (Identity): /job:ps/replica:0/task:0/cpu:0
global/value/w/Adam/Assign: (Assign): /job:ps/replica:0/task:0/cpu:0
global/action/b/Adam_1: (Variable): /job:ps/replica:0/task:0/cpu:0
global/action/b/Adam_1/read: (Identity): /job:ps/replica:0/task:0/cpu:0
global/action/b/Adam_1/Assign: (Assign): /job:ps/replica:0/task:0/cpu:0
global/action/b/Adam: (Variable): /job:ps/replica:0/task:0/cpu:0
That is, it seems that only epsilon, beta1, beta2, and the learning rate are placed on the worker, while all the update ops and moment estimates are placed on ps, so I would assume the code already implements the case where the optimizer state is shared. My guess is that the Adam ops are placed on the ps job because of the N.B. here https://www.tensorflow.org/api_docs/python/tf/Graph#device, i.e. some of the ops in Adam override the worker_device placement and end up on the ps tasks where the global variables live.
@tlbtlbtlb Is the above the expected behavior? If so, could you please clarify why it is consistent with each worker having separate Adam optimizer parameters?
You're right that Adam shares most of the state variables and operations, by putting them on job:ps. This logic is inside the AdamOptimizer class -- see:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/optimizer.py#L455
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/adam.py#L108
It uses colocate_with to put the momentum & variance accumulators on the same node as the global variable, and because they have the same name they get shared.
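A small sketch of what that means in practice (illustrative, not this repo's code): even though each worker constructs its own AdamOptimizer, the slot variables it creates are colocated with the global variables on the ps job and get the same names, so all workers end up reading and writing the same accumulators.

import tensorflow as tf

with tf.device(tf.train.replica_device_setter(1, worker_device="/job:worker/task:0/cpu:0")):
    with tf.variable_scope("global"):
        w = tf.get_variable("w", [16])        # placed on /job:ps/task:0

opt = tf.train.AdamOptimizer(1e-4)            # constructed independently in every worker
train_op = opt.apply_gradients([(tf.ones_like(w), w)])

m = opt.get_slot(w, "m")                      # first-moment accumulator, e.g. global/w/Adam
print(w.device)                               # /job:ps/task:0
print(m.device)                               # colocated with w, so also on the ps task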
I don't know if the momentum & variance updates are free from races among multiple workers writing gradients. If the parameter server has multiple inter-op threads, you could imagine a race between reading and writing v here: https://github.com/tensorflow/tensorflow/blob/2c8d0dca978a246f54c506aae4587dbce5d3bcf0/tensorflow/core/kernels/training_ops.cc#L250
The betaN_power variables are the only ones that change -- betaN and epsilon don't change, so they should be cached on the ps node.
Thanks for looking into this. There is a relevant discussion at tensorflow/tensorflow#6360 which suggests that races are indeed a problem. One could set use_locking=True when constructing the AdamOptimizer (the flag is then honored by the apply_gradients() call here https://github.com/openai/universe-starter-agent/blob/master/a3c.py#L242), but that only protects single variable assignments rather than the whole update block, e.g.
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/adam.py#L170.
Also, according to the referenced issue, it wouldn't protect the momentum and variance accumulators being updated by one worker from being read by another worker trying to update the parameters. That is, some of the parameter updates for one gradient might be derived from momentum and variance accumulators that have already been partially updated by another gradient.
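For what it's worth, a tiny illustrative snippet (not this repo's code) of where the flag goes:

import tensorflow as tf

x = tf.get_variable("x", [3])
loss = tf.reduce_sum(tf.square(x))

# use_locking is a constructor argument of the optimizer; it requests locking inside the
# individual update kernels, but, per the discussion above, it does not make the whole
# multi-variable Adam step atomic across workers.
opt = tf.train.AdamOptimizer(1e-4, use_locking=True)
train_op = opt.minimize(loss)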
Maybe the ResourceVariable class mentioned here https://github.com/tensorflow/tensorflow/issues/6360#issuecomment-271741913 will help with this?
Hi, thanks for the nice code, which is also a good demonstration of advanced distributed TensorFlow. I'm wondering whether it is possible to use a shared optimizer with distributed TensorFlow as well, which should achieve better results, as reported in the paper. Currently each A3C worker has its own: https://github.com/openai/universe-starter-agent/blob/master/a3c.py#L241
I understand this repository is intended for Universe... just a quick question in case you happen to know the answer :)