openai / universe-starter-agent

A starter agent that can solve a number of universe environments.
MIT License

Is shared Optimizer possible? #67

Closed. pengsun closed this issue 6 years ago

pengsun commented 7 years ago

Hi, thanks for the nice code, which also demonstrates advanced distributed TensorFlow usage. I'm wondering whether it is possible to use a shared optimizer with distributed TensorFlow as well, which should achieve better results, as reported in the paper. Currently each A3C worker has its own optimizer: https://github.com/openai/universe-starter-agent/blob/master/a3c.py#L241

I understand this repository is intended for universe... just a quick question in case you happen to know the answer :)

tlbtlbtlb commented 7 years ago

I haven't seen it tried. I would expect it to increase learning speed, because Adam's momentum and variance estimates would be averaged across all the learners and would therefore be less noisy.

A systematic comparison of the relative performance with local and global optimizer state would make a good paper.

pengsun commented 7 years ago

Hi @tlbtlbtlb, sorry I didn't make my question clear...

I mean that the original A3C paper reported that shared RMSProp (i.e., the running-average term g is shared across all threads) is more "robust" than the per-thread version (i.e., each thread has its own g term); see "Section 7. Optimization Details". This shared version has indeed been tried by a couple of A3C implementations elsewhere.
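
For concreteness, here is my reading of that update as a small NumPy sketch (the function and constants below are just illustrative, not anything from this repo): in the shared variant the running average g is a single array that every actor-learner thread reads and updates, whereas the per-thread variant gives each thread its own g.

```python
import numpy as np

def shared_rmsprop_step(theta, g, grad, lr, alpha=0.99, eps=0.1):
    # g is the running average of squared gradients; in the "shared" variant
    # this one array is read and written by every actor-learner thread.
    g[:] = alpha * g + (1.0 - alpha) * grad ** 2
    theta[:] -= lr * grad / np.sqrt(g + eps)
    return theta, g
```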

My question is: how can shared RMSProp be implemented with pure distributed TensorFlow? (I'm not familiar with TF, so this might be a naive question...) Other implementations often rely on the Python threading or multiprocessing libraries.

tlbtlbtlb commented 7 years ago

Shared RMSProp with good real-time performance will be an interesting project. We encourage you to fork the code and experiment.

pengsun commented 7 years ago

I'd be happy to! Actually, I have a private A3C implementation in Torch 7 that works well with shared RMSProp.

But currently I'm not sure how to do shared RMSProp in distributed TF (I'm a newbie and there are many abstractions to learn...). Can I just create an Optimizer before this line: https://github.com/openai/universe-starter-agent/blob/master/worker.py#L27, pass it to the A3C class constructor, and remove the per-worker optimizer construction here? https://github.com/openai/universe-starter-agent/blob/master/a3c.py#L241

I worry this won't work because it crosses devices. I also found the TF docs saying "...and the parameter update operations in a tf.train.Optimizer must run on the same device as the variable. Incompatible device placement directives will be ignored when creating these operations." https://www.tensorflow.org/programmers_guide/variables

I'd really appreciate it if you could provide some hints, thanks!

tlbtlbtlb commented 7 years ago

Keep in mind that everything in worker.py is also run N times, once for each worker, so creating something there doesn't make it shared. Rather, the variables created under the replica_device_setter device at https://github.com/openai/universe-starter-agent/blob/master/a3c.py#L173 get shared. See https://www.tensorflow.org/api_docs/python/tf/train/replica_device_setter.
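
A minimal sketch of what that means (my own toy example, not code from the repo): variables built inside the replica_device_setter block are assigned to the ps job, and because every worker builds them under the same names they resolve to the same shared state, while stateless ops stay on the worker device.

```python
import tensorflow as tf

worker_device = "/job:worker/task:0/cpu:0"  # each worker substitutes its own task index

with tf.device(tf.train.replica_device_setter(1, worker_device=worker_device)):
    # The Variable op is placed on /job:ps/task:0, so all workers that build
    # this graph against the same cluster share one copy of it.
    global_step = tf.get_variable("global_step", [], tf.int32,
                                  initializer=tf.constant_initializer(0, dtype=tf.int32),
                                  trainable=False)
    # Ordinary stateless ops remain on worker_device.
    step_float = tf.cast(global_step, tf.float32)
```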

It probably won't work to create the AdamOptimizer within that block. No computation happens on the parameter server, so each worker would have to make several copies back and forth to run the Adam algorithm. And there's no locking, so you might get mixed updates.

You probably need to use a Queue, where most workers queue the pre-Adam gradient update, and one worker dequeues them and pumps them through the AdamOptimizer.
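
A rough, untested sketch of that queue idea (the function and names below are mine, not anything in the repo): non-chief workers enqueue their gradients into a FIFOQueue that lives on the ps job, and a single chief worker dequeues them and pushes them through one AdamOptimizer, whose slot variables end up colocated with the shared weights.

```python
import tensorflow as tf

def shared_adam_via_queue(grads, global_vars, is_chief,
                          ps_device="/job:ps/task:0/cpu:0"):
    # grads: this worker's gradient tensors, aligned with global_vars.
    dtypes = [v.dtype.base_dtype for v in global_vars]
    shapes = [v.get_shape() for v in global_vars]

    with tf.device(ps_device):
        # shared_name lets every worker bind to the same queue on the ps job.
        grad_queue = tf.FIFOQueue(capacity=64, dtypes=dtypes, shapes=shapes,
                                  shared_name="shared_grad_queue")

    enqueue_op = grad_queue.enqueue(grads)
    if not is_chief:
        return enqueue_op  # non-chief workers only feed the queue

    dequeued = grad_queue.dequeue()
    if not isinstance(dequeued, (list, tuple)):
        dequeued = [dequeued]
    # Adam's slot variables are colocated with global_vars (i.e. on the ps
    # job), so there is a single set of momentum/variance accumulators.
    opt = tf.train.AdamOptimizer(1e-4)
    return opt.apply_gradients(list(zip(dequeued, global_vars)))
```

A real version would also have to deal with queue shutdown and with the chief falling behind the other workers.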

pengsun commented 7 years ago

Will think about it. Thanks for your explanations @tlbtlbtlb :)

jhumplik commented 7 years ago

When I add log_device_placement=True to the ConfigProto at https://github.com/openai/universe-starter-agent/blob/master/worker.py#L49, then run the train script

python train.py --mode child --num-workers 4 --env-id PongDeterministic-v3 --log-dir /tmp/pong

and check for device placements of the ops added by Adam in one of the worker processes

grep Adam /tmp/pong/a3c.w-2.out

then I get the following (showing first 40 lines):

Adam/epsilon: (Const): /job:worker/replica:0/task:2/cpu:0
I tensorflow/core/common_runtime/simple_placer.cc:827] Adam/epsilon: (Const)/job:worker/replica:0/task:2/cpu:0
Adam/beta2: (Const): /job:worker/replica:0/task:2/cpu:0
I tensorflow/core/common_runtime/simple_placer.cc:827] Adam/beta2: (Const)/job:worker/replica:0/task:2/cpu:0
Adam/beta1: (Const): /job:worker/replica:0/task:2/cpu:0
I tensorflow/core/common_runtime/simple_placer.cc:827] Adam/beta1: (Const)/job:worker/replica:0/task:2/cpu:0
Adam/learning_rate: (Const): /job:worker/replica:0/task:2/cpu:0
I tensorflow/core/common_runtime/simple_placer.cc:827] Adam/learning_rate: (Const)/job:worker/replica:0/task:2/cpu:0
global/value/b/Adam_1: (Variable): /job:ps/replica:0/task:0/cpu:0
I tensorflow/core/common_runtime/simple_placer.cc:827] global/value/b/Adam_1: (Variable)/job:ps/replica:0/task:0/cpu:0
global/value/b/Adam_1/read: (Identity): /job:ps/replica:0/task:0/cpu:0
I tensorflow/core/common_runtime/simple_placer.cc:827] global/value/b/Adam_1/read: (Identity)/job:ps/replica:0/task:0/cpu:0
global/value/b/Adam_1/Assign: (Assign): /job:ps/replica:0/task:0/cpu:0
I tensorflow/core/common_runtime/simple_placer.cc:827] global/value/b/Adam_1/Assign: (Assign)/job:ps/replica:0/task:0/cpu:0
global/value/b/Adam: (Variable): /job:ps/replica:0/task:0/cpu:0
I tensorflow/core/common_runtime/simple_placer.cc:827] global/value/b/Adam: (Variable)/job:ps/replica:0/task:0/cpu:0
global/value/b/Adam/read: (Identity): /job:ps/replica:0/task:0/cpu:0
I tensorflow/core/common_runtime/simple_placer.cc:827] global/value/b/Adam/read: (Identity)/job:ps/replica:0/task:0/cpu:0
global/value/b/Adam/Assign: (Assign): /job:ps/replica:0/task:0/cpu:0
I tensorflow/core/common_runtime/simple_placer.cc:827] global/value/b/Adam/Assign: (Assign)/job:ps/replica:0/task:0/cpu:0
global/value/w/Adam_1: (Variable): /job:ps/replica:0/task:0/cpu:0
I tensorflow/core/common_runtime/simple_placer.cc:827] global/value/w/Adam_1: (Variable)/job:ps/replica:0/task:0/cpu:0
global/value/w/Adam_1/read: (Identity): /job:ps/replica:0/task:0/cpu:0
I tensorflow/core/common_runtime/simple_placer.cc:827] global/value/w/Adam_1/read: (Identity)/job:ps/replica:0/task:0/cpu:0
global/value/w/Adam_1/Assign: (Assign): /job:ps/replica:0/task:0/cpu:0
I tensorflow/core/common_runtime/simple_placer.cc:827] global/value/w/Adam_1/Assign: (Assign)/job:ps/replica:0/task:0/cpu:0
global/value/w/Adam: (Variable): /job:ps/replica:0/task:0/cpu:0
I tensorflow/core/common_runtime/simple_placer.cc:827] global/value/w/Adam: (Variable)/job:ps/replica:0/task:0/cpu:0
global/value/w/Adam/read: (Identity): /job:ps/replica:0/task:0/cpu:0
I tensorflow/core/common_runtime/simple_placer.cc:827] global/value/w/Adam/read: (Identity)/job:ps/replica:0/task:0/cpu:0
global/value/w/Adam/Assign: (Assign): /job:ps/replica:0/task:0/cpu:0
I tensorflow/core/common_runtime/simple_placer.cc:827] global/value/w/Adam/Assign: (Assign)/job:ps/replica:0/task:0/cpu:0
global/action/b/Adam_1: (Variable): /job:ps/replica:0/task:0/cpu:0
I tensorflow/core/common_runtime/simple_placer.cc:827] global/action/b/Adam_1: (Variable)/job:ps/replica:0/task:0/cpu:0
global/action/b/Adam_1/read: (Identity): /job:ps/replica:0/task:0/cpu:0
I tensorflow/core/common_runtime/simple_placer.cc:827] global/action/b/Adam_1/read: (Identity)/job:ps/replica:0/task:0/cpu:0
global/action/b/Adam_1/Assign: (Assign): /job:ps/replica:0/task:0/cpu:0
I tensorflow/core/common_runtime/simple_placer.cc:827] global/action/b/Adam_1/Assign: (Assign)/job:ps/replica:0/task:0/cpu:0
global/action/b/Adam: (Variable): /job:ps/replica:0/task:0/cpu:0
I tensorflow/core/common_runtime/simple_placer.cc:827] global/action/b/Adam: (Variable)/job:ps/replica:0/task:0/cpu:0

That is, it seems that only epsilon, beta1, beta2, and the learning rate are placed on the worker, while all the update ops and moment estimates are placed on ps, so I would assume that the code already implements the case where the optimizer is shared. My guess is that the Adam ops are placed on the ps job because of the N.B. here https://www.tensorflow.org/api_docs/python/tf/Graph#device, i.e. some of the ops in Adam override the worker_device placement and end up on the ps tasks where the global variables live.

@tlbtlbtlb Is the above expected behavior? If so, could you please clarify why it is consistent with each worker having separate Adam optimizer parameters?

tlbtlbtlb commented 7 years ago

You're right that Adam shares most of the state variables and operations, by putting them on job:ps. This logic is inside the AdamOptimizer class -- see: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/optimizer.py#L455 https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/adam.py#L108

It uses colocate_with to put the momentum & variance accumulators on the same node as the global variable, and because they have the same name they get shared.
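
In spirit, the slot creation does something like the following (a simplified sketch of the idea, not the actual optimizer internals):

```python
import tensorflow as tf

def create_adam_slot(var, slot_name):
    # Because the slot is created under colocate_with(var), it lands on the
    # device that holds var (the ps task for the global weights), and since
    # every worker derives the same name from var, they all bind to one slot.
    with tf.colocate_with(var):
        return tf.get_variable(var.op.name.replace("/", "_") + "_" + slot_name,
                               shape=var.get_shape(),
                               dtype=var.dtype.base_dtype,
                               initializer=tf.zeros_initializer(),
                               trainable=False)
```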

I don't know if the momentum & variance updates are free from races among multiple workers writing gradients. If the parameter server has multiple inter-op threads, you could imagine a race between reading and writing v here: https://github.com/tensorflow/tensorflow/blob/2c8d0dca978a246f54c506aae4587dbce5d3bcf0/tensorflow/core/kernels/training_ops.cc#L250

The betaN_power variables are the only ones that change -- betaN and epsilon don't change, so they should be cached on the ps node.

jhumplik commented 7 years ago

Thanks for looking into this. There is a relevant discussion at tensorflow/tensorflow#6360 which suggests that races are indeed a problem. One could pass use_locking=True to the optimizer whose apply_gradients() is called here https://github.com/openai/universe-starter-agent/blob/master/a3c.py#L242, but that only protects individual variable assignments rather than the whole update block, e.g. https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/adam.py#L170.
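
For reference, a sketch of that change (the flag is taken by the optimizer constructor and then picked up by the ops that apply_gradients generates):

```python
import tensorflow as tf

# use_locking=True makes the generated Apply*/Assign ops take a lock around
# each individual variable write; it does not serialize the whole Adam update.
opt = tf.train.AdamOptimizer(1e-4, use_locking=True)
# train_op = opt.apply_gradients(grads_and_vars)  # grads_and_vars as built in a3c.py
```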

Also, according to the referenced issue, it wouldn't protect the momentum and variance accumulators that are being updated by one worker from being read by another worker trying to update the parameters. That is, it might still be the case that some of the parameter updates from one gradient are derived from momentum and variance accumulators that have already been partially updated by another gradient.

Maybe the ResourceVariable class mentioned here https://github.com/tensorflow/tensorflow/issues/6360#issuecomment-271741913 will help with this?
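
If someone wants to try it, the simplest entry point I can see is the use_resource flag on tf.get_variable (assuming a TF version that already has it), e.g.:

```python
import tensorflow as tf

# Illustrative only: use_resource=True makes this a ResourceVariable, whose
# reads and writes have well-defined ordering, which may help with the
# mixed-update concern above.
w = tf.get_variable("value_w", shape=[256, 1], dtype=tf.float32,
                    initializer=tf.zeros_initializer(),
                    use_resource=True)
```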