tensorflow / benchmarks

A benchmark framework for TensorFlow
Apache License 2.0

VariableMgrDistributedReplicated decreases the speed of convergence #115

Open Sampson1107 opened 6 years ago

Sampson1107 commented 6 years ago

@reedwm

Hi, I am running into trouble with the following code:

for i, (g, v) in enumerate(grads):
  apply_gradient_op = opt.apply_gradients([(g, v)])
  barrier = self.benchmark_cnn.add_sync_queues_and_barrier(
      'replicate_variable_%s' % i, [apply_gradient_op])

Here the servers run apply_gradient_op one by one instead of averaging the servers' gradients. When the optimizer is Momentum, this produces a different update than applying the averaged gradient would. In my application the convergence is slower and the training time increases. Is there any way to add an op that sums all the gradients and returns the sum of all the servers' gradients? Thanks a lot!

reedwm commented 6 years ago

I'm a bit confused about what the issue is. Each worker applies its gradients to the parameter server's variables. Once every worker has done so, they read back the updated parameter server variables. Because each worker waits until all other workers are finished, they all read the same value in updated_value.

Are you trying to say it's bad that each worker has its own MomentumOptimizer instead of using a globally shared MomentumOptimizer? I don't think that would be a huge problem, because each worker's momentum should be approximately the same, as each worker gets the same distribution of images. I could be wrong though; perhaps it is a big problem.

x666633 commented 6 years ago

Hi @reedwm, how are you? We have run into some trouble. Can you help and give us some suggestions? Thanks a lot. (screenshot attached)

How could I make the benchmark use SyncReplicasOptimizer?
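For reference, the standard TF 1.x pattern for this is to wrap the optimizer in tf.train.SyncReplicasOptimizer. The sketch below is not benchmark code, just a minimal illustration; loss, global_step, num_workers, and is_chief are placeholders supplied by the caller:

import tensorflow as tf  # TF 1.x API assumed

def build_sync_train_op(loss, global_step, num_workers, is_chief):
    # Wrap Momentum in SyncReplicasOptimizer so gradients are averaged
    # over num_workers replicas before a single apply.
    opt = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
    sync_opt = tf.train.SyncReplicasOptimizer(
        opt,
        replicas_to_aggregate=num_workers,
        total_num_replicas=num_workers)
    train_op = sync_opt.minimize(loss, global_step=global_step)
    # The hook manages the token queue that keeps the replicas in sync;
    # pass it to MonitoredTrainingSession on each worker.
    sync_hook = sync_opt.make_session_run_hook(is_chief)
    return train_op, sync_hook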

Sampson1107 commented 6 years ago

@reedwm Thanks for your reply! The momentum optimizer update is:

accumulation = momentum * accumulation + gradient
variable -= learning_rate * accumulation

For example, suppose there are two servers (in fact, we have many servers). For the first server:

accumulation1 = momentum * accumulation0 + gradient1
variable -= learning_rate * accumulation1

For the second server:

accumulation2 = momentum * accumulation1 + gradient2
             = momentum * (momentum * accumulation0 + gradient1) + gradient2
variable -= learning_rate * accumulation2

Applying the gradients one by one like this is different from the averaged mode:

accumulation = momentum * accumulation0 + (gradient1 + gradient2) / 2
variable -= learning_rate * accumulation
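To see that the two modes really give different results, here is a tiny Python sketch (not benchmark code; the learning rate, momentum, and gradient values are made up) for a single scalar weight:

momentum, lr = 0.9, 0.1
g1, g2 = 0.4, 0.6        # made-up gradients from two workers
acc0, var0 = 0.5, 1.0    # made-up initial accumulator and variable value

# Mode A: apply each worker's gradient one by one (the behavior described above).
acc_a = momentum * acc0 + g1
var_a = var0 - lr * acc_a
acc_a = momentum * acc_a + g2
var_a -= lr * acc_a

# Mode B: average the gradients, then apply once.
acc_b = momentum * acc0 + (g1 + g2) / 2
var_b = var0 - lr * acc_b

print(var_a, var_b)  # 0.7785 vs 0.905: the two modes diverge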

Is there any method to apply the averaged gradient instead of applying the gradients one by one?

How could the ops be modified so that the last server applies the gradients while the other servers only add theirs?

for i, (g, v) in enumerate(grads):
  apply_gradient_op = opt.apply_gradients([(g, v)])
  barrier = self.benchmark_cnn.add_sync_queues_and_barrier(
      'replicate_variable_%s' % i, [apply_gradient_op])

mingxingtan commented 6 years ago

I think you misunderstood the code here:

for i, (g, v) in enumerate(grads):
  apply_gradient_op = opt.apply_gradients([(g, v)])
  barrier = self.benchmark_cnn.add_sync_queues_and_barrier(
      'replicate_variable_%s' % i, [apply_gradient_op])

It doesn't mean each worker executes apply_gradient_op one by one; it means the parameter server executes the apply_gradients op once, and all workers wait until this op is done before proceeding. The grads here are already aggregated from all workers in get_gradients_to_apply.

Therefore, the workflow is the same as Fig 1 (not Fig 2): Step 1, each worker computes its own gradients; Step 2, all workers send their gradients to the parameter server, which aggregates them; Step 3, the parameter server applies the aggregated gradients with the optimizer, while add_sync_queues_and_barrier forces all workers to wait until this apply_gradient_op has finished; Step 4, the updated variables are copied back to each worker.
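For context, aggregating a list of per-replica gradients before a single apply_gradients call typically looks something like the sketch below. This is a generic TF 1.x pattern, not the benchmark's actual get_gradients_to_apply; average_gradients and per_tower_grads are illustrative names:

import tensorflow as tf  # TF 1.x graph-mode API assumed

def average_gradients(all_grads_and_vars):
    # all_grads_and_vars: one [(grad, var), ...] list per replica/tower.
    averaged = []
    for grads_and_vars in zip(*all_grads_and_vars):
        grads = [g for g, _ in grads_and_vars]
        avg_grad = tf.add_n(grads) / float(len(grads))  # elementwise mean
        _, var = grads_and_vars[0]  # the variable is shared across replicas
        averaged.append((avg_grad, var))
    return averaged

# A single apply then consumes the averaged list, e.g.:
# train_op = opt.apply_gradients(average_gradients(per_tower_grads))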

Sampson1107 commented 6 years ago

@mingxingtan Thanks for your reply! I printed the gradients and the actual variable, and computed the next-step variable with both methods. The actual variable at the next step matches the 'execute apply_gradient_op one by one' mode, not the 'average all gradients and apply once' mode.

Sampson1107 commented 6 years ago

@tfboyd @reedwm @mingxingtan I am still stuck on this problem. Printing one of the weight variables at the next step gives:

aggregate and apply only once: -0.132180994674
apply gradients one by one:    -0.132176235777
real result:                   -0.13217624

So the gradients really are applied one by one rather than averaged, in both 'distributed_replicated' mode and 'parameter_server' mode. I don't think I misunderstood the code here:

for i, (g, v) in enumerate(grads):
  apply_gradient_op = opt.apply_gradients([(g, v)])
  barrier = self.benchmark_cnn.add_sync_queues_and_barrier(
      'replicate_variable_%s' % i, [apply_gradient_op])

Is there any way to apply the averaged gradient once, by adding an averaging op, instead of applying the gradients one by one?

x666633 commented 6 years ago

@mingxingtan Thanks for your reply! SyncReplicasOptimizer uses data_flow_ops.ConditionalAccumulator to average the gradients from all workers, like this: (screenshot attached). How could get_gradients_to_apply be changed to average the gradients from all workers?
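For reference, the raw ConditionalAccumulator pattern in TF 1.x looks roughly like this (a minimal sketch with a made-up shape and shared_name, not the SyncReplicasOptimizer source):

import tensorflow as tf  # TF 1.x API assumed

# One shared accumulator per variable; shared_name must match on every worker
# so they all push into the same accumulator on the parameter server.
accum = tf.ConditionalAccumulator(
    dtype=tf.float32, shape=[10], shared_name='fc_weights_grad_accum')

grad = tf.placeholder(tf.float32, shape=[10])    # this worker's gradient
push_op = accum.apply_grad(grad, local_step=0)   # each worker runs this
# take_grad blocks until num_required gradients arrive, then returns their mean.
avg_grad = accum.take_grad(num_required=2)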

mingxingtan commented 6 years ago

@Sampson1107 @x666633 It turns out you are right: our current implementation does apply gradients one-by-one. This is okay for some optimizers like SGD, but not correct for momentum.

A possible fix is just to follow SyncReplicasOptimizer: first use a ConditionalAccumulator to aggregate all gradients, and then do a single opt.apply. The code change should be relatively simple, but there might be some performance impact.
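A rough sketch of what that fix could look like is below. This is only an illustration of the idea, not the actual change: build_sync_apply_op, the shared_name scheme, and the assumption that gradients are dense Tensors (not IndexedSlices) are all illustrative:

import tensorflow as tf  # TF 1.x API assumed

def build_sync_apply_op(opt, grads, num_workers):
    # grads: list of (gradient, variable) pairs on this worker.
    push_ops, averaged = [], []
    for g, v in grads:
        # One accumulator per variable, shared across workers via shared_name.
        accum = tf.ConditionalAccumulator(
            dtype=g.dtype.base_dtype, shape=v.shape,
            shared_name=v.op.name + '/grad_accum')
        push_ops.append(accum.apply_grad(g))
        # take_grad returns the mean once num_workers gradients have arrived.
        averaged.append((accum.take_grad(num_workers), v))
    # Every worker runs push_ops; only the chief should run apply_op, with the
    # existing add_sync_queues_and_barrier keeping the other workers waiting.
    apply_op = opt.apply_gradients(averaged)
    return push_ops, apply_op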

We welcome community contributions. Does anyone want to give it a try? If not, I will put up a CL and evaluate the performance next week.

Sampson1107 commented 6 years ago

@mingxingtan Thanks for your reply! We have attempted to follow SyncReplicasOptimizer and add it to the benchmark, but the program could not run successfully. Could you modify the apply/barrier ops to use the averaging mode and share the code with us? Looking forward to your reply!