smurching opened this issue 6 years ago
I have a potential fix for this in #183 but I'm open to any suggestions! Thanks in advance :)
Hi @smurching, thanks for your PR. I tried it on 2 GPUs with your MNIST example, but it got stuck like this:
WARNING: One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. Stalled ops: HorovodBroadcast_dense_1_kernel_0 [ready ranks: 1], HorovodBroadcast_dense_bias_0 [ready ranks: 1], HorovodBroadcast_conv2d_1_bias_0 [ready ranks: 1], DistributedMomentumOptimizer_Allreduce/HorovodAllreduce_gradients_dense_BiasAdd_grad_tuple_control_dependency_1_0 [ready ranks: 0], DistributedMomentumOptimizer_Allreduce/HorovodAllreduce_gradients_dense_MatMul_grad_tuple_control_dependency_1_0 [ready ranks: 0], HorovodBroadcast_conv2d_bias_Momentum_0 [ready ranks: 1], HorovodBroadcast_dense_kernel_0 [ready ranks: 1], DistributedMomentumOptimizer_Allreduce/HorovodAllreduce_gradients_dense_2_MatMul_grad_tuple_control_dependency_1_0 [ready ranks: 0], DistributedMomentumOptimizer_Allreduce/HorovodAllreduce_gradients_conv2d_2_BiasAdd_grad_tuple_control_dependency_1_0 [ready ranks: 0], HorovodBroadcast_dense_1_bias_0 [ready ranks: 1], DistributedMomentumOptimizer_Allreduce/HorovodAllreduce_gradients_conv2d_2_Conv2D_grad_tuple_control_dependency_1_0 [ready ranks: 0], HorovodBroadcast_dense_1_bias_Momentum_0 [ready ranks: 1], DistributedMomentumOptimizer_Allreduce/HorovodAllreduce_gradients_conv2d_BiasAdd_grad_tuple_control_dependency_1_0 [ready ranks: 0], HorovodBroadcast_conv2d_1_bias_Momentum_0 [ready ranks: 1], DistributedMomentumOptimizer_Allreduce/HorovodAllreduce_gradients_conv2d_Conv2D_grad_tuple_control_dependency_1_0 [ready ranks: 0], DistributedMomentumOptimizer_Allreduce/HorovodAllreduce_gradients_dense_2_BiasAdd_grad_tuple_control_dependency_1_0 [ready ranks: 0], HorovodBroadcast_dense_1_kernel_Momentum_0 [ready ranks: 1], HorovodBroadcast_dense_bias_Momentum_0 [ready ranks: 1], HorovodBroadcast_global_step_0 [ready ranks: 1], HorovodBroadcast_conv2d_1_kernel_Momentum_0 [ready ranks: 1], HorovodBroadcast_dense_kernel_Momentum_0 [ready ranks: 1], HorovodBroadcast_conv2d_1_kernel_0 [ready ranks: 1], HorovodBroadcast_conv2d_bias_0 [ready ranks: 1], HorovodBroadcast_conv2d_kernel_0 [ready ranks: 1], HorovodBroadcast_conv2d_kernel_Momentum_0 [ready ranks: 1]
The issue seems to be caused by desync between the workers. Since "time" is used as the criterion, what happened is that one worker decided it was time for evaluation, while another worker decided to train one more batch. As a result, both workers got stuck.
Do you have any suggestions on how this can be solved?
In my opinion, this is more of a bug than an enhancement.
Try calling estimator.train(...) twice and you'll get the same error message.
To reproduce, use the example code at https://github.com/uber/horovod/blob/master/examples/tensorflow_mnist_estimator.py and add a second identical call to mnist_classifier.train():
mnist_classifier.train(
    input_fn=train_input_fn,
    steps=20000 // hvd.size(),
    hooks=[logging_hook, bcast_hook])
You will get the same issue. Running the same example without the hvd.DistributedOptimizer(...) and hvd.BroadcastGlobalVariablesHook(...) will work with no problems.
Good point. Indeed, having a loop that would repeatedly do training followed by evaluation makes perfect sense.
I have merged the PR. Thanks again, @smurching, for submitting it.
It does not fully solve the original issue in this thread, though, since tf.estimator.train_and_evaluate relies on wall-clock time. One way to solve this would be to send Google a PR adding periodic evaluation based on a number of steps rather than a number of seconds.
@alsrgv @smurching
I am experiencing the same issue (deadlock). Should horovod block the whole training process if one of the workers is gone?
P.S. My current workaround for the issue is having N parallel workers where:
- N-1 workers run a single estimator.train(...) call
- 1 worker runs a loop of estimator.evaluate(...); sleep(...)
@maorzalt, you could do the following (similar to what we do with Keras):
1. estimator.train(...)
2. estimator.evaluate(...) for a subset of data (1/N if you can deterministically partition, or 3/N if you randomly sample - should give good enough results).
3. hvd.allreduce() metrics from the evaluate (2).
4. Repeat from (1).
That's what the Keras ImageNet example is doing.
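A rough sketch of that loop (a sketch only: num_rounds, the step counts, and an eval_input_fn that yields just this worker's shard of the data are illustrative assumptions on top of the MNIST estimator example):

for _ in range(num_rounds):
    # (1) Train for a fixed number of steps on every worker.
    mnist_classifier.train(
        input_fn=train_input_fn,
        steps=1000 // hvd.size(),
        hooks=[bcast_hook])

    # (2) Evaluate each worker on its own subset of the evaluation data.
    eval_results = mnist_classifier.evaluate(
        input_fn=eval_input_fn,
        steps=100)

    # (3) Average the metrics across workers, e.g. by wrapping the metric
    #     tensors in hvd.allreduce() inside model_fn so that eval_results
    #     already holds the cross-worker average.
    # (4) The loop then repeats from (1).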
@alsrgv
Step 4 would be impossible due to calling estimator.train(...) more than once :(
@maorzalt, it should be possible now; the new version 0.12.0 includes @smurching's fix for that :-)
Apologies for being MIA - thanks so much @alsrgv and @maorzalt for the super-quick responses & suggestions!
FWIW another thing I've tried (similar to what @maorzalt described) is running training on all N workers and an evaluate/sleep loop in a separate process on one of the workers, since I tend to run eval pretty infrequently & I want to fully utilize the workers for training. Perf-wise I'm not sure if one approach is better, and there may not be a big difference :)
@smurching - thank you for the feedback.
@alsrgv - thanks for the PR!
I will test it in the upcoming days and tell you if it works on my setup as well. In the meantime, keep up the good work! The horovod approach to scaling the optimization process is amazing.
I tested the mnist example with 2 consecutive estimator.train(...) calls on a machine with 4 GPUs and it works like a charm. Also, the synchronization issues mentioned above were resolved.
Great work @alsrgv & @smurching !
@alsrgv
I apologize for reviving this... but I did experience the synchronization issue above when running my full model and pipeline with tf.estimator.train_and_evaluate(...). The training gets stuck after the first eval and emits errors like this:
... optimizer/DistributedMomentumOptimizer_Allreduce/HorovodAllreduce_optimizer_gradients_model_dense_1_BiasAdd_grad_tuple_control_dependency_1_0 [ready ranks: 2, 3, 1]
I am trying to reproduce this with a short example with no success so far. Any suggestions?
@maorzalt, as I mentioned in https://github.com/uber/horovod/issues/182#issuecomment-369823019, I don't think there's a way to fix tf.estimator.train_and_evaluate(...), but you can instead just do tf.estimator.train(...) followed by tf.estimator.evaluate(...) in a loop.
@alsrgv
Thank you. I changed it as you suggested and it worked.
tf.estimator.train_and_evaluate(...) => estimator.train(...) + estimator.evaluate(...)
I'm starting to get the larger Horovod picture. Workers are synced using the global step. This could potentially cause many additional issues.
Examples:
- 1 out of N workers crashes => all N-1 remaining workers get stuck (waiting for that 1 worker to finish the step)
- 1 out of N workers is slower => all other N-1 workers wait at each step, so training is only as fast as the slowest worker
Are these examples correct by design? If so, how can we help to improve?
ps - the tensorflow patch solution should start here by changing _StopAtSecsHook to something new like _StopEveryNStepsHook
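For what it's worth, a rough sketch of what such a step-based hook could look like (the name StopEveryNStepsHook and all of its details are hypothetical, not part of TensorFlow):

class StopEveryNStepsHook(tf.train.SessionRunHook):
    # Requests a stop after num_steps global steps. Because all ranks advance
    # the global step together, they reach the same boundary and stop at the
    # same time, so no collective op is left waiting.

    def __init__(self, num_steps):
        self._num_steps = num_steps
        self._start_step = None

    def begin(self):
        self._global_step_tensor = tf.train.get_or_create_global_step()

    def before_run(self, run_context):
        # Ask the session to also return the current global step.
        return tf.train.SessionRunArgs(self._global_step_tensor)

    def after_run(self, run_context, run_values):
        global_step = run_values.results
        if self._start_step is None:
            self._start_step = global_step
        if global_step - self._start_step >= self._num_steps:
            run_context.request_stop()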
@maorzalt, yes, that's the nature of synchronous SGD. One caveat: if one of the workers crashes, the other workers will fail rather than get stuck.
@maorzalt Can you share the code that you ultimately got working? I've been trying something like
for n in range(100):
    model_estimator.train(
        input_fn=lambda: get_inputs("train_*.tfrecords"),
        steps=100,
        hooks=[bcast_hook])
    eval_results = model_estimator.evaluate(
        input_fn=lambda: get_inputs("validation.tfrecords"),
        steps=None)
but it seems slow and oddly synchronized. And I haven't been able to make things work using N-1 GPUs for training and 1 for evaluation with something like:
for n in range(100):
    if hvd.rank() != 0:
        model_estimator.train(
            input_fn=lambda: get_inputs("train_*.tfrecords"),
            steps=100,
            hooks=[bcast_hook])
    else:
        eval_results = model_estimator.evaluate(
            input_fn=lambda: get_inputs("validation.tfrecords"),
            steps=None)
And relatedly, that first code snippet results in 4 Python dicts for the eval_results, one for each GPU. Is there a way to all-reduce them into a single one?
@mdagost I'm also curious about distributed evaluation (the original motivation behind this question) - as @alsrgv suggested you could try wrapping the tensors containing your evaluation metrics in an hvd.allreduce() before returning from your estimator's model_fn. I believe you'd still end up with separate Python dictionaries (one per training process / GPU) but they'd be identical / contain the average of your eval metrics across the processes. This seems most useful for cases where the average of the metric on n distinct batches equals the metric computed over the concatenation of the n batches (e.g. 0/1 loss) but could also be useful for e.g. RMSE where the above doesn't hold.
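A minimal sketch of that idea inside a model_fn's EVAL branch (loss, labels, and predicted_classes are assumed to already be defined by the model; your metrics will differ):

if mode == tf.estimator.ModeKeys.EVAL:
    # Per-worker accuracy over this worker's evaluation batches.
    accuracy, update_op = tf.metrics.accuracy(
        labels=labels, predictions=predicted_classes)

    # Average the metric value across all Horovod processes, so every
    # worker's eval_results dict ends up with the same number.
    avg_accuracy = hvd.allreduce(accuracy)

    return tf.estimator.EstimatorSpec(
        mode=mode,
        loss=loss,
        eval_metric_ops={'accuracy': (avg_accuracy, update_op)})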
@smurching, @mdagost, the simple version is to use allreduce. However, for metrics that cannot be simply averaged across workers, I recommend using mpi4py to allgather intermediate evaluation results and compute the final metric value from the pieces.
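For instance, a sketch along those lines with mpi4py (local_predictions, local_labels, and compute_metric are placeholders for whatever your per-worker evaluation produces):

from mpi4py import MPI

comm = MPI.COMM_WORLD

# Each worker first computes its piece of the evaluation locally.
local_results = {'predictions': local_predictions, 'labels': local_labels}

# allgather returns a list with one entry per rank, on every rank.
all_results = comm.allgather(local_results)

predictions = [p for r in all_results for p in r['predictions']]
labels = [l for r in all_results for l in r['labels']]

# compute_metric stands in for a metric that cannot simply be averaged
# across workers (e.g. AUC or a median-based error).
final_metric = compute_metric(predictions, labels)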
Any final solutions to this issue?
@alsrgv
I have a query about the #182 comment. As far as I know, calling tf.estimator.train(...) and tf.estimator.evaluate(...) in a loop will reset the dataset iterator for the training step every time evaluate is called before the epoch is finished. This is problematic for larger datasets where evaluate is called more frequently.
This seems to be the way TensorFlow has implemented it. A detailed use case is mentioned on Stack Overflow.
Hi all,
if I go with train_and_evaluate() then I have to set RunConfig.save_checkpoints_secs for all Horovod workers in the estimator, or I get the error below. Would this be a bug?
ValueError: There should be a CheckpointSaverHook to use saving_listeners. Please set one of the RunConfig.save_checkpoints_steps or RunConfig.save_checkpoints_secs.
I came across another deadlock scenario when a checkpoint has already been saved after max_steps iterations. In this case rank 0 exits the train call before executing the broadcast hook, and the other processes keep waiting for the matching call.
estimator = tf.estimator.Estimator(...)

# check if training is already done
steps_done = 0
if hvd.rank() == 0 and estimator.latest_checkpoint() is not None:
    # only rank 0 writes checkpoints, so only rank 0 reads the global step
    checkpoint_reader = tf.train.NewCheckpointReader(estimator.latest_checkpoint())
    steps_done = checkpoint_reader.get_tensor(tf.GraphKeys.GLOBAL_STEP)

steps_done = tf.convert_to_tensor(steps_done, dtype=tf.int64)
with tf.Session() as sess:
    # broadcast steps_done from rank 0 to all the processes
    steps_done = hvd.broadcast(steps_done, 0)
    steps_done = sess.run(steps_done)

if max_steps <= steps_done:
    print("Skipping training since max_steps has already been reached.")
else:
    # do the training and eval
    estimator.train(...)
@chychen, I don't think train_and_evaluate will work, since it uses time to decide when to run evaluation, which may cause different workers to execute a different # of batches. We may be able to resolve it once #1058 is implemented.
@pranavladkat, that's a very interesting use case, thanks for the workaround! It's similar to what's happening in Keras to determine the epoch to start from: https://github.com/horovod/horovod/blob/master/examples/keras_imagenet_resnet50.py#L73
@alsrgv for now I only use one worker to run evaluation, and it seems OK.
Correct me if I am wrong. So the conclusion is that train_and_evaluate will not work with Horovod, and the Uber Horovod team is also not seeking any integration with TensorFlow regarding this. We have to iteratively do estimator.train() and estimator.evaluate().
@leimao, at this point that's correct.
Following up on the previous two comments from @alsrgv and @leimao, and addressing the graph re-initialization challenges mentioned in https://github.com/horovod/horovod/issues/182#issuecomment-441830583, TensorFlow estimators do support evaluation on a per-step basis.
Specifically, commit https://github.com/tensorflow/tensorflow/commit/3edb609926f2521c726737fc1efeae1572dc6581 addressed this shortcoming. The relevant discussion took place in https://github.com/tensorflow/tensorflow/issues/17650.
Thus, for training with train_and_evaluate, set the throttle_secs parameter of the EvalSpec you pass in to 0. This will enforce evaluation at the same time as checkpointing, which can be set to a desired number of steps. Horovod will then stay synced across ranks as it performs validation.
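A sketch of that setup, assuming model_fn, train_input_fn, and eval_input_fn are defined elsewhere; the step counts and paths are illustrative, and the model_dir handling follows the MNIST estimator example's convention of only giving rank 0 a permanent directory:

config = tf.estimator.RunConfig(
    model_dir='./checkpoints' if hvd.rank() == 0 else None,
    save_checkpoints_steps=1000)  # checkpoint (and therefore evaluate) every 1000 steps

estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)

train_spec = tf.estimator.TrainSpec(
    input_fn=train_input_fn,
    max_steps=20000 // hvd.size(),
    hooks=[hvd.BroadcastGlobalVariablesHook(0)])

eval_spec = tf.estimator.EvalSpec(
    input_fn=eval_input_fn,
    steps=100,
    throttle_secs=0)  # evaluate as soon as each new checkpoint is written

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)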
Hi all,
I've been trying to run a modified version of the mnist estimator example (link to gist) using tf.estimator.train_and_evaluate to intersperse training & evaluation. I'm hitting the following error when training resumes after the first run of evaluation:

I'm using the following environment: