Hi,
I was wondering if anyone has tried customizing the code for multi-GPU training instead of TPUs. The current code works on a single GPU without many modifications (set use_tpu = False). However, I am running into trouble getting it to run on multiple GPUs.
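For reference, this is roughly the single-GPU setup that works for me (a sketch rather than my exact script; I am assuming the stock tf.contrib.tpu.TPUEstimator path, and model_fn, train_input_fn, num_train_steps and the FLAGS all come from the original code):

import tensorflow as tf

# Single-GPU path: keep the original TPUEstimator and just disable the TPU.
run_config = tf.contrib.tpu.RunConfig(
    model_dir=FLAGS.output_dir,
    save_checkpoints_steps=FLAGS.iterations_per_loop,
    keep_checkpoint_max=5)
estimator = tf.contrib.tpu.TPUEstimator(
    use_tpu=False,  # falls back to plain CPU/GPU execution
    model_fn=model_fn,
    config=run_config,
    train_batch_size=FLAGS.batch_size)
estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)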
I changed the configuration as follows (tensorflow-gpu 1.13.1):
distribution = tf.contrib.distribute.MirroredStrategy(num_gpus=FLAGS.num_gpus)
run_config = tf.estimator.RunConfig(
    log_step_count_steps=10,
    save_summary_steps=10,
    model_dir=FLAGS.output_dir,
    save_checkpoints_steps=FLAGS.iterations_per_loop,
    keep_checkpoint_max=5,
    train_distribute=distribution)
estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    config=run_config,
    model_dir=FLAGS.output_dir,
    params={'batch_size': FLAGS.batch_size})
estimator.train(input_fn=train_input_fn, steps=num_train_steps)
However, I get the following error:

raise ValueError("You must specify an aggregation method to update a "
ValueError: You must specify an aggregation method to update a MirroredVariable in Replica Context.
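If it helps, here is a minimal, self-contained sketch (a hypothetical toy model, not code from this repo) of the pattern that seems to trigger this ValueError: assigning to a mirrored variable inside model_fn, i.e. in replica context, without an aggregation method. Manually bumping the global step after the optimizer runs is a common example of this pattern:

import tensorflow as tf

def toy_model_fn(features, labels, mode, params):
    # Toy linear model so the sketch is self-contained.
    w = tf.get_variable("w", shape=[], initializer=tf.zeros_initializer())
    loss = tf.reduce_mean(tf.square(w * features - labels))
    global_step = tf.train.get_or_create_global_step()
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
    # Under MirroredStrategy this assign runs once per replica, and the
    # mirrored global step has no aggregation method -> ValueError.
    train_op = tf.group(train_op, global_step.assign(global_step + 1))
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

def toy_input_fn():
    # Constant (x, y) pairs, repeated and batched.
    ds = tf.data.Dataset.from_tensors((tf.constant(1.0), tf.constant(2.0)))
    return ds.repeat().batch(8)

distribution = tf.contrib.distribute.MirroredStrategy(num_gpus=2)
config = tf.estimator.RunConfig(train_distribute=distribution)
tf.estimator.Estimator(model_fn=toy_model_fn, config=config).train(
    input_fn=toy_input_fn, steps=10)

If that is indeed the cause here, two workarounds come to mind (untested on my side, so treat them as assumptions): let the optimizer increment the step itself, e.g. optimizer.minimize(loss, global_step=global_step), and drop the manual assign; or create the offending variable with an explicit aggregation, e.g. tf.get_variable(..., aggregation=tf.VariableAggregation.ONLY_FIRST_REPLICA).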
Has anyone found a solution to this? Thanks.