tensorflow / benchmarks

A benchmark framework for Tensorflow
Apache License 2.0

How to run the benchmark in the distributed mode? #65

Open yupeng9 opened 7 years ago

yupeng9 commented 7 years ago

Hi,

I followed the instructions from the [performance page](https://www.tensorflow.org/performance/performance_models) and ran on two EC2 p2.8xlarge instances, using the same benchmark hash (Benchmark GitHub hash: 9165a70).

# Run the following commands on host_0 (10.0.0.1):
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

# Run the following commands on host_1 (10.0.0.2):
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=1

python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=1

However, the worker failed with:

Generating model
save variable global_step:0
save variable ps_var/v0/conv0/conv2d/kernel:0
save variable ps_var/v0/conv0/biases:0
save variable ps_var/v0/conv1/conv2d/kernel:0
save variable ps_var/v0/conv1/biases:0
save variable ps_var/v0/conv2/conv2d/kernel:0
save variable ps_var/v0/conv2/biases:0
save variable ps_var/v0/conv3/conv2d/kernel:0
save variable ps_var/v0/conv3/biases:0
save variable ps_var/v0/conv4/conv2d/kernel:0
save variable ps_var/v0/conv4/biases:0
save variable ps_var/v0/affine0/weights:0
save variable ps_var/v0/affine0/biases:0
save variable ps_var/v0/affine1/weights:0
save variable ps_var/v0/affine1/biases:0
save variable ps_var/v0/affine2/weights:0
save variable ps_var/v0/affine2/biases:0
Traceback (most recent call last):
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1096, in <module>
    tf.app.run()
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1092, in main
    bench.run()
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 573, in run
    self._benchmark_cnn()
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 674, in _benchmark_cnn
    start_standard_services=start_standard_services) as sess:
  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 964, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 792, in stop
    stop_grace_period_secs=self._stop_grace_secs)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 953, in managed_session
    start_standard_services=start_standard_services)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 708, in prepare_or_wait_for_session
    init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 279, in prepare_session
    sess.run(init_op, feed_dict=init_feed_dict)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1124, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    options, run_metadata)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[3,3,384,256]
         [[Node: v0/conv4/conv2d/kernel/Initializer/random_uniform/RandomUniform = RandomUniform[T=DT_INT32, _class=["loc:@v0/conv4/conv2d/kernel"], dtype=DT_FLOAT, seed=1234, seed2=132, _device="/job:worker/replica:0/task:0/gpu:0"](v0/conv4/conv2d/kernel/Initializer/random_uniform/shape)]]
         [[Node: v0/conv2/biases/Initializer/Const_S21 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:0/gpu:0", send_device="/job:worker/replica:0/task:0/gpu:0", send_device_incarnation=-2694717678558735913, tensor_name="edge_53_v0/conv2/biases/Initializer/Const", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:0/gpu:0"]()]]

Caused by op u'v0/conv4/conv2d/kernel/Initializer/random_uniform/RandomUniform', defined at:
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1096, in <module>
    tf.app.run()
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1092, in main
    bench.run()
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 573, in run
    self._benchmark_cnn()
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 620, in _benchmark_cnn
    (enqueue_ops, fetches) = self._build_model()
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 791, in _build_model
    gpu_grad_stage_ops)
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 952, in add_forward_pass_and_gradients
    self.model.add_inference(network)
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/alexnet_model.py", line 42, in add_inference
    cnn.conv(256, 3, 3)
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/convnet_builder.py", line 103, in conv
    use_bias=False)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/layers/convolutional.py", line 551, in conv2d
    return layer.apply(inputs)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/layers/base.py", line 503, in apply
    return self.__call__(inputs, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/layers/base.py", line 443, in __call__
    self.build(input_shapes[0])
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/layers/convolutional.py", line 137, in build
    dtype=self.dtype)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/layers/base.py", line 383, in add_variable
    trainable=trainable and self.trainable)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 1065, in get_variable
    use_resource=use_resource, custom_getter=custom_getter)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 962, in get_variable
    use_resource=use_resource, custom_getter=custom_getter)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 360, in get_variable
    validate_shape=validate_shape, use_resource=use_resource)
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/variable_mgr.py", line 84, in __call__
    return getter(name, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 352, in _true_getter
    use_resource=use_resource)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 725, in _get_single_variable
    validate_shape=validate_shape)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 199, in __init__
    expected_shape=expected_shape)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 277, in _init_from_args
    initial_value(), name="initial_value", dtype=dtype)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 701, in <lambda>
    shape.as_list(), dtype=dtype, partition_info=partition_info)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/init_ops.py", line 441, in __call__
    dtype, seed=self.seed)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/random_ops.py", line 240, in random_uniform
    shape, dtype, seed=seed1, seed2=seed2)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/gen_random_ops.py", line 247, in _random_uniform
    seed=seed, seed2=seed2, name=name)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[3,3,384,256]
         [[Node: v0/conv4/conv2d/kernel/Initializer/random_uniform/RandomUniform = RandomUniform[T=DT_INT32, _class=["loc:@v0/conv4/conv2d/kernel"], dtype=DT_FLOAT, seed=1234, seed2=132, _device="/job:worker/replica:0/task:0/gpu:0"](v0/conv4/conv2d/kernel/Initializer/random_uniform/shape)]]
         [[Node: v0/conv2/biases/Initializer/Const_S21 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:0/gpu:0", send_device="/job:worker/replica:0/task:0/gpu:0", send_device_incarnation=-2694717678558735913, tensor_name="edge_53_v0/conv2/biases/Initializer/Const", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:0/gpu:0"]()]]

It seems each TF process allocates all of the available GPU memory, so the worker cannot get any memory if I start the parameter server command first.

Likewise, if I run the worker first, then the parameter server cannot get any memory.
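
For context, this is TensorFlow's default behavior: each process maps nearly all of the memory on every GPU it can see, whether or not it places any ops there. A quick way to watch it happen, assuming nvidia-smi is available on the instances, is something like:

# In a second shell, poll GPU memory once per second while the first process starts;
# it will show close to the full 16 GB in use on all eight GPUs before a single
# training step has run, leaving nothing for a second process.
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 1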

ppwwyyxx commented 7 years ago

I always started the worker first and then started PS with CUDA_VISIBLE_DEVICES=
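
In command form, that is roughly (a minimal sketch; the remaining flags stay exactly as in the commands above):

# On each host: start the worker normally, then start the PS with an empty
# CUDA_VISIBLE_DEVICES so it runs on the CPU and never touches GPU memory.
python tf_cnn_benchmarks.py --job_name=worker ... &
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --job_name=ps ...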

yupeng9 commented 7 years ago

Right, if I start the worker first, then the PS will also show an OOM error. Will CUDA_VISIBLE_DEVICES disable the GPU devices for the PS?

btw, if this is required, then can someone update the official guide?
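
For what it's worth, one quick way to confirm that an empty CUDA_VISIBLE_DEVICES really does hide the GPUs (a sketch using the TF 1.x device_lib helper, not part of the benchmark itself):

# TensorFlow should report only a CPU device when CUDA_VISIBLE_DEVICES is empty,
# so a PS started this way cannot allocate any GPU memory.
CUDA_VISIBLE_DEVICES= python -c \
  "from tensorflow.python.client import device_lib; \
   print([d.name for d in device_lib.list_local_devices()])"
# expected: only a CPU entry, no GPU devices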

tfboyd commented 7 years ago

I forgot to update it with CUDA_VISIBLE_DEVICES. I made some other mistakes as well in copying the commands I used to run the benchmark. I use a wrapper that manages the args for me when running the tests, and I was not careful enough when typing them out by hand. I will try to find time to update the information and get it pushed out. Pushes to the website can take a long time; I will see if I can make the change and speed up the publish.

tfboyd commented 7 years ago

@yupeng9 If you are doing distributed TensorFlow on just a few servers, I would check out this example, which includes TensorBoard outputs and other nice features like automatic evaluation. Or you could try the Uber project (Horovod), which is also nice for distributed training; I have not personally tested it, but I have seen their results and they are good. We are working on a nicer high-level API in TensorFlow for distributed training, but the above options are currently the best.

yupeng9 commented 7 years ago

@tfboyd thanks for the information.

Since pushing to the website can take a while, do you mind posting the instructions here once you have it?

I took a look at cifar10. Is there a plan to update tf_cnn_benchmarks to include those additional features? A nice thing I see in tf_cnn_benchmarks is that it is more of a general benchmark test bed: it supports multiple models as well as different datasets, and therefore it also allows future additions.

More importantly, the TensorFlow website publishes useful results from this benchmark, so it has great reference value.

DjangoPeng commented 6 years ago

@yupeng9 What's the progress of your distributed testing? I'm starting to run the distributed TensorFlow benchmarks. @tfboyd It seems the official guide still has not been updated?

Zhaojp-Frank commented 6 years ago

+1. Any update on the latest doc for the distributed training steps? Thanks.

tfboyd commented 6 years ago

I doubt I will update the web page anytime soon. I must have been in a hurry when I typed up that page; I also use my own testing harness that builds the commands, and I likely failed to copy and paste my exact commands from the logs. I did test what is likely the most recent code on AWS two weeks ago, and everything seemed fine with TF 1.4. It was a very small test with 2x p2.8xlarge instances.

I would suggest people not use this code unless they are going to write their own distributed or multi-GPU setup and can understand the variable management aspects. We use this code to test new ideas and a lot of different variations that are not matrix tested, meaning option A may not even work with option D and that will not be documented. I am putting all of my time into helping the team get clean examples published with known accuracy numbers over the next few months.

reedwm commented 6 years ago

As @ppwwyyxx stated, when running the parameter servers on the same hosts as the workers, you should prefix the parameter server commands with CUDA_VISIBLE_DEVICES=. This hides the GPUs from TensorFlow so it will not use them or allocate memory on them. I haven't tried it myself, but the updated commands should be:

# Run the following commands on host_0 (10.0.0.1):
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

# Run the following commands on host_1 (10.0.0.2):
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=1

CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=1

I'm currently blocked by this issue, but afterwards, once I have time, I can update the README (and the website once I figure out how) with the updated commands.

DjangoPeng commented 6 years ago

@reedwm How about setting CUDA_VISIBLE_DEVICES={0..7} for the corresponding worker? For example, GPU 0 for worker 0. The command would be:

CUDA_VISIBLE_DEVICES=0 python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=1 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

reedwm commented 6 years ago

In the above example, each worker is on a separate machine, since they have different IP addresses (10.0.0.1 and 10.0.0.2). So they will each have their own set of 8 GPUs, and CUDA_VISIBLE_DEVICES should not be set.

If multiple worker processes are run on the same machine, your strategy of setting CUDA_VISIBLE_DEVICES will work. But it's better to run a single worker per machine and have each worker use all the GPUs on the machine.
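
For example, two workers on a single 8-GPU host could be split like this (an illustrative sketch; the second port and the second --worker_hosts entry are assumptions, not from the guide):

# Worker 0 sees GPUs 0-3 and worker 1 sees GPUs 4-7; each runs with --num_gpus=4.
CUDA_VISIBLE_DEVICES=0,1,2,3 python tf_cnn_benchmarks.py --num_gpus=4 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.1:50002 --task_index=0

CUDA_VISIBLE_DEVICES=4,5,6,7 python tf_cnn_benchmarks.py --num_gpus=4 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.1:50002 --task_index=1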

DjangoPeng commented 6 years ago

Yep! I know the trick of setting CUDA_VISIBLE_DEVICES. But I just have 3 machines, with two 1080 Tis per machine. So the recommended cluster specification is 3 parameter servers and 3 workers, with one ps and one worker per machine. Am I right?

reedwm commented 6 years ago

Yep, that is correct. On each machine, the worker will have access to both GPUs, and the parameter server will not since CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks ... will be run.
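
Spelled out for one of the three machines (say machine 0; the IPs 10.0.0.1 to 10.0.0.3 and the ports are illustrative, and the other two machines repeat this with --task_index=1 and 2):

# Worker on machine 0 uses both of its local GPUs.
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=2 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000,10.0.0.3:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001,10.0.0.3:50001 --task_index=0

# PS on machine 0 runs with the GPUs hidden.
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=2 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000,10.0.0.3:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001,10.0.0.3:50001 --task_index=0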

Zhaojp-Frank commented 6 years ago

@reedwm A question about the start order: for example, on the same host A, once I run the above command to start the worker, the shell does not return; instead it keeps running, e.g. trying to start the session. So when should I start the PS? Is any strict order required? I have run into a number of errors such as 'Attempting to use uninitialized value p' and 'expects a different device'. It would be great to document the start-order info.

DjangoPeng commented 6 years ago

@Zhaojp-Frank Generally speaking, you'd better launch the PS processes before worker 0. If no PS is up and running, worker 0 will throw the uninitialized-value error.
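
For instance, on each host, something like the following (the log file name and the trailing flags are placeholders):

# Start the PS in the background first, then the worker in the foreground.
CUDA_VISIBLE_DEVICES= nohup python tf_cnn_benchmarks.py --job_name=ps ... > ps.log 2>&1 &
python tf_cnn_benchmarks.py --job_name=worker ...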

reedwm commented 6 years ago

You should be able to launch the processes in any order. @DjangoPeng, what are the commands you use that sometimes cause an uninitialized error?

abidmalikwaterloo commented 6 years ago

Do we have to kill the parameter servers manually when the job is done?

reedwm commented 6 years ago

@abidmalik1967, yes.
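
For instance (a blunt but common approach; adjust the pattern to match your own command line):

# The PS processes keep serving after the workers finish, so kill them by hand
# on each host, e.g. by matching the --job_name flag in the command line.
pkill -f 'tf_cnn_benchmarks.py.*--job_name=ps'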

vilmara commented 6 years ago

Hi @reedwm / @tfboyd, I am running the benchmarks on a multi-node system (2 hosts, each with 4 GPUs), following the instructions below from https://www.tensorflow.org/performance/performance_models#executing_the_script, but I am getting errors (note that I replaced python with python3 and used --num_gpus=4 for each host).

Run the following commands on host_0 (10.0.0.1):

python3 tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

python3 tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

Run the following commands on host_1 (10.0.0.2):

python3 tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=1

python3 tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=1

When the system processes the first command, it throws the following error on each host:

host_0 output:

2018-05-15 18:32:29.136718: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2018-05-15 18:32:29.136759: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:1
2018-05-15 18:32:29.136775: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1
2018-05-15 18:32:37.369403: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error

host_1 output:

2018-05-15 18:32:47.220352: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:1
2018-05-15 18:32:47.220364: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:0
2018-05-15 18:32:54.466053: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error

When the system runs the second command, it prints the training info, then just prints the lines below and doesn't produce any more output; the processes look like they are on hold on each host:

Running parameter server 0  # in the case of host_0
Running parameter server 1  # in the case of host_1
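
Those "CreateSession still waiting for response from worker" messages usually mean one task cannot reach another at the listed address, so a first sanity check (assuming nc is installed) is to verify connectivity from each host to every ps/worker port while the processes are running:

# From host_0, with the tasks started on both hosts:
nc -zv 10.0.0.1 50000   # ps task 0
nc -zv 10.0.0.2 50000   # ps task 1
nc -zv 10.0.0.1 50001   # worker task 0
nc -zv 10.0.0.2 50001   # worker task 1
# A refused connection or timeout here points at a firewall / security-group rule
# or at a task that never started, rather than at the benchmark script itself.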

abidmalikwaterloo commented 6 years ago

If you want to try distributed learning, try Horovod.

https://github.com/uber/horovod

It's much cleaner and gives better performance.
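
For tf_cnn_benchmarks specifically, newer versions of the script also have a Horovod code path, so a run roughly equivalent to the two-host setup above would look something like the sketch below (illustrative only; it assumes Horovod and Open MPI are installed on both hosts and that your checkout supports --variable_update=horovod):

# 8 processes in total, 4 per host; MPI/Horovod handles the cross-host communication,
# so there are no ps_hosts/worker_hosts flags and no separate parameter servers.
mpirun -np 8 -H 10.0.0.1:4,10.0.0.2:4 \
-bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
python3 tf_cnn_benchmarks.py --model=resnet50 --batch_size=64 \
--variable_update=horovod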
