tensorflow / benchmarks

A benchmark framework for Tensorflow
Apache License 2.0

Running Benchmarks on Multiple Hosts with Multiple GPUs #182

Closed: vilmara closed this issue 6 years ago

vilmara commented 6 years ago

Hi, I need to run the benchmarks on a multi-node system (2 hosts, each with 4 GPUs) using the ImageNet dataset. Could you please assist by providing updated command lines to run the benchmarks under this configuration?

Thanks

ppwwyyxx commented 6 years ago

See https://www.tensorflow.org/performance/performance_models#executing_the_script

vilmara commented 6 years ago

@ppwwyyxx thanks for your prompt reply. I typed the instructions below on each host, but I got the following errors:

Run the following commands on host_0 (10.0.0.1):

python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
  --batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
  --job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
  --worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

Error:
2018-05-15 17:37:23.851129: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0

Run the following commands on host_1 (10.0.0.2):

python3 tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
  --batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
  --job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
  --worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=1

Error:
2018-05-15 17:30:09.894755: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2018-05-15 17:30:09.894816: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:1

ppwwyyxx commented 6 years ago

There are two commands you need to run on each host.
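
For concreteness, here is a minimal sketch of what the pair of commands could look like on host_0 (10.0.0.1), reusing the flags from the commands above; host_1 (10.0.0.2) would use --task_index=1 in both. The ports are the ones already chosen above, --num_gpus is set to the 4 GPUs per host from the original question, and CUDA_VISIBLE_DEVICES='' is only an illustrative way to keep the PS process off the GPUs. Treat this as a sketch, not a verified configuration:

# Parameter server process on host_0
CUDA_VISIBLE_DEVICES='' python tf_cnn_benchmarks.py --local_parameter_device=gpu \
  --num_gpus=4 --batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
  --job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
  --worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

# Worker process on host_0
python tf_cnn_benchmarks.py --local_parameter_device=gpu \
  --num_gpus=4 --batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
  --job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
  --worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0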

vilmara commented 6 years ago

@ppwwyyxx I ran it again with the two commands on each host and I am getting the same errors.

When the system processes the first command, it throws the following error:

2018-05-15 17:55:50.793470: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1
2018-05-15 17:55:58.969446: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error

Then it prints the training header info, prints only the lines below, and stays there with no further output:

Running parameter server 0   # in the case of host_0
Running parameter server 1   # in the case of host_1

I have also tried starting the PS first and then the worker; it didn't throw errors, but each host printed only the output below after printing the header info:

Running parameter server 0   # in the case of host_0
Running parameter server 1   # in the case of host_1

Any recommendations or tips?

boriskovalev commented 6 years ago

Hi, I have the same issues. I see two issues here.

System information

OS Platform and Distribution: Linux Ubuntu 16.04 (Xenial)
TensorFlow installed from: source
TensorFlow version: v1.8.0-1-g8753e2e (1.8.0)
Python version: 2.7
Bazel version: 0.10
GCC/Compiler version: 5.4.0
CUDA/cuDNN version: 9.1, 7.0
GPU model and memory: Tesla P100 PCIe, 16 GB

Issues:

  1. The benchmark launch sequence is critical. A wrong launch sequence causes the benchmark to fail. The same issue existed in the early 0.x versions and was fixed in v1.0; now it's back.
  2. The error below happens on every distributed benchmark launch. We managed to find the root cause and fix the Python code (see below). Error: tensorflow.python.framework.errors_impl.UnavailableError: OS Error (from INFO:tensorflow)

Workarounds:

  1. The processes need to be launched in a strict sequence, with the worker/PS that has the highest index started first. Example: I run 4 PS + 4 workers, so the launch sequence must be PS3->PS2->PS1->PS0->W3->W2->W1->W0 (a compact sketch of this ordering follows this list).
  2. Issue filed with the TF developers: https://github.com/tensorflow/benchmarks/issues/162. The current manual workaround is to edit the benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py file and change lines 1099, 1107 and 1111 to:

     worker_prefix = '/job:worker/replica:0/task:%s' % self.task_index
     '/job:ps/replica:0/task:%s/cpu:0' % i
     self.sync_queue_devices = ['/job:worker/replica:0/task:0/cpu:0']
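
To make workaround 1 concrete, here is a small sketch of a launcher that starts the processes in descending index order, assuming the nodes are reachable through hypothetical ssh aliases node0..node3 (standing in for 12.12.12.41-44); the script path, aliases and sleep are illustrative only, and the exact commands I actually run are listed below:

# Hypothetical launcher for workaround 1: highest task_index first.
PS_HOSTS=12.12.12.41:51000,12.12.12.42:51001,12.12.12.43:51002,12.12.12.44:51003
WORKER_HOSTS=12.12.12.41:51004,12.12.12.42:51005,12.12.12.43:51006,12.12.12.44:51007

# PS3 -> PS2 -> PS1 -> PS0
for i in 3 2 1 0; do
  ssh node$i "CUDA_VISIBLE_DEVICES='' python -u tf_cnn_benchmarks.py --job_name=ps --task_index=$i \
    --ps_hosts=$PS_HOSTS --worker_hosts=$WORKER_HOSTS --server_protocol=grpc" &
  sleep 1   # crude way to enforce the launch ordering
done

# W3 -> W2 -> W1 -> W0, started after the parameter servers
for i in 3 2 1 0; do
  ssh node$i "python -u tf_cnn_benchmarks.py --job_name=worker --task_index=$i \
    --ps_hosts=$PS_HOSTS --worker_hosts=$WORKER_HOSTS --model=vgg16 --batch_size=64 \
    --num_gpus=1 --local_parameter_device=gpu --server_protocol=grpc" &
  sleep 1
done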

But I still see the error "Master init: Unavailable: OS Error" when I run VGG16 with the grpc protocol only and with 4 GPUs or more per job. It sometimes works, about 1 time in 10. With the grpc+verbs protocol I don't see it.

Exact command to reproduce

PS3
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/gdrcopy TF_CPP_MIN_VLOG_LEVEL=0 CUDA_VISIBLE_DEVICES='' python -u /tmp/tmp.tYRLIMH6fIKA/tf_cnn_benchmarks.py --job_name=ps --task_index=3 --worker_hosts=12.12.12.41:51004,12.12.12.42:51005,12.12.12.43:51006,12.12.12.44:51007 --ps_hosts=12.12.12.41:51000,12.12.12.42:51001,12.12.12.43:51002,12.12.12.44:51003 --server_protocol=grpc

PS2
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/gdrcopy TF_CPP_MIN_VLOG_LEVEL=0 CUDA_VISIBLE_DEVICES='' python -u /tmp/tmp.tYRLIMH6fIKA/tf_cnn_benchmarks.py --job_name=ps --task_index=2 --worker_hosts=12.12.12.41:51004,12.12.12.42:51005,12.12.12.43:51006,12.12.12.44:51007 --ps_hosts=12.12.12.41:51000,12.12.12.42:51001,12.12.12.43:51002,12.12.12.44:51003 --server_protocol=grpc

PS1
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/gdrcopy TF_CPP_MIN_VLOG_LEVEL=0 CUDA_VISIBLE_DEVICES='' python -u /tmp/tmp.tYRLIMH6fIKA/tf_cnn_benchmarks.py --job_name=ps --task_index=1 --worker_hosts=12.12.12.41:51004,12.12.12.42:51005,12.12.12.43:51006,12.12.12.44:51007 --ps_hosts=12.12.12.41:51000,12.12.12.42:51001,12.12.12.43:51002,12.12.12.44:51003 --server_protocol=grpc

PS0
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/gdrcopy TF_CPP_MIN_VLOG_LEVEL=0 CUDA_VISIBLE_DEVICES='' python -u /tmp/tmp.tYRLIMH6fIKA/tf_cnn_benchmarks.py --job_name=ps --task_index=0 --worker_hosts=12.12.12.41:51004,12.12.12.42:51005,12.12.12.43:51006,12.12.12.44:51007 --ps_hosts=12.12.12.41:51000,12.12.12.42:51001,12.12.12.43:51002,12.12.12.44:51003 --server_protocol=grpc

Worker3
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/gdrcopy TF_CPP_MIN_VLOG_LEVEL=0 python -u /tmp/tmp.tYRLIMH6fIKA/tf_cnn_benchmarks.py --job_name=worker --task_index=3 --worker_hosts=12.12.12.41:51004,12.12.12.42:51005,12.12.12.43:51006,12.12.12.44:51007 --ps_hosts=12.12.12.41:51000,12.12.12.42:51001,12.12.12.43:51002,12.12.12.44:51003 --model=vgg16 --batch_size=64 --num_gpus=1 --local_parameter_device=gpu --server_protocol=grpc

Worker2
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/gdrcopy TF_CPP_MIN_VLOG_LEVEL=0 python -u /tmp/tmp.tYRLIMH6fIKA/tf_cnn_benchmarks.py --job_name=worker --task_index=2 --worker_hosts=12.12.12.41:51004,12.12.12.42:51005,12.12.12.43:51006,12.12.12.44:51007 --ps_hosts=12.12.12.41:51000,12.12.12.42:51001,12.12.12.43:51002,12.12.12.44:51003 --model=vgg16 --batch_size=64 --num_gpus=1 --local_parameter_device=gpu --server_protocol=grpc

Worker1
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/gdrcopy TF_CPP_MIN_VLOG_LEVEL=0 python -u /tmp/tmp.tYRLIMH6fIKA/tf_cnn_benchmarks.py --job_name=worker --task_index=1 --worker_hosts=12.12.12.41:51004,12.12.12.42:51005,12.12.12.43:51006,12.12.12.44:51007 --ps_hosts=12.12.12.41:51000,12.12.12.42:51001,12.12.12.43:51002,12.12.12.44:51003 --model=vgg16 --batch_size=64 --num_gpus=1 --local_parameter_device=gpu --server_protocol=grpc

Worker0
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/gdrcopy TF_CPP_MIN_VLOG_LEVEL=0 python -u /tmp/tmp.tYRLIMH6fIKA/tf_cnn_benchmarks.py --job_name=worker --task_index=0 --worker_hosts=12.12.12.41:51004,12.12.12.42:51005,12.12.12.43:51006,12.12.12.44:51007 --ps_hosts=12.12.12.41:51000,12.12.12.42:51001,12.12.12.43:51002,12.12.12.44:51003 --model=vgg16 --batch_size=64 --num_gpus=1 --local_parameter_device=gpu --server_protocol=grpc

vilmara commented 6 years ago

@ppwwyyxx, I used Horovod and distributed mode ran very fast and cleanly. Also, here is a very good example of the command lines to use with Horovod in Docker: Our benchmark on 64 GPUs #226
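
For reference, here is a minimal sketch of the kind of mpirun launch used with Horovod for the 2-host, 4-GPU-per-host setup in this thread, assuming Open MPI and a Horovod-enabled environment; the host addresses reuse the ones above and the MPI flags are illustrative rather than taken from #226:

# 8 processes total, 4 per host, each driving one GPU via --variable_update=horovod
mpirun -np 8 -H 10.0.0.1:4,10.0.0.2:4 \
  -bind-to none -map-by slot \
  -x LD_LIBRARY_PATH -x PATH \
  python tf_cnn_benchmarks.py \
    --model=resnet50 --batch_size=64 \
    --variable_update=horovod --num_gpus=1   # one GPU per MPI process with Horovod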

vilmara commented 6 years ago

Closing, since I ran the distributed benchmarks using the Horovod framework.

vilmara commented 6 years ago

@tfboyd, @reedwm, @ppwwyyxx I ran the distributed TF benchmarks well with Horovod; however, I still want to make traditional distributed TF work. I have made some progress but am still getting the errors below:

On the primary host:

W0529 19:10:09.920276 139626556880640 tf_logging.py:126] Standard services need a 'logdir' passed to the SessionManager
I0529 19:10:09.920442 139626556880640 tf_logging.py:116] Starting queue runners.
Running warm up
2018-05-29 19:10:14.932029: W tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:237] Replica ID must be 0 in target: /job:worker/task:0/device:CPU:0
2018-05-29 19:10:14.932113: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at prefetching_kernels.cc:304 : Invalid argument: Could not find worker with target: /job:worker/task:0/device:CPU:0 Available workers: /job:ps/replica:0/task:0, /job:ps/replica:0/task:1, /job:worker/replica:0/task:0, /job:worker/replica:0/task:1
2018-05-29 19:10:14.932781: W tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:237] Replica ID must be 0 in target: /job:worker/task:0/device:CPU:0
2018-05-29 19:10:14.932891: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at prefetching_kernels.cc:304 : Invalid argument: Could not find worker with target: /job:worker/task:0/device:CPU:0 Available workers: /job:ps/replica:0/task:0, /job:ps/replica:0/task:1, /job:worker/replica:0/task:0, /job:worker/replica:0/task:1
2018-05-29 19:10:14.935447: W tensorflow/core/kernels/queue_base.cc:285] replicate_variable_1470: Skipping cancelled dequeue attempt with queue not closed

On the secondary host:

I0529 19:10:09.974736 140374619039488 tf_logging.py:116] Starting queue runners.
Running warm up
2018-05-29 19:10:14.938661: W tensorflow/core/kernels/queue_base.cc:285] replicate_variable_1400: Skipping cancelled dequeue attempt with queue not closed
I0529 19:10:15.539394 140374619039488 tf_logging.py:116] Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Could not find worker with target: /job:worker/task:1/device:CPU:0 Available workers: /job:ps/replica:0/task:0, /job:ps/replica:0/task:1, /job:worker/replica:0/task:0, /job:worker/replica:0/task:1

P.S. After eliminating some flags, I realized the errors are produced when using real ImageNet data (instead of synthetic data), depending on the method used for managing variables in distributed mode. Can you recommend which variable-update method produces the best throughput, and which combination of flags to use?
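
For context, here is a sketch of how the real-data runs referred to above are typically expressed in tf_cnn_benchmarks, assuming the standard --data_name/--data_dir flags; the dataset path is a placeholder and the remaining flags are the ones already used earlier in this thread:

# Worker command on host_0 with real ImageNet data; /path/to/imagenet_tfrecords is a placeholder
python tf_cnn_benchmarks.py --local_parameter_device=gpu \
  --num_gpus=4 --batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
  --data_name=imagenet --data_dir=/path/to/imagenet_tfrecords \
  --job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
  --worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0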

tfboyd commented 6 years ago

I have not run multi-node from this script in 6 months. The page with the commands (slightly wrong commands, if I recall) will be removed in a few weeks. I would not call this traditional; no other code does the replicated distributed approach. As you found, I suggest Horovod until there is an out-of-the-box TF API.

Trying it or fixing it does not get you a solution you should use in production, unless you are rolling your own solution. What is the purpose of your test?


vilmara commented 6 years ago

@tfboyd thanks, I am running benchmarks to compare server performance.

tfboyd commented 6 years ago

@vilmara I would 100% use Horovod for that. I know NVIDIA uses a version of it internally with some tweaks, and you can utilize MPI as well if you can enable that in your environment.