@ppwwyyxx thanks for your prompt reply. I typed the commands below on each host, but I got the following errors:
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
  --batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
  --job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
  --worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

Error:
2018-05-15 17:37:23.851129: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
python3 tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
  --batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
  --job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
  --worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=1

Error:
2018-05-15 17:30:09.894755: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2018-05-15 17:30:09.894816: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:1
There are two commands you need to run on each host.
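Roughly, for 10.0.0.1 that means launching the worker command above plus a matching ps command on the same machine (a sketch reusing the same flags; the ps process is typically kept off the GPUs with CUDA_VISIBLE_DEVICES=''):

# worker process on 10.0.0.1
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
  --batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
  --job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
  --worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

# parameter-server process on 10.0.0.1
CUDA_VISIBLE_DEVICES='' python tf_cnn_benchmarks.py \
  --batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
  --job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
  --worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

10.0.0.2 runs the same pair with --task_index=1.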
@ppwwyyxx I ran it again with the two commands on each host and I am getting the same errors.
When the system processes the first command, it throws the following error:
2018-05-15 17:55:50.793470: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1
2018-05-15 17:55:58.969446: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
Then it prints the training header info, prints only the lines below, and stays there with no further output:
Running parameter server 0  # in the case of host_0
Running parameter server 1  # in the case of host_1
I have also tried starting the PS first and then the worker; it didn't throw errors, but each host printed only the output below after the header info:
Running parameter server 0  # in the case of host_0
Running parameter server 1  # in the case of host_1
Any recommendations or tips?
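A quick way to rule out basic connectivity problems behind the "CreateSession still waiting" messages is to check from each host that the ps and worker ports on the other host are reachable; a minimal sketch with nc (assuming it is installed), using the ports from the commands above:

# run on 10.0.0.1, then the mirror-image check on 10.0.0.2
for port in 50000 50001; do
  nc -zv 10.0.0.2 $port
done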
Hi, I have the same problem, and I see two issues here.
OS Platform and Distribution: Linux Ubuntu 16.04 (Xenial)
TensorFlow installed from: source
TensorFlow version: v1.8.0-1-g8753e2e, 1.8.0
Python version: 2.7
Bazel version: 0.10
GCC/Compiler version: 5.4.0
CUDA/cuDNN version: 9.1, 7.0
GPU model and memory: Tesla P100 PCIe 16GB
Issues:
Workarounds:
But I still see the error "Master init: Unavailable: OS Error" when I run VGG16 with the grpc protocol only, with 4 GPUs and more jobs. Sometimes it works, roughly 1 time out of 10. With the grpc+verbs protocol I don't see it.
Exact command to reproduce
PS3 LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/gdrcopy TF_CPP_MIN_VLOG_LEVEL=0 CUDA_VISIBLE_DEVICES='' python -u /tmp/tmp.tYRLIMH6fIKA/tf_cnn_benchmarks.py --job_name=ps --task_index=3 --worker_hosts=12.12.12.41:51004,12.12.12.42:51005,12.12.12.43:51006,12.12.12.44:51007 --ps_hosts=12.12.12.41:51000,12.12.12.42:51001,12.12.12.43:51002,12.12.12.44:51003 --server_protocol=grpc
PS2 LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/gdrcopy TF_CPP_MIN_VLOG_LEVEL=0 CUDA_VISIBLE_DEVICES='' python -u /tmp/tmp.tYRLIMH6fIKA/tf_cnn_benchmarks.py --job_name=ps --task_index=2 --worker_hosts=12.12.12.41:51004,12.12.12.42:51005,12.12.12.43:51006,12.12.12.44:51007 --ps_hosts=12.12.12.41:51000,12.12.12.42:51001,12.12.12.43:51002,12.12.12.44:51003 --server_protocol=grpc
PS1 LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/gdrcopy TF_CPP_MIN_VLOG_LEVEL=0 CUDA_VISIBLE_DEVICES='' python -u /tmp/tmp.tYRLIMH6fIKA/tf_cnn_benchmarks.py --job_name=ps --task_index=1 --worker_hosts=12.12.12.41:51004,12.12.12.42:51005,12.12.12.43:51006,12.12.12.44:51007 --ps_hosts=12.12.12.41:51000,12.12.12.42:51001,12.12.12.43:51002,12.12.12.44:51003 --server_protocol=grpc
PS0 LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/gdrcopy TF_CPP_MIN_VLOG_LEVEL=0 CUDA_VISIBLE_DEVICES='' python -u /tmp/tmp.tYRLIMH6fIKA/tf_cnn_benchmarks.py --job_name=ps --task_index=0 --worker_hosts=12.12.12.41:51004,12.12.12.42:51005,12.12.12.43:51006,12.12.12.44:51007 --ps_hosts=12.12.12.41:51000,12.12.12.42:51001,12.12.12.43:51002,12.12.12.44:51003 --server_protocol=grpc
Worker3 LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/gdrcopy TF_CPP_MIN_VLOG_LEVEL=0 python -u /tmp/tmp.tYRLIMH6fIKA/tf_cnn_benchmarks.py --job_name=worker --task_index=3 --worker_hosts=12.12.12.41:51004,12.12.12.42:51005,12.12.12.43:51006,12.12.12.44:51007 --ps_hosts=12.12.12.41:51000,12.12.12.42:51001,12.12.12.43:51002,12.12.12.44:51003 --model=vgg16 --batch_size=64 --num_gpus=1 --local_parameter_device=gpu --server_protocol=grpc
Worker2 LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/gdrcopy TF_CPP_MIN_VLOG_LEVEL=0 python -u /tmp/tmp.tYRLIMH6fIKA/tf_cnn_benchmarks.py --job_name=worker --task_index=2 --worker_hosts=12.12.12.41:51004,12.12.12.42:51005,12.12.12.43:51006,12.12.12.44:51007 --ps_hosts=12.12.12.41:51000,12.12.12.42:51001,12.12.12.43:51002,12.12.12.44:51003 --model=vgg16 --batch_size=64 --num_gpus=1 --local_parameter_device=gpu --server_protocol=grpc
Worker1 LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/gdrcopy TF_CPP_MIN_VLOG_LEVEL=0 python -u /tmp/tmp.tYRLIMH6fIKA/tf_cnn_benchmarks.py --job_name=worker --task_index=1 --worker_hosts=12.12.12.41:51004,12.12.12.42:51005,12.12.12.43:51006,12.12.12.44:51007 --ps_hosts=12.12.12.41:51000,12.12.12.42:51001,12.12.12.43:51002,12.12.12.44:51003 --model=vgg16 --batch_size=64 --num_gpus=1 --local_parameter_device=gpu --server_protocol=grpc
Worker0 LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/gdrcopy TF_CPP_MIN_VLOG_LEVEL=0 python -u /tmp/tmp.tYRLIMH6fIKA/tf_cnn_benchmarks.py --job_name=worker --task_index=0 --worker_hosts=12.12.12.41:51004,12.12.12.42:51005,12.12.12.43:51006,12.12.12.44:51007 --ps_hosts=12.12.12.41:51000,12.12.12.42:51001,12.12.12.43:51002,12.12.12.44:51003 --model=vgg16 --batch_size=64 --num_gpus=1 --local_parameter_device=gpu --server_protocol=grpc
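For comparison, the grpc+verbs runs mentioned above differ only in the --server_protocol flag; e.g. the Worker0 command becomes (assuming TensorFlow was built with verbs/RDMA support):

LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/gdrcopy TF_CPP_MIN_VLOG_LEVEL=0 python -u /tmp/tmp.tYRLIMH6fIKA/tf_cnn_benchmarks.py --job_name=worker --task_index=0 --worker_hosts=12.12.12.41:51004,12.12.12.42:51005,12.12.12.43:51006,12.12.12.44:51007 --ps_hosts=12.12.12.41:51000,12.12.12.42:51001,12.12.12.43:51002,12.12.12.44:51003 --model=vgg16 --batch_size=64 --num_gpus=1 --local_parameter_device=gpu --server_protocol=grpc+verbs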
@ppwwyyxx, I used Horovod and ran in distributed mode very fast and cleanly. Also, here is a very good example of the command lines to use with Horovod in Docker: Our benchmark on 64 GPUs #226
Closing since I ran the distributed benchmarks using the Horovod framework.
@tfboyd, @reedwm, @ppwwyyxx I ran the distributed TF benchmarks well with Horovod; however, I still want to make traditional distributed TF work. I have made some progress but am still getting the errors below:
On primary host:
W0529 19:10:09.920276 139626556880640 tf_logging.py:126] Standard services need a 'logdir' passed to the SessionManager
I0529 19:10:09.920442 139626556880640 tf_logging.py:116] Starting queue runners.
Running warm up
2018-05-29 19:10:14.932029: W tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:237] Replica ID must be 0 in target: /job:worker/task:0/device:CPU:0
2018-05-29 19:10:14.932113: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at prefetching_kernels.cc:304 : Invalid argument: Could not find worker with target: /job:worker/task:0/device:CPU:0 Available workers: /job:ps/replica:0/task:0, /job:ps/replica:0/task:1, /job:worker/replica:0/task:0, /job:worker/replica:0/task:1
2018-05-29 19:10:14.932781: W tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:237] Replica ID must be 0 in target: /job:worker/task:0/device:CPU:0
2018-05-29 19:10:14.932891: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at prefetching_kernels.cc:304 : Invalid argument: Could not find worker with target: /job:worker/task:0/device:CPU:0 Available workers: /job:ps/replica:0/task:0, /job:ps/replica:0/task:1, /job:worker/replica:0/task:0, /job:worker/replica:0/task:1
2018-05-29 19:10:14.935447: W tensorflow/core/kernels/queue_base.cc:285] replicate_variable_1470: Skipping cancelled dequeue attempt with queue not closed
On secondary host:
I0529 19:10:09.974736 140374619039488 tf_logging.py:116] Starting queue runners.
Running warm up
2018-05-29 19:10:14.938661: W tensorflow/core/kernels/queue_base.cc:285] replicate_variable_1400: Skipping cancelled dequeue attempt with queue not closed
I0529 19:10:15.539394 140374619039488 tf_logging.py:116] Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Could not find worker with target: /job:worker/task:1/device:CPU:0 Available workers: /job:ps/replica:0/task:0, /job:ps/replica:0/task:1, /job:worker/replica:0/task:0, /job:worker/replica:0/task:1
P.S. After eliminating some flags I realized the errors are produced when using real ImageNet data (instead of synthetic data), and they depend on the method used for managing variables in distributed mode. Can you recommend which variable-management method produces the best throughput, and which combination of flags to use?
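For context, the real-data runs only differ from the synthetic-data ones in the data flags; a sketch of such a worker command (the ImageNet path is a placeholder, and the variable-management method is whichever --variable_update value is being tested):

python tf_cnn_benchmarks.py --model=resnet50 --batch_size=64 --num_gpus=8 \
  --variable_update=distributed_replicated --local_parameter_device=gpu \
  --data_name=imagenet --data_dir=/path/to/imagenet_tfrecords \
  --job_name=worker --task_index=0 \
  --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
  --worker_hosts=10.0.0.1:50001,10.0.0.2:50001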
I have not run multi-node from this script in 6 months. The page with the commands (slightly wrong commands, if I recall) will be removed in a few weeks. I would not call this traditional; no other code does the replicated distributed approach. Since you already ran it, I suggest Horovod until there is an out-of-the-box TF API.
Trying or fixing it does not get you a solution you should use in production unless you are rolling your own solution. What is the purpose of your test?
@tfboyd thanks, I am running the benchmarks to compare server performance.
@vilmara I would 100% use Horovod for that. I know NVIDIA uses a version of it internally with some tweaks and you can utilize MPI as well if you can enable that in your environment.
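A minimal sketch of the Horovod route with tf_cnn_benchmarks (hostnames, slot counts, and the data path are placeholders; assumes Horovod and Open MPI are installed on both hosts, one process per GPU):

mpirun -np 8 -H host1:4,host2:4 \
  -bind-to none -map-by slot \
  -x LD_LIBRARY_PATH -x PATH \
  python tf_cnn_benchmarks.py --model=resnet50 --batch_size=64 \
    --variable_update=horovod --num_gpus=1 \
    --data_name=imagenet --data_dir=/path/to/imagenet_tfrecords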
Hi, I need to run the benchmarks on a multi-node system (2 hosts, each with 4 GPUs) using the ImageNet dataset. Could you please assist by providing the updated command lines to run the benchmarks under this configuration?
Thanks