Keepmoving-ZXY closed this issue 4 years ago
Please add export UCX_LOG_LEVEL=info to your script.
This is the full output:
(zxy) [zxy@gpu5 benchmarks]$ sh mpi_controller.sh
+ TF_BATCHSIZE=128
+ TF_MODEL=resnet50
+ TF_PROTOCOL=grpc+mpi
+ JOB_LIST=10.0.24.2:1
+ /opt/openmpi/4.0.2/bin/mpirun -np 1 --host 10.0.24.2:1 -mca pml ob1 -map-by node -mca btl tcp,vader,self -x TF_BATCHSIZE=128 -x TF_MODEL=resnet50 -x TF_PROTOCOL=grpc+mpi -x LD_LIBRARY_PATH=/opt/openmpi/4.0.2/lib -x UCX_LOG_LEVEL=info sh controller.sh
WARNING:tensorflow:From /home/zxy/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
W0324 12:34:01.106592 139683316840256 deprecation.py:323] From /home/zxy/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /home/zxy/benchmarks/scripts/tf_cnn_benchmarks/convnet_builder.py:129: conv2d (from tensorflow.python.layers.convolutional) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.conv2d instead.
W0324 12:34:01.120771 139683316840256 deprecation.py:323] From /home/zxy/benchmarks/scripts/tf_cnn_benchmarks/convnet_builder.py:129: conv2d (from tensorflow.python.layers.convolutional) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.conv2d instead.
WARNING:tensorflow:From /home/zxy/benchmarks/scripts/tf_cnn_benchmarks/convnet_builder.py:261: max_pooling2d (from tensorflow.python.layers.pooling) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.max_pooling2d instead.
W0324 12:34:01.155802 139683316840256 deprecation.py:323] From /home/zxy/benchmarks/scripts/tf_cnn_benchmarks/convnet_builder.py:261: max_pooling2d (from tensorflow.python.layers.pooling) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.max_pooling2d instead.
WARNING:tensorflow:From /home/zxy/lib/python2.7/site-packages/tensorflow/python/ops/losses/losses_impl.py:209: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
W0324 12:34:03.199151 139683316840256 deprecation.py:323] From /home/zxy/lib/python2.7/site-packages/tensorflow/python/ops/losses/losses_impl.py:209: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /home/zxy/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
W0324 12:34:03.335773 139683316840256 deprecation.py:323] From /home/zxy/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /home/zxy/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2238: __init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
W0324 12:34:09.472959 139683316840256 deprecation.py:323] From /home/zxy/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2238: __init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
INFO:tensorflow:Running local_init_op.
I0324 12:34:30.726897 139683316840256 session_manager.py:491] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I0324 12:34:33.151669 139683316840256 session_manager.py:493] Done running local_init_op.
TensorFlow: 1.13
Model: resnet50
Dataset: imagenet (synthetic)
Mode: training
SingleSess: True
Batch size: 256 global
128 per device
Num batches: 100
Num epochs: 0.02
Devices: ['job:worker/replica:0/task0/gpu:0', 'job:worker/replica:0/task1/gpu:0']
NUMA bind: False
Data format: NCHW
Optimizer: sgd
Variables: distributed_all_reduce
AllReduce: xring
Sync: True
==========
Generating training model
Initializing graph
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 131.8 +/- 0.0 (jitter = 0.0) 7.876
10 images/sec: 133.9 +/- 1.3 (jitter = 4.5) 7.908
20 images/sec: 134.6 +/- 0.9 (jitter = 4.8) 7.741
30 images/sec: 133.8 +/- 0.8 (jitter = 5.3) 7.961
40 images/sec: 133.8 +/- 0.7 (jitter = 5.2) 7.840
50 images/sec: 134.0 +/- 0.6 (jitter = 4.4) 7.889
60 images/sec: 133.9 +/- 0.5 (jitter = 4.2) 7.846
70 images/sec: 134.1 +/- 0.5 (jitter = 3.4) 7.826
80 images/sec: 134.2 +/- 0.4 (jitter = 4.2) 7.841
90 images/sec: 134.1 +/- 0.4 (jitter = 4.3) 7.857
100 images/sec: 134.3 +/- 0.4 (jitter = 3.9) 7.812
----------------------------------------------------------------
total images/sec: 134.24
----------------------------------------------------------------
(zxy) [zxy@gpu5 benchmarks]$ sh mpi_worker.sh
+ TF_BATCHSIZE=128
+ TF_MODEL=resnet50
+ TF_PROTOCOL=grpc+mpi
+ JOB_LIST=10.0.24.2:1,10.0.26.2:1
+ /opt/openmpi/4.0.2/bin/mpirun -np 2 --host 10.0.24.2:1,10.0.26.2:1 -map-by node -mca pml ucx -x TF_BATCHSIZE=128 -x TF_MODEL=resnet50 -x TF_PROTOCOL=grpc+mpi -x LD_LIBRARY_PATH=/opt/openmpi/4.0.2/lib -x UCX_TLS=tcp -x UCX_LOG_LEVEL=info sh worker.sh
+ WORKER1=10.0.24.2:8000
+ WORKER2=10.0.26.2:8000
+ CONTROLLER=10.0.24.2:9000
+ TF_TASKID=1
+ TRAIN_MODEL=resnet50
+ PROTOCOL=grpc+mpi
+ BATCH_SIZE_PER_GPU=128
+ ALL_REDUCE_ALG=xring
+ VARIABLE_UPDATE=distributed_all_reduce
+ export LD_LIBRARY_PATH=/opt/openmpi/4.0.2/lib
+ LD_LIBRARY_PATH=/opt/openmpi/4.0.2/lib
+ export CUDA_VISIBLE_DEVICES=0,1
+ CUDA_VISIBLE_DEVICES=0,1
+ /home/zxy/bin/python scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --worker_hosts=10.0.24.2:8000,10.0.26.2:8000 --controller_host=10.0.24.2:9000 --job_name=worker --variable_update=distributed_all_reduce --local_parameter_device=cpu --use_fp16 --batch_size=128 --force_gpu_compatible --num_gpus=2 --model=resnet50 --task_index=1 --server_protocol=grpc+mpi --all_reduce_spec=xring
2020-03-24 12:34:20.755295: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX512F
2020-03-24 12:34:20.816197: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX512F
2020-03-24 12:34:21.136380: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:61:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2020-03-24 12:34:21.416120: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 1 with properties:
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:62:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2020-03-24 12:34:21.416180: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0, 1
2020-03-24 12:34:21.419741: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-03-24 12:34:21.419756: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0 1
2020-03-24 12:34:21.419763: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N Y
2020-03-24 12:34:21.419768: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1: Y N
2020-03-24 12:34:21.420830: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:worker/replica:0/task:1/device:GPU:0 with 30459 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:61:00.0, compute capability: 7.0)
2020-03-24 12:34:21.421732: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:worker/replica:0/task:1/device:GPU:1 with 30459 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:62:00.0, compute capability: 7.0)
2020-03-24 12:34:24.867690: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:61:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2020-03-24 12:34:25.131206: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 1 with properties:
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:62:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2020-03-24 12:34:25.131296: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0, 1
2020-03-24 12:34:25.134240: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-03-24 12:34:25.134255: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0 1
2020-03-24 12:34:25.134262: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N Y
2020-03-24 12:34:25.134285: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1: Y N
2020-03-24 12:34:25.135390: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:0 with 30459 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:61:00.0, compute capability: 7.0)
2020-03-24 12:34:25.136407: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:1 with 30459 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:62:00.0, compute capability: 7.0)
MPI Environment initialized. Process id: 0 Total processes: 2 || Hostname: gpu5.maas
MPI Environment initialized. Process id: 1 Total processes: 2 || Hostname: gpu6.maas
2020-03-24 12:34:25.215261: I tensorflow/contrib/mpi/mpi_utils.cc:41] MPI process-ID to gRPC server name map:
2020-03-24 12:34:25.215289: I tensorflow/contrib/mpi/mpi_utils.cc:45] Process: 0 gRPC-name: worker:0:0
2020-03-24 12:34:25.215298: I tensorflow/contrib/mpi/mpi_utils.cc:45] Process: 1 gRPC-name: worker:0:1
2020-03-24 12:34:25.217218: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:252] Initialize GrpcChannelCache for job worker -> {0 -> localhost:8000, 1 -> 10.0.26.2:8000}
2020-03-24 12:34:25.218391: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:252] Initialize GrpcChannelCache for job worker -> {0 -> 10.0.24.2:8000, 1 -> localhost:8000}
2020-03-24 12:34:25.219733: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:391] Started server with target: grpc://localhost:8000
2020-03-24 12:34:25.220791: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:391] Started server with target: grpc://localhost:8000
TensorFlow: 1.13
Model: resnet50
Dataset: imagenet (synthetic)
Mode: training
SingleSess: True
Batch size: 512 global
128 per device
Num batches: 100
Num epochs: 0.04
Devices: ['job:worker/replica:0/task0/gpu:0', 'job:worker/replica:0/task0/gpu:1', 'job:worker/replica:0/task1/gpu:0', 'job:worker/replica:0/task1/gpu:1']
NUMA bind: False
Data format: NCHW
Optimizer: sgd
Variables: distributed_all_reduce
AllReduce: xring
Sync: True
==========
Starting worker 0
TensorFlow: 1.13
Model: resnet50
Dataset: imagenet (synthetic)
Mode: training
SingleSess: True
Batch size: 512 global
128 per device
Num batches: 100
Num epochs: 0.04
Devices: ['job:worker/replica:0/task0/gpu:0', 'job:worker/replica:0/task0/gpu:1', 'job:worker/replica:0/task1/gpu:0', 'job:worker/replica:0/task1/gpu:1']
NUMA bind: False
Data format: NCHW
Optimizer: sgd
Variables: distributed_all_reduce
AllReduce: xring
Sync: True
==========
Starting worker 1
2020-03-24 12:34:26.019660: I tensorflow/core/distributed_runtime/master_session.cc:1192] Start master session 9b718a2f6b884b4b with config: intra_op_parallelism_threads: 1 inter_op_parallelism_threads: 36 gpu_options { force_gpu_compatible: true experimental { } } allow_soft_placement: true graph_options { rewrite_options { pin_to_host_optimization: OFF } } experimental { collective_group_leader: "/job:worker/replica:0/task:0" }
2020-03-24 12:34:45.470916: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2020-03-24 12:34:46.758058: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
The log shows that training terminated normally, but I can't see any output about the NIC selected by UCX. My script is:
#!/bin/bash
set -x
TF_BATCHSIZE=128
TF_MODEL=resnet50
TF_PROTOCOL=grpc+mpi
JOB_LIST='10.0.24.2:1,10.0.26.2:1'
# JOB_LIST='10.0.26.2:1'
/opt/openmpi/4.0.2/bin/mpirun -np 2 --host ${JOB_LIST} \
-map-by node -mca pml ucx \
-x TF_BATCHSIZE=$TF_BATCHSIZE \
-x TF_MODEL=$TF_MODEL \
-x TF_PROTOCOL=$TF_PROTOCOL \
-x LD_LIBRARY_PATH=/opt/openmpi/4.0.2/lib \
-x UCX_TLS=tcp \
-x UCX_LOG_LEVEL=info \
sh worker.sh
The UCX installed on the GPU server comes from the Mellanox OFED driver.
Which UCX version is used?
This is the output of ucx_info -v:
(zxy) [zxy@gpu5 benchmarks]$ ucx_info -v
# UCT version=1.6.0 revision f8b9db6
# configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --enable-cma --with-cuda --with-gdrcopy --with-verbs --with-cm --with-knem --with-rdmacm --without-rocm --without-xpmem --without-ugni --with-cuda=/usr/local/cuda-9.2
I see, so the feature that prints the selected devices exists only since UCX v1.8.0.
OK, I will upgrade UCX to v1.8.0 or later.
UCX v1.8.0 works, thank you.
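For anyone hitting the same thing, a quick check of the installed version before expecting the device-selection log might look like the sketch below. It assumes the ucx_info -v output format shown in this thread (a line starting with "# UCT version="); other UCX releases may print the version differently. The helper names ucx_version_from_info and version_ge are made up for this example.

```shell
#!/bin/sh
# Sketch: check whether the installed UCX is at least 1.8.0, the first
# version that (per this thread) prints the selected devices.

# Read `ucx_info -v` output on stdin and print the version, e.g. 1.6.0.
# Assumes a "# UCT version=X.Y.Z ..." line as shown in this thread.
ucx_version_from_info() {
    sed -n 's/^# UCT version=\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# version_ge A B: succeed if dotted version A >= B (numeric, per field).
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = (i <= na) ? x[i] + 0 : 0
            yi = (i <= nb) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

# Only query UCX if ucx_info is actually on PATH.
if command -v ucx_info >/dev/null 2>&1; then
    v=$(ucx_info -v | ucx_version_from_info)
    if [ -z "$v" ]; then
        echo "could not parse the UCX version from ucx_info -v"
    elif version_ge "$v" 1.8.0; then
        echo "UCX $v: UCX_LOG_LEVEL=info should report the selected devices"
    else
        echo "UCX $v: upgrade to 1.8.0 or later to see the selected devices"
    fi
fi
```

With the OFED-provided build above (1.6.0) this would take the "upgrade" branch.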
Hello, today I ran TensorFlow with MPI support, which sends tensor contents over OpenMPI, and I found that UCX is the default transport component of OpenMPI. My script is:
The content of worker.sh is:
In my scripts I don't set UCX_NET_DEVICES. In theory UCX should choose the NIC with the highest transfer speed, but UCX doesn't print the selected device. How can I see which NIC UCX finally uses?
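As a complement to the log-based answer in this thread (UCX v1.8.0 or later with UCX_LOG_LEVEL=info), the devices UCX can see on a node can be listed with ucx_info -d. The sketch below condenses that output into "transport device" pairs. It assumes the output contains "Transport:" and "Device:" lines, whose exact layout may vary between UCX versions; list_ucx_devices is a made-up helper name, and eth0 in the comment is a hypothetical device name. Note that this shows the candidates, not the device UCX actually selected at run time.

```shell
#!/bin/sh
# Sketch: list the transport/device pairs that UCX reports on this node.

# Read `ucx_info -d` output on stdin and print one "transport device"
# pair per line. Assumes "# Transport: <name>" and "# Device: <name>"
# lines; the surrounding layout is ignored.
list_ucx_devices() {
    awk '
        /^#[ \t]*Transport:/ { transport = $NF }
        /^#[ \t]*Device:/    { print transport, $NF }
    '
}

# Only query UCX if ucx_info is actually on PATH.
if command -v ucx_info >/dev/null 2>&1; then
    ucx_info -d | list_ucx_devices
fi

# To remove the guesswork, the device can be pinned explicitly in the
# mpirun command, for example (eth0 is a hypothetical device name):
#   -x UCX_TLS=tcp -x UCX_NET_DEVICES=eth0
```

Pinning UCX_NET_DEVICES trades automatic selection for certainty about which NIC carries the traffic, which can be useful when benchmarking.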