openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

How to see which NIC UCX chooses? #4929

Closed: Keepmoving-ZXY closed this issue 4 years ago

Keepmoving-ZXY commented 4 years ago

Hello, today I ran TensorFlow with its MPI support, which sends tensor contents over OpenMPI, and I found that UCX is the default transfer component of OpenMPI. My script is:

#!/bin/bash
set -x

TF_BATCHSIZE=128
TF_MODEL=resnet50
TF_PROTOCOL=grpc+mpi

JOB_LIST='10.0.24.2:1,10.0.26.2:1'
# JOB_LIST='10.0.26.2:1'
/opt/openmpi/4.0.2/bin/mpirun -np 2 --host ${JOB_LIST} \
        -map-by node -mca pml ucx \
        -x TF_BATCHSIZE=$TF_BATCHSIZE \
        -x TF_MODEL=$TF_MODEL \
        -x TF_PROTOCOL=$TF_PROTOCOL \
        -x LD_LIBRARY_PATH=/opt/openmpi/4.0.2/lib \
        -x UCX_TLS=tcp \
        sh worker.sh
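
For context, UCX_TLS=tcp restricts UCX to its TCP transports, and the NICs those transports can use depend on what UCX detects on each host. A minimal way to list the detected transports and devices (assuming ucx_info is on the PATH, as it usually is with a UCX install) is:

# List the transport resources UCX sees on this host.
# Device names (eth0, mlx5_0:1, ...) vary per system.
ucx_info -d | grep -E 'Transport|Device'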

The content of worker.sh is:

#!/bin/bash

# worker addr configure.
WORKER1='10.0.24.2:8000'
WORKER2='10.0.26.2:8000'
CONTROLLER='10.0.24.2:9000'

# tensorflow configure.
TF_TASKID=0
TRAIN_MODEL=${TF_MODEL}

PROTOCOL=${TF_PROTOCOL}
BATCH_SIZE_PER_GPU=${TF_BATCHSIZE}
ALL_REDUCE_ALG='xring'
VARIABLE_UPDATE='distributed_all_reduce'

export CUDA_VISIBLE_DEVICES="0,1"
export LD_LIBRARY_PATH=/opt/openmpi/4.0.2/lib/

python scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
           --worker_hosts=${WORKER1},${WORKER2} \
           --controller_host=${CONTROLLER} \
           --job_name=worker \
           --variable_update=${VARIABLE_UPDATE} \
           --local_parameter_device=cpu \
           --use_fp16 --batch_size=${BATCH_SIZE_PER_GPU} \
           --force_gpu_compatible \
           --num_gpus=2 \
           --model=${TRAIN_MODEL} \
           --task_index=${TF_TASKID} \
           --server_protocol=${PROTOCOL} \
           --all_reduce_spec=${ALL_REDUCE_ALG}

In my scripts I don't set UCX_NET_DEVICES, so in theory UCX should choose the NIC with the highest transfer speed, but UCX doesn't print the selected device. How can I see which NIC UCX finally uses?
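
For reference, auto-selection can also be bypassed: setting UCX_NET_DEVICES pins UCX to a specific NIC. A sketch of the extra mpirun flag (eth0 is a placeholder device name; ucx_info -d lists the real ones):

        -x UCX_NET_DEVICES=eth0 \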

yosefe commented 4 years ago

Add export UCX_LOG_LEVEL=info.
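
For example, applied to the mpirun invocation above (a sketch; the grep pattern assumes UCX's usual "UCX  INFO" log prefix):

/opt/openmpi/4.0.2/bin/mpirun -np 2 --host ${JOB_LIST} \
        -map-by node -mca pml ucx \
        -x UCX_TLS=tcp \
        -x UCX_LOG_LEVEL=info \
        sh worker.sh 2>&1 | grep 'UCX *INFO'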

Keepmoving-ZXY commented 4 years ago

This is all the output:

(zxy) [zxy@gpu5 benchmarks]$ sh mpi_controller.sh 
+ TF_BATCHSIZE=128
+ TF_MODEL=resnet50
+ TF_PROTOCOL=grpc+mpi
+ JOB_LIST=10.0.24.2:1
+ /opt/openmpi/4.0.2/bin/mpirun -np 1 --host 10.0.24.2:1 -mca pml ob1 -map-by node -mca btl tcp,vader,self -x TF_BATCHSIZE=128 -x TF_MODEL=resnet50 -x TF_PROTOCOL=grpc+mpi -x LD_LIBRARY_PATH=/opt/openmpi/4.0.2/lib -x UCX_LOG_LEVEL=info sh controller.sh
WARNING:tensorflow:From /home/zxy/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
W0324 12:34:01.106592 139683316840256 deprecation.py:323] From /home/zxy/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /home/zxy/benchmarks/scripts/tf_cnn_benchmarks/convnet_builder.py:129: conv2d (from tensorflow.python.layers.convolutional) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.conv2d instead.
W0324 12:34:01.120771 139683316840256 deprecation.py:323] From /home/zxy/benchmarks/scripts/tf_cnn_benchmarks/convnet_builder.py:129: conv2d (from tensorflow.python.layers.convolutional) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.conv2d instead.
WARNING:tensorflow:From /home/zxy/benchmarks/scripts/tf_cnn_benchmarks/convnet_builder.py:261: max_pooling2d (from tensorflow.python.layers.pooling) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.max_pooling2d instead.
W0324 12:34:01.155802 139683316840256 deprecation.py:323] From /home/zxy/benchmarks/scripts/tf_cnn_benchmarks/convnet_builder.py:261: max_pooling2d (from tensorflow.python.layers.pooling) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.max_pooling2d instead.
WARNING:tensorflow:From /home/zxy/lib/python2.7/site-packages/tensorflow/python/ops/losses/losses_impl.py:209: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
W0324 12:34:03.199151 139683316840256 deprecation.py:323] From /home/zxy/lib/python2.7/site-packages/tensorflow/python/ops/losses/losses_impl.py:209: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /home/zxy/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
W0324 12:34:03.335773 139683316840256 deprecation.py:323] From /home/zxy/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /home/zxy/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2238: __init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
W0324 12:34:09.472959 139683316840256 deprecation.py:323] From /home/zxy/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2238: __init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
INFO:tensorflow:Running local_init_op.
I0324 12:34:30.726897 139683316840256 session_manager.py:491] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I0324 12:34:33.151669 139683316840256 session_manager.py:493] Done running local_init_op.
TensorFlow:  1.13
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  True
Batch size:  256 global
             128 per device
Num batches: 100
Num epochs:  0.02
Devices:     ['job:worker/replica:0/task0/gpu:0', 'job:worker/replica:0/task1/gpu:0']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   distributed_all_reduce
AllReduce:   xring
Sync:        True
==========
Generating training model
Initializing graph
Running warm up
Done warm up
Step    Img/sec total_loss
1   images/sec: 131.8 +/- 0.0 (jitter = 0.0)    7.876
10  images/sec: 133.9 +/- 1.3 (jitter = 4.5)    7.908
20  images/sec: 134.6 +/- 0.9 (jitter = 4.8)    7.741
30  images/sec: 133.8 +/- 0.8 (jitter = 5.3)    7.961
40  images/sec: 133.8 +/- 0.7 (jitter = 5.2)    7.840
50  images/sec: 134.0 +/- 0.6 (jitter = 4.4)    7.889
60  images/sec: 133.9 +/- 0.5 (jitter = 4.2)    7.846
70  images/sec: 134.1 +/- 0.5 (jitter = 3.4)    7.826
80  images/sec: 134.2 +/- 0.4 (jitter = 4.2)    7.841
90  images/sec: 134.1 +/- 0.4 (jitter = 4.3)    7.857
100 images/sec: 134.3 +/- 0.4 (jitter = 3.9)    7.812
----------------------------------------------------------------
total images/sec: 134.24
----------------------------------------------------------------
(zxy) [zxy@gpu5 benchmarks]$ sh mpi_worker.sh 
+ TF_BATCHSIZE=128
+ TF_MODEL=resnet50
+ TF_PROTOCOL=grpc+mpi
+ JOB_LIST=10.0.24.2:1,10.0.26.2:1
+ /opt/openmpi/4.0.2/bin/mpirun -np 2 --host 10.0.24.2:1,10.0.26.2:1 -map-by node -mca pml ucx -x TF_BATCHSIZE=128 -x TF_MODEL=resnet50 -x TF_PROTOCOL=grpc+mpi -x LD_LIBRARY_PATH=/opt/openmpi/4.0.2/lib -x UCX_TLS=tcp -x UCX_LOG_LEVEL=info sh worker.sh
+ WORKER1=10.0.24.2:8000
+ WORKER2=10.0.26.2:8000
+ CONTROLLER=10.0.24.2:9000
+ TF_TASKID=1
+ TRAIN_MODEL=resnet50
+ PROTOCOL=grpc+mpi
+ BATCH_SIZE_PER_GPU=128
+ ALL_REDUCE_ALG=xring
+ VARIABLE_UPDATE=distributed_all_reduce
+ export LD_LIBRARY_PATH=/opt/openmpi/4.0.2/lib
+ LD_LIBRARY_PATH=/opt/openmpi/4.0.2/lib
+ export CUDA_VISIBLE_DEVICES=0,1
+ CUDA_VISIBLE_DEVICES=0,1
+ /home/zxy/bin/python scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --worker_hosts=10.0.24.2:8000,10.0.26.2:8000 --controller_host=10.0.24.2:9000 --job_name=worker --variable_update=distributed_all_reduce --local_parameter_device=cpu --use_fp16 --batch_size=128 --force_gpu_compatible --num_gpus=2 --model=resnet50 --task_index=1 --server_protocol=grpc+mpi --all_reduce_spec=xring
2020-03-24 12:34:20.755295: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX512F
2020-03-24 12:34:20.816197: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX512F
2020-03-24 12:34:21.136380: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:61:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2020-03-24 12:34:21.416120: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 1 with properties: 
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:62:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2020-03-24 12:34:21.416180: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0, 1
2020-03-24 12:34:21.419741: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-03-24 12:34:21.419756: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 1 
2020-03-24 12:34:21.419763: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N Y 
2020-03-24 12:34:21.419768: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1:   Y N 
2020-03-24 12:34:21.420830: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:worker/replica:0/task:1/device:GPU:0 with 30459 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:61:00.0, compute capability: 7.0)
2020-03-24 12:34:21.421732: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:worker/replica:0/task:1/device:GPU:1 with 30459 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:62:00.0, compute capability: 7.0)
2020-03-24 12:34:24.867690: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:61:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2020-03-24 12:34:25.131206: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 1 with properties: 
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:62:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2020-03-24 12:34:25.131296: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0, 1
2020-03-24 12:34:25.134240: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-03-24 12:34:25.134255: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 1 
2020-03-24 12:34:25.134262: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N Y 
2020-03-24 12:34:25.134285: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1:   Y N 
2020-03-24 12:34:25.135390: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:0 with 30459 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:61:00.0, compute capability: 7.0)
2020-03-24 12:34:25.136407: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:1 with 30459 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:62:00.0, compute capability: 7.0)
MPI Environment initialized. Process id: 0 Total processes: 2 || Hostname: gpu5.maas 
MPI Environment initialized. Process id: 1 Total processes: 2 || Hostname: gpu6.maas 
2020-03-24 12:34:25.215261: I tensorflow/contrib/mpi/mpi_utils.cc:41] MPI process-ID to gRPC server name map: 

2020-03-24 12:34:25.215289: I tensorflow/contrib/mpi/mpi_utils.cc:45] Process: 0    gRPC-name: worker:0:0

2020-03-24 12:34:25.215298: I tensorflow/contrib/mpi/mpi_utils.cc:45] Process: 1    gRPC-name: worker:0:1

2020-03-24 12:34:25.217218: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:252] Initialize GrpcChannelCache for job worker -> {0 -> localhost:8000, 1 -> 10.0.26.2:8000}
2020-03-24 12:34:25.218391: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:252] Initialize GrpcChannelCache for job worker -> {0 -> 10.0.24.2:8000, 1 -> localhost:8000}
2020-03-24 12:34:25.219733: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:391] Started server with target: grpc://localhost:8000
2020-03-24 12:34:25.220791: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:391] Started server with target: grpc://localhost:8000
TensorFlow:  1.13
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  True
Batch size:  512 global
             128 per device
Num batches: 100
Num epochs:  0.04
Devices:     ['job:worker/replica:0/task0/gpu:0', 'job:worker/replica:0/task0/gpu:1', 'job:worker/replica:0/task1/gpu:0', 'job:worker/replica:0/task1/gpu:1']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   distributed_all_reduce
AllReduce:   xring
Sync:        True
==========
Starting worker 0
TensorFlow:  1.13
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  True
Batch size:  512 global
             128 per device
Num batches: 100
Num epochs:  0.04
Devices:     ['job:worker/replica:0/task0/gpu:0', 'job:worker/replica:0/task0/gpu:1', 'job:worker/replica:0/task1/gpu:0', 'job:worker/replica:0/task1/gpu:1']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   distributed_all_reduce
AllReduce:   xring
Sync:        True
==========
Starting worker 1
2020-03-24 12:34:26.019660: I tensorflow/core/distributed_runtime/master_session.cc:1192] Start master session 9b718a2f6b884b4b with config: intra_op_parallelism_threads: 1 inter_op_parallelism_threads: 36 gpu_options { force_gpu_compatible: true experimental { } } allow_soft_placement: true graph_options { rewrite_options { pin_to_host_optimization: OFF } } experimental { collective_group_leader: "/job:worker/replica:0/task:0" }
2020-03-24 12:34:45.470916: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2020-03-24 12:34:46.758058: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally

The log shows that training terminated normally, but I can't see any output about the NIC selected by UCX. My script is:

#!/bin/bash
set -x

TF_BATCHSIZE=128
TF_MODEL=resnet50
TF_PROTOCOL=grpc+mpi

JOB_LIST='10.0.24.2:1,10.0.26.2:1'
# JOB_LIST='10.0.26.2:1'
/opt/openmpi/4.0.2/bin/mpirun -np 2 --host ${JOB_LIST} \
        -map-by node -mca pml ucx \
        -x TF_BATCHSIZE=$TF_BATCHSIZE \
        -x TF_MODEL=$TF_MODEL \
        -x TF_PROTOCOL=$TF_PROTOCOL \
        -x LD_LIBRARY_PATH=/opt/openmpi/4.0.2/lib \
        -x UCX_TLS=tcp \
        -x UCX_LOG_LEVEL=info \
        sh worker.sh

The UCX installed on the GPU servers comes from the Mellanox OFED driver.
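
A quick way to confirm which UCX build Open MPI actually loads (a sketch, assuming the Open MPI prefix above and its standard plugin layout) is:

# Which libucp does Open MPI's UCX PML link against?
ldd /opt/openmpi/4.0.2/lib/openmpi/mca_pml_ucx.so | grep -i ucp
# And what UCX version is installed?
ucx_info -v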

yosefe commented 4 years ago

Which UCX version is used?

Keepmoving-ZXY commented 4 years ago

This is the output of ucx_info -v:

(zxy) [zxy@gpu5 benchmarks]$ ucx_info -v 
# UCT version=1.6.0 revision f8b9db6
# configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --enable-cma --with-cuda --with-gdrcopy --with-verbs --with-cm --with-knem --with-rdmacm --without-rocm --without-xpmem --without-ugni --with-cuda=/usr/local/cuda-9.2

yosefe commented 4 years ago

I see. The feature that prints the selected devices exists only since UCX v1.8.0.
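
With v1.8.0 and newer, info-level logging prints an endpoint-configuration line naming the chosen transport and device, along the lines of the following (the exact format varies by version; tcp/eth0 is a placeholder):

ucp_worker.c:... UCX  INFO  ep_cfg[0]: tag(tcp/eth0)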

Keepmoving-ZXY commented 4 years ago

OK, I will upgrade UCX to v1.8.0 or later.

Keepmoving-ZXY commented 4 years ago

UCX v1.8.0 works, thank you.