qzan9 / osu-micro-benchmarks

Add HIP support to test GPUDirect capability of Hygon DCU

OMB (OSU Micro-Benchmarks)

The OSU Micro-Benchmarks use the GNU build system. Therefore you can simply use the following steps to build the MPI benchmarks.

Example:

./configure CC=/path/to/mpicc CXX=/path/to/mpicxx
make
make install

CC and CXX can also be set to other wrapper scripts to build the OpenSHMEM or UPC++ benchmarks. Based on this setting, configure will detect whether your library supports MPI-1, MPI-2, MPI-3, OpenSHMEM, and UPC++ and compile the corresponding benchmarks. See http://mvapich.cse.ohio-state.edu/benchmarks/ to download the latest version of this package.
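
For instance, a hedged sketch of an OpenSHMEM build, assuming your OpenSHMEM installation provides the commonly used oshcc/oshc++ compiler wrappers (a UPC++ build would substitute the corresponding UPC++ wrappers):

./configure CC=/path/to/oshcc CXX=/path/to/oshc++
make
make install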

This package also distributes UPC put, get, and collective benchmarks. These are located in the upc subdirectory and can be compiled by the following:

    for bench in osu_upc_memput              \
                 osu_upc_memget              \
                 osu_upc_all_scatter         \
                 osu_upc_all_reduce          \
                 osu_upc_all_gather          \
                 osu_upc_all_gather_all      \
                 osu_upc_all_exchange        \
                 osu_upc_all_broadcast       \
                 osu_upc_all_barrier
    do
        echo "Compiling $bench..."
        upcc $bench.c ../util/osu_util.c -o $bench
    done

The MPI Multiple Bandwidth / Message Rate (osu_mbw_mr), OpenSHMEM Put Message Rate (osu_oshm_put_mr), and OpenSHMEM Atomics (osu_oshm_atomics) tests are intended to be used with block assigned ranks. This means that all processes on the same machine are assigned ranks sequentially.

Rank    Block    Cyclic
0       host1    host1
1       host1    host2
2       host1    host1
3       host1    host2
4       host2    host1
5       host2    host2
6       host2    host1
7       host2    host2

If you're using mpirun_rsh, the ranks are assigned in the order they are seen in the hostfile or on the command line. Please see your process manager's documentation for information on how to control the distribution of the rank-to-host mapping.
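
For example, with mpirun_rsh a hostfile that lists each host consecutively produces the block assignment shown above (a minimal sketch; the hostnames and the choice of osu_mbw_mr are placeholders):

$ cat hostfile
host1
host1
host1
host1
host2
host2
host2
host2
$ mpirun_rsh -np 8 -hostfile hostfile ./osu_mbw_mr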

Point-to-Point MPI Benchmarks

osu_latency - Latency Test

osu_latency_mt - Multi-threaded Latency Test

osu_bw - Bandwidth Test

osu_bibw - Bidirectional Bandwidth Test

osu_mbw_mr - Multiple Bandwidth / Message Rate Test

osu_multi_lat - Multi-pair Latency Test
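
Most of these tests run between a single pair of MPI processes (osu_mbw_mr and osu_multi_lat use multiple pairs). A minimal launch sketch using mpirun_rsh, assuming a hostfile with one entry per rank:

mpirun_rsh -np 2 -hostfile hostfile ./osu_latency
mpirun_rsh -np 2 -hostfile hostfile ./osu_bw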

Collective MPI Benchmarks

osu_allgather - MPI_Allgather Latency Test
osu_allgatherv - MPI_Allgatherv Latency Test
osu_allreduce - MPI_Allreduce Latency Test
osu_alltoall - MPI_Alltoall Latency Test
osu_alltoallv - MPI_Alltoallv Latency Test
osu_barrier - MPI_Barrier Latency Test
osu_bcast - MPI_Bcast Latency Test
osu_gather - MPI_Gather Latency Test
osu_gatherv - MPI_Gatherv Latency Test
osu_reduce - MPI_Reduce Latency Test
osu_reduce_scatter - MPI_Reduce_scatter Latency Test
osu_scatter - MPI_Scatter Latency Test
osu_scatterv - MPI_Scatterv Latency Test

Collective Latency Tests

Support for CUDA Managed Memory

The following benchmarks have been extended to evaluate performance of MPI communication from and to buffers allocated using CUDA Managed Memory.

* osu_bibw              - Bidirectional Bandwidth Test
* osu_bw                - Bandwidth Test
* osu_latency           - Latency Test
* osu_allgather         - MPI_Allgather Latency Test
* osu_allgatherv        - MPI_Allgatherv Latency Test
* osu_allreduce         - MPI_Allreduce Latency Test
* osu_alltoall          - MPI_Alltoall Latency Test
* osu_alltoallv         - MPI_Alltoallv Latency Test
* osu_bcast             - MPI_Bcast Latency Test
* osu_gather            - MPI_Gather Latency Test
* osu_gatherv           - MPI_Gatherv Latency Test
* osu_reduce            - MPI_Reduce Latency Test
* osu_reduce_scatter    - MPI_Reduce_scatter Latency Test
* osu_scatter           - MPI_Scatter Latency Test
* osu_scatterv          - MPI_Scatterv Latency Test

In addition to supporting communication to and from GPU memory allocated using CUDA or OpenACC, we now provide the additional capability of performing communication to and from buffers allocated as CUDA Managed Memory. CUDA Managed (or Unified) Memory allows applications to allocate memory that is accessible from both the CPU and the GPU using the cudaMallocManaged() call; the runtime migrates the buffer between CPU and GPU memory transparently to the user. Currently, we offer benchmarking with CUDA Managed Memory using the tests listed above.

These benchmarks accept additional run-time options for selecting managed-memory buffers; consult each benchmark's -h output for the exact flags.
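
As a hedged illustration only: the 'M' buffer placement and the '-d managed' selector below are assumptions made by analogy with the 'D'/'H' placements and '-d cuda' documented later in this README, and should be verified against each benchmark's -h output.

# Point-to-point latency with managed buffers on both ranks
# ('M' placement is an assumption; check ./osu_latency -h)
mpirun_rsh -np 2 -hostfile hostfile MV2_USE_CUDA=1 ./osu_latency M M

# Collective with managed buffers ('-d managed' is an assumption)
mpirun_rsh -np 4 -hostfile hostfile MV2_USE_CUDA=1 ./osu_allreduce -d managed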

Non-Blocking Collective MPI Benchmarks

osu_iallgather - MPI_Iallgather Latency Test
osu_iallgatherv - MPI_Iallgatherv Latency Test
osu_iallreduce - MPI_Iallreduce Latency Test
osu_ialltoall - MPI_Ialltoall Latency Test
osu_ialltoallv - MPI_Ialltoallv Latency Test
osu_ialltoallw - MPI_Ialltoallw Latency Test
osu_ibarrier - MPI_Ibarrier Latency Test
osu_ibcast - MPI_Ibcast Latency Test
osu_igather - MPI_Igather Latency Test
osu_igatherv - MPI_Igatherv Latency Test
osu_ireduce - MPI_Ireduce Latency Test
osu_iscatter - MPI_Iscatter Latency Test
osu_iscatterv - MPI_Iscatterv Latency Test

Non-Blocking Collective Latency Tests

One-sided MPI Benchmarks

osu_put_latency - Latency Test for Put with Active/Passive Synchronization

osu_get_latency - Latency Test for Get with Active/Passive Synchronization

osu_put_bw - Bandwidth Test for Put with Active/Passive Synchronization

osu_get_bw - Bandwidth Test for Get with Active/Passive Synchronization

osu_put_bibw - Bi-directional Bandwidth Test for Put with Active Synchronization

osu_acc_latency - Latency Test for Accumulate with Active/Passive Synchronization

osu_cas_latency - Latency Test for Compare and Swap with Active/Passive Synchronization

osu_fop_latency - Latency Test for Fetch and Op with Active/Passive Synchronization

osu_get_acc_latency - Latency Test for Get_accumulate with Active/Passive Synchronization

Point-to-Point OpenSHMEM Benchmarks

osu_oshm_put.c - Latency Test for OpenSHMEM Put Routine

osu_oshm_put_nb.c - Latency Test for OpenSHMEM Non-blocking Put Routine

osu_oshm_get.c - Latency Test for OpenSHMEM Get Routine

osu_oshm_get_nb.c - Latency Test for OpenSHMEM Non-blocking Get Routine

osu_oshm_put_mr.c - Message Rate Test for OpenSHMEM Put Routine

osu_oshm_put_mr_nb.c - Message Rate Test for Non-blocking OpenSHMEM Put Routine

osu_oshm_get_mr_nb.c - Message Rate Test for Non-blocking OpenSHMEM Get Routine

osu_oshm_put_overlap.c - Non-blocking Message Rate Overlap Test

osu_oshm_atomics.c - Latency and Operation Rate Test for OpenSHMEM Atomics Routines
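
As a hedged sketch, the OpenSHMEM tests are launched with the launcher provided by your OpenSHMEM implementation; oshrun is used here as an assumed example, and two PEs suffice for the point-to-point tests:

oshrun -np 2 ./osu_oshm_put
oshrun -np 2 ./osu_oshm_get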

Collective OpenSHMEM Benchmarks

osu_oshm_collect - OpenSHMEM Collect Latency Test
osu_oshm_fcollect - OpenSHMEM FCollect Latency Test
osu_oshm_broadcast - OpenSHMEM Broadcast Latency Test
osu_oshm_reduce - OpenSHMEM Reduce Latency Test
osu_oshm_barrier - OpenSHMEM Barrier Latency Test

Collective Latency Tests

Point-to-Point UPC Benchmarks

osu_upc_memput.c - Put Latency

osu_upc_memget.c - Get Latency

Collective UPC Benchmarks

osu_upc_all_barrier - UPC Barrier Latency Test
osu_upc_all_broadcast - UPC Broadcast Latency Test
osu_upc_all_scatter - UPC Scatter Latency Test
osu_upc_all_gather - UPC Gather Latency Test
osu_upc_all_gather_all - UPC GatherAll Latency Test
osu_upc_all_reduce - UPC Reduce Latency Test
osu_upc_all_exchange - UPC Exchange Latency Test

Collective Latency Tests

Point-to-Point UPC++ Benchmarks

osu_upcxx_async_copy_put.c - Put Latency

osu_upcxx_async_copy_get.c - Get Latency

Collective UPC++ Benchmarks

osu_upcxx_allgather - UPC++ Allgather Latency Test
osu_upcxx_alltoall - UPC++ Alltoall Latency Test
osu_upcxx_bcast - UPC++ Broadcast Latency Test
osu_upcxx_gather - UPC++ Gather Latency Test
osu_upcxx_reduce - UPC++ Reduce Latency Test
osu_upcxx_scatter - UPC++ Scatter Latency Test

Collective Latency Tests

Startup Benchmarks

osu_init.c - This benchmark measures the minimum, maximum, and average time each process takes to complete MPI_Init.

osu_hello.c - This is a simple hello world program. Users can take advantage of this to time how long it takes for all processes to execute MPI_Init and MPI_Finalize.

CUDA and OpenACC Extensions to OMB

CUDA Extensions to OMB can be enabled by configuring the benchmark suite with the --enable-cuda option as shown below. Similarly, OpenACC Extensions can be enabled by specifying the --enable-openacc option. The MPI library used should be able to support MPI communication from buffers in GPU device memory.

./configure CC=/path/to/mpicc \
            CXX=/path/to/mpicxx \
            --enable-cuda \
            --with-cuda-include=/path/to/cuda/include \
            --with-cuda-libpath=/path/to/cuda/lib
make
make install
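
The text above also mentions --enable-openacc; as a hedged sketch (paths are placeholders), an OpenACC-enabled build might be configured as follows:

./configure CC=/path/to/mpicc \
            CXX=/path/to/mpicxx \
            --enable-openacc
make
make install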

The following benchmarks have been extended to evaluate performance of MPI communication using buffers on NVIDIA GPU devices.

osu_bibw           - Bidirectional Bandwidth Test
osu_bw             - Bandwidth Test
osu_latency        - Latency Test
osu_put_latency    - Latency Test for Put
osu_get_latency    - Latency Test for Get
osu_put_bw         - Bandwidth Test for Put
osu_get_bw         - Bandwidth Test for Get
osu_put_bibw       - Bidirectional Bandwidth Test for Put
osu_acc_latency    - Latency Test for Accumulate
osu_cas_latency    - Latency Test for Compare and Swap
osu_fop_latency    - Latency Test for Fetch and Op
osu_allgather      - MPI_Allgather Latency Test
osu_allgatherv     - MPI_Allgatherv Latency Test
osu_allreduce      - MPI_Allreduce Latency Test
osu_alltoall       - MPI_Alltoall Latency Test
osu_alltoallv      - MPI_Alltoallv Latency Test
osu_bcast          - MPI_Bcast Latency Test
osu_gather         - MPI_Gather Latency Test
osu_gatherv        - MPI_Gatherv Latency Test
osu_reduce         - MPI_Reduce Latency Test
osu_reduce_scatter - MPI_Reduce_scatter Latency Test
osu_scatter        - MPI_Scatter Latency Test
osu_scatterv       - MPI_Scatterv Latency Test
osu_iallgather     - MPI_Iallgather Latency Test
osu_iallgatherv    - MPI_Iallgatherv Latency Test
osu_iallreduce     - MPI_Iallreduce Latency Test
osu_ialltoall      - MPI_Ialltoall Latency Test
osu_ialltoallv     - MPI_Ialltoallv Latency Test
osu_ialltoallw     - MPI_Ialltoallw Latency Test
osu_ibcast         - MPI_Ibcast Latency Test
osu_igather        - MPI_Igather Latency Test
osu_igatherv       - MPI_Igatherv Latency Test
osu_ireduce        - MPI_Ireduce Latency Test
osu_iscatter       - MPI_Iscatter Latency Test
osu_iscatterv      - MPI_Iscatterv Latency Test

If both CUDA and OpenACC support is enabled you can switch between the modes using the -d [cuda|openacc] option to the benchmarks. Whether a process allocates its communication buffers on the GPU device or on the host can be controlled at run-time. Use the -h option for more help.

./osu_latency -h
Usage: osu_latency [options] [RANK0 RANK1]

RANK0 and RANK1 may be `D' or `H' which specifies whether
the buffer is allocated on the accelerator device or host
memory for each mpi rank

options:
  -d TYPE   accelerator device buffers can be of TYPE `cuda' or `openacc'
  -h        print this help message

Each of the pt2pt benchmarks takes two input parameters. The first parameter indicates the location of the buffers at rank 0 and the second parameter indicates the location of the buffers at rank 1. The value of each parameter can be either 'H' or 'D' to indicate whether the buffers are to be on the host or on the device, respectively. When no parameters are specified, the buffers are allocated on the host. The collective benchmarks will use buffers allocated on the device if the -d option is used; otherwise, the buffers will be allocated on the host.

Examples:

- mpirun_rsh -np 2 -hostfile hostfile MV2_USE_CUDA=1 osu_latency D D

In this run, the latency test allocates buffers at both rank 0 and rank 1 on the GPU devices.

- mpirun_rsh -np 2 -hostfile hostfile MV2_USE_CUDA=1 osu_bw D H

In this run, the bandwidth test allocates buffers at rank 0 on the GPU device and buffers at rank 1 on the host.
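
For the collective benchmarks, device buffers are selected with the -d option described above. A minimal sketch, reusing mpirun_rsh and the MV2_USE_CUDA flag from the examples above:

mpirun_rsh -np 4 -hostfile hostfile MV2_USE_CUDA=1 ./osu_allreduce -d cuda

In this run, the MPI_Allreduce latency test allocates its buffers on the GPU device of every rank.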

Setting GPU affinity

GPU affinity for processes is set before MPI_Init is called in the benchmarks. The process rank on a node is normally used to do this and different MPI launchers expose this information through different environment variables. The benchmarks use an environment variable called LOCAL_RANK to get this information.

Starting with OMB v5.4.4, the benchmarks automatically identify the process rank on a node for MVAPICH2 when launched with mpirun_rsh. However, a script like the one below can be used to export this environment variable when running OMB with other MPI launchers and libraries.

#!/bin/bash

# mpirun_rsh/MVAPICH2 provides the node-local rank in MV2_COMM_WORLD_LOCAL_RANK
export LOCAL_RANK=$MV2_COMM_WORLD_LOCAL_RANK
exec $*

A copy of this script is installed as get_local_rank alongside the benchmarks. It can be used as follows:

mpirun_rsh -np 2 -hostfile hostfile MV2_USE_CUDA=1 get_local_rank \
    ./osu_latency D D
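
For other launchers, a hedged variant of the same wrapper can export LOCAL_RANK from the launcher-specific variable; OMPI_COMM_WORLD_LOCAL_RANK (Open MPI) and SLURM_LOCALID (Slurm) are given here as assumed examples to verify for your setup:

#!/bin/bash
# Hedged sketch: export whichever node-local rank variable your launcher provides.
export LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-$SLURM_LOCALID}
# Quote "$@" so arguments containing spaces are passed through unchanged.
exec "$@"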