jjhursey opened this issue 4 years ago
What I'm looking for is:
make distcheck
IBM CI machines are located behind a firewall. As a result, our backend Jenkins service polls GitHub about every 5 minutes for changes to PRs. After a test finishes running, we have custom scripts that push the results back as a Gist for the community to see.
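As a rough sketch of that last step (our real scripts are custom; the file name, description, and use of the GitHub CLI here are just illustrative placeholders), a result log could be published as a Gist like this:
# hypothetical example: publish a CI log as a public Gist (requires a GitHub token with gist scope)
shell$ gh gist create ompi-ci-results.txt --public --desc "Open MPI CI results"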
Our Open MPI testing is not too specialized at this point. We do have a "virtual cluster" capability available to the community that can scale to 254 nodes on demand. We currently limit the community to 160 nodes, but that can be adjusted.
ppc64le (either Power8 or Power9). We run three concurrent builds: GNU, XL, and PGI. Currently, PGI is disabled but will be re-enabled soon.
GNU: ./configure --prefix=/workspace/exports/ompi
XL:  ./configure --prefix=/workspace/exports/ompi --disable-dlopen CC=xlc_r CXX=xlC_r FC=xlf_r
PGI: ./configure --prefix=/workspace/exports/ompi --without-x CC=pgcc18 CXX=pgc++ FC=pgfortran
make -j 20
We run the GNU build across 10 machines and the other builds across 2 machines. The goal of this testing is to:
We run the following tests:
hello_c
hello_mpifh
hello_usempi
ring_c
ring_mpifh
ring_usempi
All of these are run like:
shell$ /workspace/exports/ompi/bin/mpirun --hostfile /workspace/hostfile.txt --map-by ppr:4:node --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 --mca pml ob1 --mca osc ^ucx --mca btl tcp,vader,self ring_c
A successful run through CI will take about 12-15 minutes. Most of that is building OMPI.
GNU:
autogen : (0:03:54)
configure : (0:02:22)
make : (0:03:22)
make install : (0:00:58)
XL:
autogen : (0:03:52)
configure : (0:03:41)
make : (0:05:29)
make install : (0:00:35)
PGI:
autogen : (0:03:33)
configure : (0:06:05)
make : (0:10:01)
make install : (0:01:08)
Mellanox Open MPI CI is intended to verify Open MPI with recent Mellanox SW components (Mellanox OFED, UCX and other HPC-X components) in the Mellanox lab environment.
CI is managed by Azure Pipelines service.
Mellanox Open MPI CI includes:
Related materials:
x86_64 (Intel Xeon E312xx (Sandy Bridge, IBRS update), 15 cores)
Specific configure options (combinations may be used):
--with-platform=contrib/platform/mellanox/optimized
--with-ompi-param-check
--enable-picky
--enable-mpi-thread-multiple
--enable-opal-multi-threads
make -j$(nproc)
Build scenario:
./autogen.sh
./configure ...
make -j$(nproc) install
make -j$(nproc) check
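For illustration, the ./configure ... step above might use one combination of the options listed earlier (a sketch only; the actual CI job matrix may combine them differently):
shell$ ./configure --with-platform=contrib/platform/mellanox/optimized --enable-mpi-thread-multiple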
Sanity tests (over UCX/HCOLL):
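A representative invocation over UCX might look like the following (the process count, device name, and MCA settings are assumptions, not the exact CI commands):
# hypothetical sanity run of the ring example over UCX; HCOLL can be enabled via the coll_hcoll_enable MCA parameter
shell$ mpirun -np 8 --map-by node --mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 ./ring_c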
CI takes ~18-20 min. (mostly Open MPI building).
Thanks @artemry-mlnx for that information. Do you test with oshmem as well?
Should we be running make distcheck in CI? Are there other OMPI integrity checks that we should be running on a routine basis?
I'm going to be out for a week, but don't let that stop progress.
Should we be running make distcheck in CI?
Yes! But do it in parallel to other CI jobs, because distcheck takes a while. Make sure to use whatever the appropriate AM flag is to pass down a good -j value for make into the process, so that the multiple builds that distcheck does aren't performed serially. This can significantly speed up overall distcheck time.
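A minimal sketch of what that could look like in a CI job (the job count and the DISTCHECK_CONFIGURE_FLAGS value are assumptions; with GNU make, the jobserver normally propagates -j into the nested builds that distcheck runs):
shell$ ./autogen.pl
shell$ ./configure
shell$ make -j 8 distcheck DISTCHECK_CONFIGURE_FLAGS="--disable-dlopen"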
Are there other OMPI integrity checks that we should be running on a routine basis?
make check?
Currently, our testing includes mtt with EFA and TCP. This tests v2.x, v3.0.x, v3.1.x, v4.0.x, and master. These are the configure options:
--oversubscribe --enable-picky --enable-debug --enable-mpirun-prefix-by-default --disable-dlopen --enable-io-romio --enable-mca-no-build=io-ompio,common-ompio,sharedfp,fbtl,fcoll CC=xlc_r CXX=xlC_r FC=xlf_r --with-ofi=/opt/amazon/efa/
CFLAGS=-pipe --enable-picky --enable-debug --with-ofi=/opt/amazon/efa/
--enable-static --disable-shared
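For example, the EFA debug line above corresponds to a configure invocation roughly like this (a sketch; the MTT harness drives the full build and run cycle):
shell$ ./configure CFLAGS=-pipe --enable-picky --enable-debug --with-ofi=/opt/amazon/efa/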
In our nightly, canary, and CI tests for libfabric, we use only Open MPI 4.0.2 (soon to be switched to 4.0.3). We use the release versions rather than pulling from the GitHub branch directly. These tests mainly run on our network-optimized instance types, such as the c5n model types - https://aws.amazon.com/ec2/instance-types/
During the Open MPI face-to-face we discussed moving some of the CI checks to AWS, which can harness parallel instances to speed things up. Then each organization can focus on testing special configurations in their environments.
To start this discussion, I'd like to see what tests the various organizations are running in their CI. Once we have the list, we can work on removing duplicate efforts. We can use this repo as needed to help facilitate this coordination.
Please reply with a comment listing what you are testing now.