jjhursey opened this issue 4 years ago
What I'm looking for is:
make distcheck
IBM CI machines are located behind a firewall. As a result, our backend Jenkins service polls GitHub about every 5 minutes for changes to PRs. After a test finishes running, we have custom scripts that push the results back as a Gist for the community to see.
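As a rough sketch of that last step (our real scripts are custom; the file name, description, and use of the GitHub CLI here are just illustrative placeholders), a result log could be published as a Gist like this:
# hypothetical example: publish a CI log as a public Gist (requires a GitHub token with gist scope)
shell$ gh gist create ompi-ci-results.txt --public --desc "Open MPI CI results"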
Our Open MPI testing is not too specialized at this point. We do have a "virtual cluster" capability available to the community that can scale to 254 nodes on demand. We currently limit the community to 160 nodes, but that can be adjusted.
ppc64le (either Power8 or Power9). We run three concurrent builds: GNU, XL, and PGI. Currently, PGI is disabled but will be re-enabled soon.
GNU: ./configure --prefix=/workspace/exports/ompi
XL:  ./configure --prefix=/workspace/exports/ompi --disable-dlopen CC=xlc_r CXX=xlC_r FC=xlf_r
PGI: ./configure --prefix=/workspace/exports/ompi --without-x CC=pgcc18 CXX=pgc++ FC=pgfortran
make -j 20
We run the GNU build across 10 machines and the other builds across 2 machines. The goal of this testing is to:
We run the following tests:
hello_c
hello_mpifh
hello_usempi
ring_c
ring_mpifh
ring_usempi
All of these are run like:
shell$ /workspace/exports/ompi/bin/mpirun --hostfile /workspace/hostfile.txt --map-by ppr:4:node --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 --mca pml ob1 --mca osc ^ucx --mca btl tcp,vader,self ring_c
A successful run through CI will take about 12-15 minutes. Most of that is building OMPI.
GNU:
autogen : (0:03:54)
configure : (0:02:22)
make : (0:03:22)
make install : (0:00:58)
XL:
autogen : (0:03:52)
configure : (0:03:41)
make : (0:05:29)
make install : (0:00:35)
PGI:
autogen : (0:03:33)
configure : (0:06:05)
make : (0:10:01)
make install : (0:01:08)
Mellanox Open MPI CI is intended to verify Open MPI with recent Mellanox SW components (Mellanox OFED, UCX and other HPC-X components) in the Mellanox lab environment.
CI is managed by Azure Pipelines service.
Mellanox Open MPI CI includes:
Related materials:
x86_64 (Intel Xeon E312xx (Sandy Bridge, IBRS update), 15 cores)
Specific configure options (combinations may be used):
--with-platform=contrib/platform/mellanox/optimized
--with-ompi-param-check
--enable-picky
--enable-mpi-thread-multiple
--enable-opal-multi-threads
make -j$(nproc)
Build scenario:
./autogen.sh
./configure ...
make -j$(nproc) install
make -j$(nproc) check
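For illustration, the ./configure ... step above might use one combination of the options listed earlier (a sketch only; the actual CI job matrix may combine them differently):
shell$ ./configure --with-platform=contrib/platform/mellanox/optimized --enable-mpi-thread-multiple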
Sanity tests (over UCX/HCOLL):
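A representative invocation over UCX might look like the following (the process count, device name, and MCA settings are assumptions, not the exact CI commands):
# hypothetical sanity run of the ring example over UCX; HCOLL can be enabled via the coll_hcoll_enable MCA parameter
shell$ mpirun -np 8 --map-by node --mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 ./ring_c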
CI takes ~18-20 min. (mostly Open MPI building).
Thanks @artemry-mlnx for that information. Do you test with oshmem as well?
Should we be running make distcheck in CI? Are there other OMPI integrity checks that we should be running on a routine basis?
I'm going to be out for a week, but don't let that stop progress.
Should we be running make distcheck in CI?
Yes! But do it in parallel to other CI jobs, because distcheck takes a while. Make sure to use whatever the appropriate AM flag is to pass down a good -j value for make into the process, so that the multiple builds that distcheck does aren't performed serially. This can significantly speed up overall distcheck time.
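A minimal sketch of what that could look like in a CI job (the job count and the DISTCHECK_CONFIGURE_FLAGS value are assumptions; with GNU make, the jobserver normally propagates -j into the nested builds that distcheck runs):
shell$ ./autogen.pl
shell$ ./configure
shell$ make -j 8 distcheck DISTCHECK_CONFIGURE_FLAGS="--disable-dlopen"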
Are there other OMPI integrity checks that we should be running on a routine basis?
make check?
Currently, our testing includes mtt with EFA and TCP. This tests v2.x, v3.0.x, v3.1.x, v4.0.x, and master. These are the configure options:
--oversubscribe --enable-picky --enable-debug --enable-mpirun-prefix-by-default --disable-dlopen --enable-io-romio --enable-mca-no-build=io-ompio,common-ompio,sharedfp,fbtl,fcoll CC=xlc_r CXX=xlC_r FC=xlf_r --with-ofi=/opt/amazon/efa/
CFLAGS=-pipe --enable-picky --enable-debug --with-ofi=/opt/amazon/efa/
--enable-static --disable-shared
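For example, the EFA debug line above corresponds to a configure invocation roughly like this (a sketch; the MTT harness drives the full build and run cycle):
shell$ ./configure CFLAGS=-pipe --enable-picky --enable-debug --with-ofi=/opt/amazon/efa/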
In our nightly, canary, and CI tests for libfabric, we use only Open MPI 4.0.2 (soon to be switched to 4.0.3). We use the release versions rather than pulling from the GitHub branch directly. These tests mainly run on our network-optimized instance types, such as the c5n model types - https://aws.amazon.com/ec2/instance-types/
During the Open MPI face-to-face we discussed moving some of the CI checks to AWS, which can harness parallel instances to speed things up. Then each organization can focus on testing special configurations in their environments.
To start this discussion, I'd like to see what tests the various organizations are running in their CI. Once we have the list, we can work on removing duplicate efforts. We can use this repo as needed to help facilitate this coordination.
Please reply with a comment listing what you are testing now.