trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

Set up robust portable pre-push and post-push CI tools and process based on the SEMS Dev Env #482

Closed bartlettroscoe closed 6 years ago

bartlettroscoe commented 8 years ago

Next Action Status:

New CI build is pushed to 'develop', new post-push CI server is running, and new checkin-test-sems.sh script ready for more testing and review ... Not going to pursue other extensions (e.g. Mac OSX, tcsh, etc.). See https://github.com/trilinos/Trilinos/issues/482#issuecomment-266124179. Next: Leave in review until 1/1/2017, then close.

Blocked By: #158, #410, #362

Blocking: #380

Related To: #370, #475, #476

CC: @trilinos/framework

Description:

Trilinos has not had an effective pre-push CI development process for many years. When the checkin-test.py script was first created (back in 2008 or so), the primary stack of packages was based on Epetra and the main external dependencies were the C/C++/Fortran compilers and BLAS and LAPACK. Those dependencies and the major Trilinos customers at the time were used to select the initial set of Primary Tested (initially called Primary Stable) packages that is still being used to this day. However, since that time, many new Trilinos packages have been added and important Trilinos customers rely on many of these newer packages (e.g. SEACAS, STK, Tpetra, Phalanx, Panzer, etc.). In addition, these newer packages require more dependencies than just BLAS and LAPACK; TPLs like Boost, HDF5, NetCDF, ParMETIS, SuperLU, and others used by Trilinos are now also very important to many Trilinos customers.

Another problem with the current pre-push CI testing process for Trilinos is that Trilinos developers have a variety of different types of machines, OSs, versions of compilers, TPL implementations, etc. that they use to develop on and push changes for Trilinos. This has resulted in people who tried to use the checkin-test.py script suffering failed pushes due to tests that fail on their machine but were not triggered by their changes. In contrast, projects that have a uniform pre-push CI testing env don't experience these types of problems. One example of such a project is CASL VERA, which uses TriBITS and the checkin-test.py script and has a set of uniform development machines where developers almost never see tests fail in their build of code that passed in another developer's build. Therefore, the only failed builds and tests are due to their own local changes. In that project, there is no trepidation about running the checkin-test.py script and everyone uses it uniformly for nearly every push.

Another problem with the current CI testing process for Trilinos is that the post-push CI server that posts to CDash enables a different set of packages and TPLs from what the pre-push CI build does (and of course uses different compilers, MPI, etc.). Therefore, a CI build/test failure seen on CDash may not be seen with the checkin-test.py script locally, and vice versa. This makes it difficult for developers to determine whether the failures they are seeing on their own machine are due to their local changes, due to differences between the env on their machine and the machine running the CI build posting to CDash, due to a different set of enabled packages and TPLs, or something else.

As a result, the stability of the main Trilinos development branch (now the 'develop' branch, see #370) has degraded from what it was 5+ years ago. This is a problem because Trilinos needs to have a more stable 'develop' branch in order to more frequently update from the 'develop' branch to the 'master' branch (see #370).

This story is to address all of these shortcomings of the current Trilinos CI testing process. The new SEMS Dev Env (#158) provides an opportunity to create a fairly portable (at least for SNL staff members) uniform pre-push and post-push CI testing environment for the first time.

Here is the plan for setting up a more effective CI process based on the SEMS Dev Env, the checkin-test.py script, and CTest/CDash:

  1. Select a standard pre-push CI build env based on the SEMS Dev Env: Currently, GCC 4.7.2 and OpenMPI 1.6.5 are being used for the post-push CI build that posts to CDash. These selections should be reexamined and potentially changed. This will be used to create a standard load_ci_sems_dev_env.sh script, which just calls the local_sems_dev_env.sh script with the selections.
  2. Select an expanded/revised set of Primary Tested (PT) packages and TPLs: This revised set should be based on the most important packages and TPLs to current Trilinos customers. Any important TPL not already supported by the SEMS Dev Env may need to be added (i.e. to the Trilinos space under the /projects/ NFS mount). Revising the set of PT packages and TPLs is being addressed in #410.
  3. Set up a standard checkin-test-sems.sh script that all Trilinos developers can use to push changes to the Trilinos 'develop' branch: This should automatically load the correct standard SEMS Dev Env by sourcing load_ci_sems_dev_env.sh. This should likely only run a single build of Trilinos to speed up the testing/push process. (If there is a single build, it would likely include -DTPL_ENABLE_MPI=ON -DCMAKE_BUILD_TYPE=RELEASE -DTrilinos_ENABLE_DEBUG=ON -DBUILD_SHARED_LIBS=ON -DTrilinos_ENABLE_EXPLICIT_INSTANTIATION=ON -DTrilinos_ENABLE_FLOAT=OFF -DTrilinos_ENABLE_COMPLEX=OFF. See #362 about turning off float and complex by default.) A rough sketch of such a wrapper is shown right after this list.
  4. Change the main post-push CI server that posts to CDash to use the exact same build as the default builds for the checkin-test-sems.sh script: This is needed to catch the violations of the additive test assumption of branches. This can also be used to alert Trilinos developers when there are failures in the standard CI build or to verify that failures they are seeing are not their doing. If other post-push CI builds are desired, like non-MPI serial and full release builds, then those can be added as extra CI builds (we just need extra machines for that).
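
The intent is for this to be a thin wrapper. As a minimal sketch (the paths, file layout, and exact option values here are assumptions, not the final implementation), checkin-test-sems.sh could look something like:

#!/bin/bash
# checkin-test-sems.sh (sketch): load the standard SEMS-based CI env and then
# delegate to the generic checkin-test.py driver with the standard CI build.
source $(dirname $0)/load_ci_sems_dev_env.sh    # assumed location of the env script
$(dirname $0)/checkin-test.py \
  --default-builds=MPI_RELEASE_DEBUG_SHARED \
  "$@"

A developer would then test and push with something like ./checkin-test-sems.sh --do-all --push from their build directory.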

After this Story is complete, we can create new Stories to get Trilinos developers to use the checkin-test-sems.sh script and to commit to keeping the CI build(s) 100% passing all the time, with "Stop the Line" urgency to fix failures.

Definition of Done:

Decisions that need to be made:

Tasks:

  1. Create drafts for load_ci_sems_dev_env.sh and checkin-test-sems.sh [Done]
  2. Discuss this Story at a Trilinos Leaders Meeting [Done]
  3. Work #410 to select the updated set of PT packages and TPLs [Done]
  4. Work #362 to disable float and complex by default [Done]
  5. Select the new set (or just one) of --default-builds for the checkin-test.py script and therefore the checkin-test-sems.sh script [Done]
    • Make updates to Trilinos and checkin-test.py script on branch better-ci-build-482 ... IN PROGRESS ...
    • Get proposed changes reviewed (quickly) [Done]
    • Create wiki documentation for usage checkin-test-sems.sh [Done]
    • Commit changes to 'develop' branch [Done]
  6. Create a new post-push CI build on crf450 that uses the identical CI build as checkin-test-sems.sh --local-do-all [Done]
    • Set up cron job or Jenkins job to run the build [Done]
    • Run the CI build for several days and have people review it [Done]
  7. Have updated CI process and documentation reviewed ... In Progress ...
  8. Update the existing Jenkins CI build to use the new CI build and then remove the CI build on crf450 ...
bartlettroscoe commented 8 years ago

This is already in progress. I have already created drafts for the load_ci_sems_dev_env.sh and checkin-test-sems.sh scripts.

bartlettroscoe commented 8 years ago

The SEMS Dev Env #158 should be ready to go (after a review). Now we need to get together a proposal for a new set of PT vs. ST packages. This is being tracked in the Issue #410.

bartlettroscoe commented 8 years ago

Now that I have been added to the necessary metagroups, I can see the setup for the Jenkins build configurations. Looking at the Trilinos CI Jenkins build at:

I can see that it basically just starts a build at 1 am MDT and then polls the main Trilinos GitHub 'develop' branch every 10 minutes. The problem with that is that it will not pick up changes in the extra repos.

But I can see the exact script that runs the build:

#!/bin/bash -ex

module load cmake/2.8.11
module load gcc/4.7.2/base
module load gcc/4.7.2/openmpi/1.6.5
module load boost/1.55.0/gcc/4.7.2/base
module load superlu/4.3/gcc/4.7.2/base
module load netcdf/4.3.2/gcc/4.7.2/openmpi/1.6.5
module load hdf5/1.8.12/gcc/4.7.2/openmpi/1.6.5
module list
env

ctest -j10 -S $WORKSPACE/Trilinos/cmake/ctest/drivers/sadl30906/ctest_linux_continuous_mpi_opt_shared_sadl30906_jenkins.cmake

All of this should be put into a version-controlled script and just that script should be run from Jenkins. Also, that script should source load_ci_sems_dev_env.sh to exactly match what the checkin-test-sems.sh script does. This will also make it easy to test that build experimentally on any local machine that has the SEMS Dev Env. What is interesting is that Jenkins itself recommends this in the little "help" icon text:

"As a best practice, try not to put a long shell script in here. Instead, consider adding the shell script in SCM and simply call that shell script from Jenkins (via bash -ex myscript.sh or something like that), so that you can track changes in your shell script. "

This version-controlled driver script can also be modified to perform a check that everything passed and, if it did, update the 'master' branch from the 'develop' branch automatically. This might require some changes to the TribitsCTestDriverCore.cmake script to clearly print out (or write to a results file) "ALL PASSED". This would allow us to automate the update of the 'master' branch.
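
As a rough sketch of that idea (the script name and the location of load_ci_sems_dev_env.sh are assumptions), the version-controlled driver could reduce the Jenkins job body to a single call:

#!/bin/bash -ex
# ci_jenkins_driver.sh (sketch, name assumed): keep all of the CI build logic in
# the Trilinos repo so Jenkins only needs to invoke this one script.
source $WORKSPACE/Trilinos/cmake/load_ci_sems_dev_env.sh   # assumed path; matches checkin-test-sems.sh
module list
env
ctest -j10 -S $WORKSPACE/Trilinos/cmake/ctest/drivers/sadl30906/ctest_linux_continuous_mpi_opt_shared_sadl30906_jenkins.cmake

The same script would also be the natural place to grep the driver output for an "ALL PASSED" marker before any automated update of the 'master' branch.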

bartlettroscoe commented 8 years ago

It just occurred to me that we could set up the checkin-test-sems.sh script to automatically query CDash for the latest CI build and then automatically disable any failing tests for that CI build. An updated version of CDash will allow that to occur. You would then extract that list of tests with a Python script and create the MPI_RELEASE_DEBUG_SHARED.config file to contain:

-D<failed_test_0>_DISABLE=TRUE
-D<failed_test_1>_DISABLE=TRUE
...

That would save developers from having to always check CDash to see if there are existing failing CI builds. However, if Trilinos developers adopt the usage of the checkin-test-sems.sh script, that should make it very rare for a Trilinos developer to ever run into a failing test that their local changes have not triggered in some way.
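
A minimal sketch of that last step, assuming the names of the currently failing CI tests have already been extracted from CDash into a plain text file (failing_ci_tests.txt is a hypothetical name; the CDash query itself is not shown):

# Turn a list of failing CI test names (one per line) into checkin-test.py
# disable options appended to the build's .config file.
while read -r test_name; do
  echo "-D${test_name}_DISABLE=TRUE"
done < failing_ci_tests.txt >> MPI_RELEASE_DEBUG_SHARED.config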

jwillenbring commented 8 years ago

@bartlettroscoe suggested I make this comment in this ticket, instead of in #158

I am comfortable with having a prepush environment that uses GCC 4.8.4, but for the short time that we have just one build protecting the promotion from develop to master, I really think we need to use GCC 4.7.2 for that promotional testing. That version has to work now, and if it doesn't, it will complicate the integration process for @bmpersc and customers.

bartlettroscoe commented 8 years ago

CC: @maherou, @bmpersc

There was an example push yesterday shown here:

that demonstrates why we urgently need to get this story completed.

I will work this next week to get the new checkin-test.py script in place including selecting the new set of PT packages and TPLs based on the SEMS env. We need to discuss this at the next Trilinos Framework meeting.

bartlettroscoe commented 7 years ago

One issue that came up in my conversation with Alejandro Mota today about the difficulty of safely pushing changes to Trilinos was that the SEMS Dev Env is not really available for SNL/CA staff members. That is because the official COE in SNL/CA is RHEL 5! Therefore, most SNL/CA staff members just build their own Linux machines (and they don't use the SNL/NM RHEL 6 COE). Therefore, even if they have access to the SRN and the machine where the SEMS Dev Env NFS mount directory is located, it does them no good since they are not running the SNL/NM RHEL 6 COE OS.

SEMS really needs to provide build-from-source scripts to install the SEMS Dev Env on a given Linux machine. That, or they need to build a Docker container for RHEL 6 that has the SEMS Dev Env installed on it. Otherwise, Trilinos needs to provide accounts on push servers at SNL/NM that have the SEMS Dev Env available.

nmhamster commented 7 years ago

Why can't SNL/CA upgrade? Having multiple OS installs, especially ones that are so old is really difficult to support at this level.

bartlettroscoe commented 7 years ago

Why can't SNL/CA upgrade? Having multiple OS installs, especially ones that are so old is really difficult to support at this level.

I don't know the answer to that. We would have to ask them. I heard this from @amota about his situation at SNL/CA when I suggested that they use the SEMS Dev Env to provide a safe way to push to Trilinos.

nmhamster commented 7 years ago

I think we should explore this question some more. RHEL5 was getting old during my PhD!

ibaned commented 7 years ago

SEMS really needs to provide build-from-source scripts to install the SEMS Dev Env on a given Linux machine. That, or they need to build a Docker container for RHEL 6 that has the SEMS Dev Env installed on it.

Either of these would be helpful for developers outside SNL/CA with their own machines.

nmhamster commented 7 years ago

Have you seen the announcement that RHEL5 is to be retired? I think only 6 and 7 are supported now.

bartlettroscoe commented 7 years ago

Have you seen the announcement that RHEL5 is to be retired? I think only 6 and 7 are supported now.

But that is not the immediate issue. The immediate issue is that because RHEL 5 was the official COE at SNL/CA, people went off and built their own Linux workstations not using the official RHEL 6 and 7 COEs. That is the problem.

bartlettroscoe commented 7 years ago

I suspect that commit f1225dc606 is something that would have been caught by this updated pre-push CI testing process using checkin-test-sems.sh.

bartlettroscoe commented 7 years ago

Being able to do robust git bisection is one of the motivations for using the checkin-test.py script to push all commits. Below is a concrete example. Stefan reports that half of the commits he is trying to bisect on are not even passing configure.


From: Bartlett, Roscoe A
Sent: Saturday, October 22, 2016 11:06 AM
To: Domino, Stefan Paul; Trilinos Developers List
Subject: RE: [Trilinos-developers] [EXTERNAL] New Nalu diffs...

Stefan,

You can first bisect on commits that are marked as good by the checkin-test.py script. These have the string “Build/Test Cases Summary” in the git commit log. In the first round of git bisect, you then skip all commits that don’t have that string and you can bound the true bad commit without hitting false failures due to bad non-complete commits (like you describe below). Details are described here:

Unfortunately, very few Trilinos developers are currently using the checkin-test.py script to push so you will not be able to do very fine-grained bisection with recent commits using that approach (but it will bound the bad commit and then you can do manual bisection from there).
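
For reference, a rough sketch of that first bisection round (the commit SHAs are placeholders to fill in):

# Bisect only over commits that carry the "Build/Test Cases Summary" marker
# added by checkin-test.py; skip everything else in the first round.
GOOD=<known-good-sha>   # placeholders
BAD=<known-bad-sha>
git bisect start "$BAD" "$GOOD"
for sha in $(git rev-list "$GOOD".."$BAD"); do
  if ! git log -1 --format=%B "$sha" | grep -q "Build/Test Cases Summary"; then
    git bisect skip "$sha"
  fi
done
# ... then drive the usual 'git bisect good' / 'git bisect bad' loop as normal.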

We are trying to get things set up so that everyone can use the checkin-test.py script when pushing to Trilinos:

I was hoping to have that done before the TUG.

Once everyone is using the checkin-test.py script to test and push, you will be able to do pretty fine-grained bisection safely.

Cheers,

-Ross


From: Trilinos-developers [mailto:trilinos-developers-bounces@trilinos.org] On Behalf Of Domino, Stefan Paul
Sent: Friday, October 21, 2016 9:09 PM
To: Trilinos Developers List
Subject: Re: [Trilinos-developers] [EXTERNAL] New Nalu diffs...

One final data point for the weekend. Half of the SHA1 configure steps are failing during my bisect of Trilinos/master.

Moreover, even some that do configure fail with various Trilinos build errors.

Any suggestions on how to bisect such a situation would be most appreciated.

Stefan


From: Trilinos-developers trilinos-developers-bounces@trilinos.org on behalf of "Domino, Stefan Paul" spdomin@sandia.gov
Date: Friday, October 21, 2016 at 5:19 PM
To: Trilinos Developers List trilinos-developers@trilinos.org
Subject: [EXTERNAL] [Trilinos-developers] New Nalu diffs...

Greetings,

Any other apps seeing new diffs as of:

commit 5320963fbe2ec9aeba1dc62871e4be4f44961970

I have started a GH ticket under NaluCFD to start tracking how many times I need to bisect Trilinos per week:

Best,

Stefan

tjfulle commented 7 years ago

Perhaps this has already been discussed, but instead of using different shell scripts to call checkin-test.py that set up the user environment (and must, therefore, be sourced), why not add the functionality directly to checkin-test.py? A --use-sems-env (or similar) flag could be added to the options accepted by checkin-test.py that, if set, tells checkin-test.py to "load" the SEMS modules. Since checkin-test.py is not itself sourced, it would call the modulecmd executable directly (bypassing the module shell function), parse modulecmd's stdout, and put the result in an extraEnv dictionary that would then be passed along to tribits.python_utils.runSysCmndInterface. Perhaps better still, tribits.python_utils.runSysCmndInterface could be modified to take a fullEnv dictionary as an argument (that defaults to os.environ) and checkin-test.py could then set the full environment. This would guarantee a consistent test environment that is not "corrupted" by any user environment settings.

The following code could be added to checkin-test.py and would create the fullEnv dictionary (that would then be passed to tribits.python_utils.runSysCmndInterface):

import os
import re
import subprocess

def load_sems_module(module, env):
    # adjust the modulecmd path to the correct location for the machine
    modulecmd = '/usr/local/Cellar/modules/3.2.10/Modules/bin/modulecmd'
    command = '{0} csh load {1}'.format(modulecmd, module)
    proc = subprocess.Popen(command.split(),
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT,
                            env=env)
    out, err = proc.communicate()
    if re.search(r'.*\([0-9]+\):ERROR:', out):
        raise Exception(out)
    # modulecmd emits shell commands on stdout; apply the 'setenv' ones to env
    for line in out.split(';'):
        line = line.split()
        if not line or line[0] != 'setenv':
            continue
        env[line[1]] = ' '.join(line[2:])
    return env

fullEnv = None
if use_sems_env:
    required_sems_modules = ['cmake/2.8.11',
                             'gcc/4.7.2/base',
                             'gcc/4.7.2/openmpi/1.6.5',
                             'boost/1.55.0/gcc/4.7.2/base',
                             'superlu/4.3/gcc/4.7.2/base',
                             'netcdf/4.3.2/gcc/4.7.2/openmpi/1.6.5',
                             'hdf5/1.8.12/gcc/4.7.2/openmpi/1.6.5']
    # Start from a minimal environment (plus whatever else is needed by default)
    fullEnv = {'PATH': os.environ['PATH'],
               'HOME': os.environ['HOME'],
               'MODULEPATH': os.environ['MODULEPATH']}
    for sems_module in required_sems_modules:
        fullEnv.update(load_sems_module(sems_module, fullEnv))

Since the SEMS modules also set SEMS_${TPLNAME}_LIBRARY_PATH (and analogous INCLUDE_PATH) variables, the COMMON.config file could be modified easily to explicitly set -D${TPLNAME}_LIBRARY_DIRS:FILEPATH=... from the SEMS_${TPLNAME}_LIBRARY_PATH (and the same with the INCLUDE_DIRS).
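
For example, a sketch of that idea for one TPL (the exact SEMS variable names and TPL name casing would need to be checked; treat these lines as assumptions):

# Appended to COMMON.config, with analogous lines for the other SEMS-provided TPLs:
echo "-DBoost_INCLUDE_DIRS:FILEPATH=${SEMS_BOOST_INCLUDE_PATH}" >> COMMON.config
echo "-DBoost_LIBRARY_DIRS:FILEPATH=${SEMS_BOOST_LIBRARY_PATH}" >> COMMON.config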

Just some thoughts...

mhoemmen commented 7 years ago

Just wanted to put in my 2 cents -- @tjfulle very generously spent time to contribute to the check-in test script. It would be awesome if we could let him help out there. He was about to push changes, but I wasn't sure if it properly belonged to Trilinos or to TriBITS.

bartlettroscoe commented 7 years ago

Just wanted to put in my 2 cents -- @tjfulle very generously spent time to contribute to the check-in test script. It would be awesome if we could let him help out there. He was about to push changes, but I wasn't sure if it properly belonged to Trilinos or to TriBITS.

Contributions to TriBITS are most welcome. However, any non-trivial change needs to be pushed to the TriBITS GitHub repo.

@tjfulle, please follow the process outlined at:

and then it will get snapshotted to Trilinos as described here:

I will respond to the above comment in detail in a bit.

But note that the checkin-test.py script has to be more general than Trilinos and Sandia (it is being used heavily at ORNL for CASL). What we need is a more general solution for associating particular build envs as mentioned here.

And welcome to SNL and Trilinos!

-Ross

mhoemmen commented 7 years ago

Thanks @bartlettroscoe !

bartlettroscoe commented 7 years ago

We have a problem with this strategy. The only GCC compiler that is available on OSX in the SEMS env is GCC 5.3.0, and Boost 1.55.0 is not present, as shown by:

$ module avail

...

--------- /projects/sems/modulefiles/Darwin10.11-x86_64/sems/compiler ----------
sems-gcc/5.3.0      sems-openmpi/1.8.7  sems-python/3.5.2
sems-openmpi/1.10.1 sems-python/2.7.9
sems-openmpi/1.6.5  sems-python/3.4.2

------------ /projects/sems/modulefiles/Darwin10.11-x86_64/sems/tpl ------------
sems-astroid/1.4.3/base            sems-parmetis/4.0.3/32bit_parallel
sems-beautifulsoup4/4.4.1/base     sems-parmetis/4.0.3/64bit_parallel
sems-boost/1.58.0/base             sems-parmetis/4.0.3/parallel
sems-boost/1.59.0/base             sems-pylint/1.5.4/base
sems-dateutil/2.5.3/base           sems-pyparsing/2.0.3/base
sems-gprof2dot/2015.12.1/base      sems-pytz/2014.10/base
sems-hdf5/1.8.12/base              sems-qd/2.3.15/base
sems-hdf5/1.8.12/parallel          sems-scipy/0.15.1/base
sems-logilab/1.0.2/base            sems-scons/2.3.6/base
sems-matplotlib/1.4.2/base         sems-setuptools/22.0.5/base
sems-netcdf/4.3.2/base             sems-six/1.9.0/base
sems-netcdf/4.3.2/parallel         sems-superlu/4.3/base
sems-numpy/1.9.1/base              sems-zlib/1.2.8/base

...

Therefore, the only way to have one consistent CI env across all platforms is to use GCC 5.3.0 (and Boost 1.58.0 or 1.59.0, but not 1.55.0, which is the current selection). That may not be so bad. If we get all warnings out of GCC 5.3.0 they will likely be gone from GCC 4.7.2 as well. However, there is the problem of people using features of C++11 present in GCC 5.3.0 that are not present with GCC 4.7.2. Therefore, we would need to run a second post-push CI build that tests with GCC 4.7.2 to make sure things are okay and to deal with any problems quickly.

What do people think about this? Is GCC 5.3.0 a valid choice for the CI build for Trilinos? That is our only option currently. Otherwise, we are going to need to force OSX developers to test and push from a Linux machine that mounts the SEMS env.
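
For concreteness, a GCC 5.3.0 based env assembled only from the modules shown in the listing above might look like the following (the exact set of modules is an assumption):

module load sems-gcc/5.3.0
module load sems-openmpi/1.6.5
module load sems-boost/1.58.0/base
module load sems-zlib/1.2.8/base
module load sems-hdf5/1.8.12/parallel
module load sems-netcdf/4.3.2/parallel
module load sems-parmetis/4.0.3/parallel
module load sems-superlu/4.3/base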

bartlettroscoe commented 7 years ago

Another problem is that the SEMS env does not even provide a Scotch TPL as shown by:

$ module avail

...

------------ /projects/sems/modulefiles/Darwin10.11-x86_64/sems/tpl ------------
sems-astroid/1.4.3/base            sems-parmetis/4.0.3/32bit_parallel
sems-beautifulsoup4/4.4.1/base     sems-parmetis/4.0.3/64bit_parallel
sems-boost/1.58.0/base             sems-parmetis/4.0.3/parallel
sems-boost/1.59.0/base             sems-pylint/1.5.4/base
sems-dateutil/2.5.3/base           sems-pyparsing/2.0.3/base
sems-gprof2dot/2015.12.1/base      sems-pytz/2014.10/base
sems-hdf5/1.8.12/base              sems-qd/2.3.15/base
sems-hdf5/1.8.12/parallel          sems-scipy/0.15.1/base
sems-logilab/1.0.2/base            sems-scons/2.3.6/base
sems-matplotlib/1.4.2/base         sems-setuptools/22.0.5/base
sems-netcdf/4.3.2/base             sems-six/1.9.0/base
sems-netcdf/4.3.2/parallel         sems-superlu/4.3/base
sems-numpy/1.9.1/base              sems-zlib/1.2.8/base

So Scotch had to be removed from the default set of TPLs. We could not use the Scotch that is provided because it is 32-bit, which is not compatible with the 64-bit ParMETIS that is installed (see commit b339bc6).

I pushed this updated SEMS env that also works for OSX to the branch:

bartlettroscoe commented 7 years ago

I finally got the updated set of candidate PT packages, SE packages, and TPLs in #482. That set is (packages sorted alphabetically):

As stated above, this is just a subset of the TPLs being used by important Trilinos customers. This subset is limited by what SEMS provides.

I want to hammer through this quickly now and get this story done. Therefore, I will do the following, all on one topic branch:

The above approach decouples the selection of the SEMS env from the CI testing configuration. Setting up the variable Trilinos_ENABLE_CI_TEST_CONFIG that disables a bunch of scalar and ordinal types allows me to avoid doing #362 for now. Doing that would still be good, but it can be done later.

This strategy makes it easy to set the exact CI configuration for Trilinos, no matter what env is used. That also makes it easier for the checkin-test.py script and the post-push CI server to have exactly the same builds.

I will update the list of tasks in the above description field and get to work on this. It should not take long to get the above things done on a branch.

bartlettroscoe commented 7 years ago

While updating the Trilinos/PackagesList.cmake file, I realized that some of the new set of PT packages needed to be trimmed some. See details in this comment in #410. The new set of PT packages and TPLs is then:

bartlettroscoe commented 7 years ago

I completed the changes for the new PT CI build and got all passing tests (after disabling the existing failing tests in #826 and #828) for GCC 4.7.2 with the SEMS TPLs. I pushed to the branch:

I ran the updated checkin-test-sems.sh script and it produced the result shown below.

From: Roscoe A Bartlett [mailto:rabartl@crf450.srn.sandia.gov]
Sent: Friday, November 11, 2016 5:52 PM
To: Bartlett, Roscoe A
Subject: passed: Trilinos/MPI_RELEASE_DEBUG_SHARED:
passed=2263,notpassed=0

passed: Trilinos/MPI_RELEASE_DEBUG_SHARED: passed=2263,notpassed=0

Fri Nov 11 15:51:36 MST 2016

Enabled Packages:
Disabled Packages: PyTrilinos,Claps,TriKota Enabled all Packages
Hostname: crf450.srn.sandia.gov
Source Dir: /home/rabartl/Trilinos.base/Trilinos
Build Dir:
/home/rabartl/Trilinos.base/BUILDS/CHECKIN/MPI_RELEASE_DEBUG_SHARED

CMake Cache Varibles:
-DTrilinos_TRIBITS_DIR:PATH=/home/rabartl/Trilinos.base/Trilinos/cmake/tribits
-DTrilinos_ENABLE_TESTS:BOOL=ON
-DTrilinos_TEST_CATEGORIES:STRING=BASIC
-DTrilinos_ALLOW_NO_PACKAGES:BOOL=OFF
-DDART_TESTING_TIMEOUT:STRING=180.0
-DBUILD_SHARED_LIBS=ON
-DTrilinos_DISABLE_ENABLED_FORWARD_DEP_PACKAGES=ON
-DTrilinos_TRACE_ADD_TEST=ON
-DTrilinos_ENABLE_SECONDARY_TESTED_CODE:BOOL=OFF
-DTPL_ENABLE_MPI=ON
-DCMAKE_BUILD_TYPE=RELEASE
-DTrilinos_ENABLE_DEBUG=ON
-DBUILD_SHARED_LIBS=ON
-DTrilinos_ENABLE_DEBUG_SYMBOLS=ON
-DTrilinos_ENABLE_CI_TEST_MODE=ON
-DTrilinos_ENABLE_EXPLICIT_INSTANTIATION=ON
-DTrilinos_ENABLE_SECONDARY_TESTED_CODE=OFF
-DTrilinos_ENABLE_TESTS=ON
-DTeuchos_ENABLE_DEFAULT_STACKTRACE=OFF
-DTrilinos_DISABLE_ENABLED_FORWARD_DEP_PACKAGES=ON
-DTrilinos_TRACE_ADD_TEST=ON
-DTrilinos_ENABLE_ALL_OPTIONAL_PACKAGES:BOOL=ON
-DTrilinos_ENABLE_ALL_PACKAGES:BOOL=ON
-DTrilinos_ENABLE_ALL_FORWARD_DEP_PACKAGES:BOOL=ON
-DTrilinos_ENABLE_PyTrilinos:BOOL=OFF
-DTrilinos_ENABLE_Claps:BOOL=OFF
-DTrilinos_ENABLE_TriKota:BOOL=OFF
Make Options: -j16
CTest Options: -j16

Pull: Not Performed
Configure: Passed (2.47 min)
Build: Passed (67.60 min)
Test: Passed (11.53 min)

100% tests passed, 0 tests failed out of 2263

Label Time Summary:
Amesos               =  20.20 sec (13 tests)
Amesos2              =   9.52 sec (7 tests)
Anasazi              = 102.90 sec (71 tests)
AztecOO              =  17.14 sec (17 tests)
Belos                =  92.93 sec (61 tests)
Domi                 = 135.41 sec (106 tests)
Epetra               =  50.61 sec (61 tests)
EpetraExt            =  13.16 sec (10 tests)
FEI                  =  41.14 sec (43 tests)
Galeri               =   4.69 sec (9 tests)
GlobiPack            =   2.61 sec (6 tests)
Ifpack               =  59.36 sec (53 tests)
Ifpack2              =  47.30 sec (32 tests)
Intrepid             = 202.18 sec (152 tests)
Intrepid2            = 103.76 sec (107 tests)
Isorropia            =   8.27 sec (6 tests)
Kokkos               = 255.68 sec (21 tests)
ML                   =  48.44 sec (34 tests)
MueLu                = 264.37 sec (54 tests)
NOX                  = 137.57 sec (100 tests)
OptiPack             =   6.12 sec (5 tests)
Panzer               = 267.68 sec (125 tests)
Phalanx              =   5.84 sec (15 tests)
Pike                 =   4.93 sec (7 tests)
Piro                 =  25.19 sec (11 tests)
ROL                  = 668.21 sec (112 tests)
RTOp                 =  15.49 sec (24 tests)
Rythmos              = 163.94 sec (83 tests)
SEACAS               =   7.84 sec (8 tests)
STK                  =  14.00 sec (12 tests)
Sacado               = 101.74 sec (290 tests)
Shards               =   1.81 sec (4 tests)
ShyLU                =   8.22 sec (5 tests)
Stokhos              = 109.20 sec (74 tests)
Stratimikos          =  28.70 sec (39 tests)
Teko                 = 211.18 sec (19 tests)
Teuchos              =  55.44 sec (122 tests)
ThreadPool           =  10.25 sec (10 tests)
Thyra                =  67.40 sec (80 tests)
Tpetra               = 123.68 sec (119 tests)
TrilinosCouplings    =  57.85 sec (19 tests)
Triutils             =   2.62 sec (2 tests)
Xpetra               =  38.59 sec (16 tests)
Zoltan               = 196.74 sec (16 tests)
Zoltan2              = 132.35 sec (91 tests)

Total Test time (real) = 691.87 sec

Total time for MPI_RELEASE_DEBUG_SHARED = 81.61 min

See the attached files:

Note that this is with SEMS env:

Loading Trilinos SEMS Dev Env = 'sems-gcc/4.7.2 sems-openmpi/1.6.5 sems-cmake/3.5.2'!
Currently Loaded Modulefiles:
  1) sems-env                       4) sems-gcc/4.7.2                 7) sems-zlib/1.2.8/base          10) sems-parmetis/4.0.3/parallel
  2) sems-python/2.7.9              5) sems-openmpi/1.6.5             8) sems-hdf5/1.8.12/parallel     11) sems-scotch/6.0.3/parallel
  3) sems-cmake/3.5.2               6) sems-boost/1.55.0/base         9) sems-netcdf/4.3.2/parallel    12) sems-superlu/4.3/base
File local-checkin-test-defaults.py already exists, leaving it!

This env is not available on OSX. Therefore, I will try the env with GCC 5.3.0 and Boost 1.58.0 and see how that does on OSX (gaia).

bartlettroscoe commented 7 years ago

(2016/11/12)

I updated the env to GCC 5.3.0 so that it should also work on OSX. When I ran this on my Linux machine, the test TeuchosParameterList_ObjectBuilder_UnitTests failed. I created #831 for that and I ifdefed out the failing unit tests.

The version of Trilinos on the topic branch better-ci-build-482 was:

3cf9171 "Merge branch 'better-ci-build-482' of github.com:bartlettroscoe/Trilinos into better-ci-build-482"
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date:   Sat Nov 12 12:53:29 2016 -0700 (2 days ago)

On my Linux machine crf450, the full PT CI build now passes as shown by:

From: Roscoe A Bartlett [mailto:rabartl@crf450.srn.sandia.gov]
Sent: Saturday, November 12, 2016 3:11 PM
To: Bartlett, Roscoe A
Subject: passed: Trilinos/MPI_RELEASE_DEBUG_SHARED:
passed=2286,notpassed=0

passed: Trilinos/MPI_RELEASE_DEBUG_SHARED: passed=2286,notpassed=0

Sat Nov 12 13:11:04 MST 2016

Enabled Packages:
Disabled Packages: PyTrilinos,Claps,TriKota Enabled all Packages
Hostname: crf450.srn.sandia.gov
Source Dir: /home/rabartl/Trilinos.base/Trilinos
Build Dir:
/home/rabartl/Trilinos.base/BUILDS/CHECKIN/MPI_RELEASE_DEBUG_SHARED

CMake Cache Varibles:
-DTrilinos_TRIBITS_DIR:PATH=/home/rabartl/Trilinos.base/Trilinos/cmake/tribits
-DTrilinos_ENABLE_TESTS:BOOL=ON
-DTrilinos_TEST_CATEGORIES:STRING=BASIC
-DTrilinos_ALLOW_NO_PACKAGES:BOOL=OFF
-DDART_TESTING_TIMEOUT:STRING=180.0
-DBUILD_SHARED_LIBS=ON
-DTrilinos_DISABLE_ENABLED_FORWARD_DEP_PACKAGES=ON
-DTrilinos_TRACE_ADD_TEST=ON
-DTrilinos_ENABLE_SECONDARY_TESTED_CODE:BOOL=OFF
-DTPL_ENABLE_MPI=ON
-DCMAKE_BUILD_TYPE=RELEASE
-DTrilinos_ENABLE_DEBUG=ON
-DBUILD_SHARED_LIBS=ON
-DTrilinos_ENABLE_DEBUG_SYMBOLS=ON
-DTrilinos_ENABLE_CI_TEST_MODE=ON
-DTrilinos_ENABLE_EXPLICIT_INSTANTIATION=ON
-DTrilinos_ENABLE_SECONDARY_TESTED_CODE=OFF
-DTrilinos_ENABLE_TESTS=ON
-DTeuchos_ENABLE_DEFAULT_STACKTRACE=OFF
-DTrilinos_DISABLE_ENABLED_FORWARD_DEP_PACKAGES=ON
-DTrilinos_TRACE_ADD_TEST=ON
-DTrilinos_ENABLE_ALL_OPTIONAL_PACKAGES:BOOL=ON
-DTrilinos_ENABLE_ALL_PACKAGES:BOOL=ON
-DTrilinos_ENABLE_ALL_FORWARD_DEP_PACKAGES:BOOL=ON
-DTrilinos_ENABLE_PyTrilinos:BOOL=OFF
-DTrilinos_ENABLE_Claps:BOOL=OFF
-DTrilinos_ENABLE_TriKota:BOOL=OFF
Make Options: -j16
CTest Options: -j16

Pull: Not Performed
Configure: Passed (2.53 min)
Build: Passed (3.43 min)
Test: Passed (10.25 min)

100% tests passed, 0 tests failed out of 2286

Label Time Summary:
Amesos               =  17.77 sec (13 tests)
Amesos2              =   9.08 sec (7 tests)
Anasazi              = 102.96 sec (71 tests)
AztecOO              =  15.40 sec (17 tests)
Belos                =  86.77 sec (61 tests)
Domi                 = 150.84 sec (125 tests)
Epetra               =  41.70 sec (61 tests)
EpetraExt            =  13.90 sec (10 tests)
FEI                  =  38.16 sec (43 tests)
Galeri               =   3.56 sec (9 tests)
GlobiPack            =   1.15 sec (6 tests)
Ifpack               =  56.94 sec (53 tests)
Ifpack2              =  38.58 sec (32 tests)
Intrepid             = 181.38 sec (152 tests)
Intrepid2            = 104.94 sec (107 tests)
Isorropia            =   7.97 sec (6 tests)
Kokkos               = 176.33 sec (21 tests)
ML                   =  44.66 sec (34 tests)
MueLu                = 227.18 sec (54 tests)
NOX                  = 127.42 sec (100 tests)
OptiPack             =   6.08 sec (5 tests)
Panzer               = 232.67 sec (125 tests)
Phalanx              =   3.64 sec (15 tests)
Pike                 =   2.11 sec (7 tests)
Piro                 =  23.54 sec (11 tests)
ROL                  = 533.14 sec (112 tests)
RTOp                 =  10.20 sec (24 tests)
Rythmos              = 144.93 sec (83 tests)
SEACAS               =   6.72 sec (8 tests)
STK                  =  22.62 sec (12 tests)
Sacado               =  36.70 sec (290 tests)
Shards               =   0.41 sec (4 tests)
ShyLU                =   7.76 sec (5 tests)
Stokhos              =  87.07 sec (74 tests)
Stratimikos          =  25.60 sec (39 tests)
Teko                 = 130.83 sec (19 tests)
Teuchos              =  38.18 sec (123 tests)
ThreadPool           =   7.53 sec (10 tests)
Thyra                =  58.97 sec (80 tests)
Tpetra               = 116.77 sec (122 tests)
TrilinosCouplings    =  49.01 sec (19 tests)
Triutils             =   2.27 sec (2 tests)
Xpetra               =  41.65 sec (16 tests)
Zoltan               = 215.67 sec (16 tests)
Zoltan2              = 120.16 sec (91 tests)

Total Test time (real) = 615.03 sec

Total time for MPI_RELEASE_DEBUG_SHARED = 16.21 min

Here are the detailed output files:

Note that the Domi tests were fixed and are now shown as passing (see #828).

So we are now good to go for Linux. I am testing on OSX right now. It looks like there are failures there which I will need to look into.

bartlettroscoe commented 7 years ago

(2016/11/12)

On OSX gaia, I had to make some fixes to the checkin-test-sems.sh script to even get it to run there (you can't follow a symlink in bash).

The version of Trilinos on the topic branch better-ci-build-482 was:

3cf9171 "Merge branch 'better-ci-build-482' of github.com:bartlettroscoe/Trilinos into better-ci-build-482"
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date:   Sat Nov 12 12:53:29 2016 -0700 (2 days ago)

Testing on OSX gaia encountered build failures. To see the full extent of the build failures, I built ignoring errors so it would complete as much as it could:

$ cd MPI_RELEASE_DEBUG_SHARED
$ make -j12 -k &> make.out

That showed the failed targets:

$ grep "Error " make.out 
make[2]: *** [packages/stk/stk_util/stk_util/diag/libstk_util_diag.12.9.dylib] Error 1
make[1]: *** [packages/stk/stk_util/stk_util/diag/CMakeFiles/stk_util_diag.dir/all] Error 2
make[2]: *** [packages/stk/stk_util/stk_util/registry/libstk_util_registry.12.9.dylib] Error 1
make[1]: *** [packages/stk/stk_util/stk_util/registry/CMakeFiles/stk_util_registry.dir/all] Error 2
    *** [packages/stk/stk_util/stk_util/use_cases/libstk_util_use_cases.12.9.dylib] Error 1  
      make[1]: stk::get_memory_high_water_mark_across_processors(ompi_communicator_t*, unsigned long&, unsigned long&, unsigned long&)*** [packages/stk/stk_util/stk_util/use_cases/CMakeFiles/stk_util_use_cases.dir/all] Error 2
make[2]: *** [packages/stk/stk_util/stk_util/environment/libstk_util_env.12.9.dylib] Error 1
make[1]: *** [packages/stk/stk_util/stk_util/environment/CMakeFiles/stk_util_env.dir/all] Error 2
make[2]: *** [packages/rol/example/burgers-control/CMakeFiles/ROL_example_burgers-control_example_07.dir/example_07.cpp.o] Error 1
make[1]: *** [packages/rol/example/burgers-control/CMakeFiles/ROL_example_burgers-control_example_07.dir/all] Error 2

Examining the first build failure:

$ cd packages/stk/stk_util/stk_util/diag/

$ make VERBOSE=1
[...]
/projects/sems/install/Darwin10.11-x86_64/sems/compiler/gcc/5.3.0/openmpi/1.6.5/bin/mpicxx   -pedantic -Wall -Wno-long-long -Wwrite-strings  -g -std=c++11 -O3 -DNDEBUG -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk -dynamiclib -Wl,-headerpad_max_install_names -compatibility_version 12.0.0 -current_version 12.9.0 -o libstk_util_diag.12.9.dylib -install_name @rpath/libstk_util_diag.12.dylib CMakeFiles/stk_util_diag.dir/Mapv.cpp.o CMakeFiles/stk_util_diag.dir/Option.cpp.o CMakeFiles/stk_util_diag.dir/Platform.cpp.o CMakeFiles/stk_util_diag.dir/PrintTable.cpp.o CMakeFiles/stk_util_diag.dir/PrintTimer.cpp.o CMakeFiles/stk_util_diag.dir/SlibDiagWriter.cpp.o CMakeFiles/stk_util_diag.dir/String.cpp.o CMakeFiles/stk_util_diag.dir/StringUtil.cpp.o CMakeFiles/stk_util_diag.dir/Timer.cpp.o CMakeFiles/stk_util_diag.dir/TimerMetricTraits.cpp.o CMakeFiles/stk_util_diag.dir/UserPlugin.cpp.o CMakeFiles/stk_util_diag.dir/WriterExt.cpp.o CMakeFiles/stk_util_diag.dir/WriterParser.cpp.o CMakeFiles/stk_util_diag.dir/WriterRegistry.cpp.o ../../../../seacas/libraries/aprepro_lib/libaprepro_lib.12.9.dylib /projects/sems/install/Darwin10.11-x86_64/sems/tpl/boost/1.58.0/gcc/5.3.0/base/lib/libboost_program_options.dylib /projects/sems/install/Darwin10.11-x86_64/sems/tpl/boost/1.58.0/gcc/5.3.0/base/lib/libboost_system.dylib -Wl,-rpath,/Users/rabartl/Trilinos.base/BUILDS/CHECKIN/MPI_RELEASE_DEBUG_SHARED/packages/seacas/libraries/aprepro_lib
Undefined symbols for architecture x86_64:
  "stk::formatTime[abi:cxx11](double, unsigned long)", referenced from:
      stk::diag::MetricTraits<stk::diag::CPUTime>::format[abi:cxx11](double) in TimerMetricTraits.cpp.o
      stk::diag::MetricTraits<stk::diag::WallTime>::format[abi:cxx11](double) in TimerMetricTraits.cpp.o
  "stk::formatMemorySize[abi:cxx11](double)", referenced from:
      stk::diag::MetricTraits<stk::diag::HeapAlloc>::format[abi:cxx11](double) in TimerMetricTraits.cpp.o
  "stk::parallel_machine_rank(ompi_communicator_t*)", referenced from:
      stk::diag::(anonymous namespace)::printTable(stk::PrintTable&, stk::diag::Timer&, unsigned long, unsigned long, bool, ompi_communicator_t*) [clone .constprop.211] in PrintTimer.cpp.o
      stk::diag::printTimersTable(std::basic_ostream<char, std::char_traits<char> >&, stk::diag::Timer, unsigned long, bool, ompi_communicator_t*) in PrintTimer.cpp.o
  "stk::parallel_machine_size(ompi_communicator_t*)", referenced from:
      stk::diag::(anonymous namespace)::printTable(stk::PrintTable&, stk::diag::Timer&, unsigned long, unsigned long, bool, ompi_communicator_t*) [clone .constprop.211] in PrintTimer.cpp.o
      stk::diag::printTimersTable(std::basic_ostream<char, std::char_traits<char> >&, stk::diag::Timer, unsigned long, bool, ompi_communicator_t*) in PrintTimer.cpp.o
  "stk::diag::WriterThrowSafe::WriterThrowSafe(stk::diag::Writer&)", referenced from:
      sierra::Diag::WriterThrowSafe::WriterThrowSafe() in WriterRegistry.cpp.o
  "stk::diag::WriterThrowSafe::~WriterThrowSafe()", referenced from:
      sierra::Diag::WriterThrowSafe::~WriterThrowSafe() in WriterRegistry.cpp.o
      sierra::Diag::WriterThrowSafe::~WriterThrowSafe() in WriterRegistry.cpp.o
  "stk::diag::Trace::s_traceList", referenced from:
      stk::diag::WriterParser::parseArg(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const in WriterParser.cpp.o
[...]

I did not dig into this too far but @bathmatt reported that STK did not build on OSX with shared libraries turned on (see this comment in #811).

And the ROL failures look to be unrelated to the STK build failures. For example, the build of the example packages/rol/example/burgers-control/example_07.cpp shows the failure:

/Users/rabartl/Trilinos.base/Trilinos/packages/rol/example/burgers-control/example_07.hpp:1154:10: error: ‘uint’ was not declared in this scope
     for (uint i = 1; i < indices_.size(); i++) {
          ^
/Users/rabartl/Trilinos.base/Trilinos/packages/rol/example/burgers-control/example_07.hpp:1154:22: error: ‘i’ was not declared in this scope
     for (uint i = 1; i < indices_.size(); i++) {
                      ^
/Users/rabartl/Trilinos.base/Trilinos/packages/rol/example/burgers-control/example_07.hpp: In member function ‘void Objective_BurgersControl<Real>::gradient_1(ROL::Vector<Real>&, const ROL::Vector<Real>&, const\
 ROL::Vector<Real>&, Real&)’:
/Users/rabartl/Trilinos.base/Trilinos/packages/rol/example/burgers-control/example_07.hpp:1180:10: error: ‘uint’ was not declared in this scope
     for (uint i = 1; i < indices_.size(); i++) {
          ^
/Users/rabartl/Trilinos.base/Trilinos/packages/rol/example/burgers-control/example_07.hpp:1180:22: error: ‘i’ was not declared in this scope
     for (uint i = 1; i < indices_.size(); i++) {
                      ^
make[3]: *** [packages/rol/example/burgers-control/CMakeFiles/ROL_example_burgers-control_example_07.dir/example_07.cpp.o] Error 1

And there were a lot of test failures:


93% tests passed, 155 tests failed out of 2285

Label Time Summary:
Amesos               =  21.14 sec (13 tests)
Amesos2              =  10.70 sec (7 tests)
Anasazi              = 123.04 sec (71 tests)
AztecOO              =  19.40 sec (17 tests)
Belos                = 107.90 sec (61 tests)
Domi                 = 211.45 sec (125 tests)
Epetra               =  63.49 sec (61 tests)
EpetraExt            =  15.95 sec (10 tests)
FEI                  =  52.16 sec (43 tests)
Galeri               =   6.37 sec (9 tests)
GlobiPack            =   3.45 sec (6 tests)
Ifpack               =  69.40 sec (53 tests)
Ifpack2              =  51.86 sec (32 tests)
Intrepid             = 281.54 sec (152 tests)
Intrepid2            = 165.49 sec (107 tests)
Isorropia            =   9.23 sec (6 tests)
Kokkos               = 232.21 sec (21 tests)
ML                   =  58.04 sec (34 tests)
MueLu                = 335.85 sec (54 tests)
NOX                  = 159.61 sec (100 tests)
OptiPack             =   7.28 sec (5 tests)
Panzer               =  48.44 sec (125 tests)
Phalanx              =   9.73 sec (15 tests)
Pike                 =   6.59 sec (7 tests)
Piro                 =  29.37 sec (11 tests)
ROL                  = 863.89 sec (112 tests)
RTOp                 =  20.21 sec (24 tests)
Rythmos              = 228.57 sec (82 tests)
SEACAS               =  11.67 sec (8 tests)
STK                  =   0.00 sec (12 tests)
Sacado               = 123.56 sec (290 tests)
Shards               =   1.99 sec (4 tests)
ShyLU                =  10.51 sec (5 tests)
Stokhos              = 114.58 sec (74 tests)
Stratimikos          =  36.63 sec (39 tests)
Teko                 = 221.32 sec (19 tests)
Teuchos              = 106.96 sec (123 tests)
ThreadPool           =  26.56 sec (10 tests)
Thyra                =  89.82 sec (80 tests)
Tpetra               = 150.44 sec (122 tests)
TrilinosCouplings    =   0.00 sec (19 tests)
Triutils             =   2.78 sec (2 tests)
Xpetra               =  56.13 sec (16 tests)
Zoltan               = 255.78 sec (16 tests)
Zoltan2              = 141.88 sec (91 tests)

Total Test time (real) = 1049.18 sec

Because of the build failures, one would expect a lot of "Not Run" tests because the dependent executables don't even exist, and there were 106 of those "Not Run" tests. But even ignoring the "Not Run" tests, there were still 49 otherwise failing tests:

        534 - Zoltan_ch_simple_parmetis_parallel (Failed)
        1401 - Teko_testdriver_MPI_1 (Failed)
        1403 - Teko_testdriver_tpetra_MPI_1 (Failed)
        1805 - MueLu_FixedMatrixPattern-Tpetra_MPI_4 (Failed)
        1806 - MueLu_StandardReuse-Tpetra_MPI_4 (Failed)
        1807 - MueLu_ReuseSequenceTpetra_MPI_1 (Failed)
        1808 - MueLu_FixedMatrixPattern-Epetra_MPI_4 (Failed)
        1809 - MueLu_StandardReuse-Epetra_MPI_4 (Failed)
        1810 - MueLu_ReuseSequenceEpetra_MPI_1 (Failed)
        1811 - MueLu_MatrixDriver_MPI_4 (Failed)
        1812 - MueLu_Epetra1DLaplace_MPI_4 (Failed)
        1813 - MueLu_Tpetra1DLaplace_MPI_4 (Failed)
        1814 - MueLu_Stratimikos_MPI_4 (Failed)
        1815 - MueLu_Stratimikos2_MPI_4 (Failed)
        1816 - MueLu_LevelWrap-Tpetra_MPI_4 (Failed)
        1817 - MueLu_LevelWrap-Epetra_MPI_4 (Failed)
        1818 - MueLu_BlockCrs-Tpetra_MPI_4 (Failed)
        1819 - MueLu_Simple_MPI_4 (Failed)
        1820 - MueLu_MLParameterList-Epetra_MPI_4 (Failed)
        1821 - MueLu_MLParameterList_Repartition-Epetra_MPI_4 (Failed)
        1826 - MueLu_DriverTpetra_MPI_4 (Failed)
        1827 - MueLu_DriverTpetraILU_MPI_4 (Failed)
        1828 - MueLu_DriverTpetra_Milestone_MPI_4 (Failed)
        1829 - MueLu_RAPScalingTestTpetra_MPI_4 (Failed)
        1830 - MueLu_SmootherScalingTestTpetra_MPI_4 (Failed)
        1835 - MueLu_simple-factory-request-mechanism_MPI_4 (Failed)
        1836 - MueLu_Navier2D_Epetra_MPI_4 (Failed)
        1837 - MueLu_paramlist_MPI_4 (Failed)
        1838 - MueLu_paramlistAdv_MPI_4 (Failed)
        1839 - MueLu_Aggregation_MPI_4 (Failed)
        1840 - MueLu_simple1D-UncoupledAggregation-Tpetra_MPI_4 (Failed)
        1841 - MueLu_simple1D-UncoupledAggregation-Epetra_MPI_4 (Failed)
        1842 - MueLu_Viz3DTpetra_MPI_4 (Failed)
        1843 - MueLu_Viz2DTpetra_MPI_4 (Failed)
        1844 - MueLu_Driver_TogglePFactory_tent_tent_Epetra_MPI_4 (Failed)
        1845 - MueLu_Driver_TogglePFactory_sa_tent_Epetra_MPI_4 (Failed)
        1846 - MueLu_Driver_TogglePFactory_semi_tent_Epetra_MPI_4 (Failed)
        1847 - MueLu_Driver_TogglePFactory_tent_tent_Tpetra_MPI_4 (Failed)
        1848 - MueLu_Driver_TogglePFactory_sa_tent_Tpetra_MPI_4 (Failed)
        1849 - MueLu_Driver_TogglePFactory_semi_tent_Tpetra_MPI_4 (Failed)
        1850 - MueLu_Driver_TogglePFactory_semi_tent_line_Tpetra_MPI_4 (Failed)
        1851 - MueLu_Driver_TogglePFactory_semi_sa_line_easy_Tpetra_MPI_4 (Failed)
        1993 - Stokhos_nox_example_MPI_1 (Failed)
        2008 - Stokhos_uq_handbook_nonlinear_sg_example_MPI_1 (Failed)
        2009 - Stokhos_sacado_example_MPI_1 (Failed)
        2011 - Stokhos_sacado_ensemble_example_MPI_1 (Failed)
        2088 - ROL_example_diode-circuit_example_01_MPI_4 (Failed)
        2122 - ROL_example_binary-design_example_01_MPI_1 (Failed)
        2285 - PikeBlackBox_rxn_MPI_1 (Failed)

I looked at some of the failures and they are all very different with no pattern that I can see.

Here are the more detailed files:

bartlettroscoe commented 7 years ago

(2016/11/14)

To put it bluntly, the strategy of trying to use the SEMS env to create a consistent CI env between Linux and OSX is not going to work. This is shown by the results from the attempt to set this up and test it with the GCC 5.3.0 based stack from the SEMS env, shown in the above comments for:

This shows that even with what should be the same compilers, MPI, and TPL builds, you can still get very different behavior on these two platforms. While everything passes on Linux with the SEMS GCC 5.3.0 stack on this branch (see above), STK and ROL don't even build on OSX (see above). And of the tests that do build, 49 fail on OSX that passed on Linux. Most of the "Failed" (i.e. not "Not Run") tests are for MueLu, but there are also test failures for Zoltan, Teko, ROL, Stokhos, and Pike.

What this means is that if we tried to create a CI env based on this setup, when a Mac OSX Trilinos developer pulls a version of Trilinos that fully passed on Linux, that version may not even build on their Mac OSX machine; and if it does build, they could get test failures. Likewise, if all of the Trilinos tests pass for a Mac OSX developer when they push, they may not pass when a Trilinos Linux developer pulls those changes. This is not an effective development and CI strategy.

I see a few different possible approaches to creating a development and CI strategy when we must support a productive development and CI env for both Linux and OSX developers (ordered from easiest to hardest to set up and maintain the infrastructure):

1) Only allow pushing to Trilinos using the checkin-test-sems.sh script from a Sandia RHEL 6.0 machine using the SEMS env: That means that every Trilinos developer with push access would need to have access and time on a Sandia RHEL 6.0 machine in order to push changes to Trilinos. This also means that Mac OSX developers may experience build or test failures whenever they pull updates from the Trilinos 'develop' branch (even if the Linux CI build is 100% clean). This protects basic Trilinos customers that need Linux to work but it is not ideal for Trilinos Mac OSX developers.

2) Create a standard Docker Centos 6.0 container that duplicates the SEMS env for RHEL 6.0 and require Mac OSX developers to use that docker container to test and push Trilinos. This requires the setup and maintenance of a standard Centos 6.0 Docker container with the SEMS env installed on it. But once it was created and made available, Trilinos OSX developers could just install Docker and then use this container to test and push changes to Trilinos. And if there are failures, they can log into the docker container, fix the problems there, and push. Infrastructure would need to be set up to make this easy to fire off and documentation would need to be created to describe how to do it. This is really the same as option-1 but now a Trilinos OSX developer does not need to have access to a separate Linux machine to test and push from. They can just build, test, and push from their own OSX machine. While this approach guarantees that the Linux CI build will almost never fail, it would still allow the OSX CI build to fail because Trilinos Linux developers would not be testing against OSX before they push. This is not good for OSX developers, but at least it would keep Trilinos solid for most of the important Trilinos customers, for whom Linux is the most important platform.

3) Switch Trilinos to the git.git workflow where everyone develops on topic branches and merges into 'next' and only graduates a topic branch to 'develop' when the CI builds on both Linux and OSX pass: That might sound hard from an infrastructure standpoint but it is really not. All that is needed is to set up a single CI build on Linux and OSX that tests the 'next' branch and then leave it to developers to make sure that the full CI build and tests are passing on Linux and OSX before they "graduate" their individual topic branches. But that requires everyone to fix, ASAP, failures on their topic branch that break the Linux or OSX CI builds and tests. And it requires developers to manually maintain the 'next' branch and graduate their topic branches. Therefore, this puts most of the burden on individual developers and is a fairly complex git workflow. (But this is the workflow used by PETSc.)

4) Set up an automated test and push server that takes every individual push attempt by a developer and tests it against the Linux and OSX CI builds before merging to develop: That can be implemented with the GitHub PR mechanism like is used for the SST and MOOSE (INL) projects but it can also be implemented in other ways as well. But that takes a great deal of infrastructure to set that up and it requires the purchase and maintenance of machines to drive these CI builds. It also requires the setup of infrastructure to report results and allow Trilinos developers to reproduce the failures so that they can fix them. This can take a lot of work and maintenance to set up and support.

Given all of this, for now, I am going to go with option-1 above and do the following:

1) Go back to using the GCC 4.7.3 SEMS env on Linux for the CI build (but make it use GCC 5.3.0 on OSX so people can reproduce failures there) on the topic branch better-ci-build-482.

2) Clean up and rebase the topic branch better-ci-build-482 and push to 'develop'.

3) Add wiki documentation on usage of checkin-test-sems.sh and state that currently it must be run from a RHEL 6.0 machine with SEMS mounted (or synced).

4) Set up a matching post-push CI server on RHEL 6.0 machine crf450 to post to Trilinos CDash.

After that, everything will be set up for option-1 above, then we can talk where to go from there.

(SideNote: Assuming the build failures on OSX can be resolved easily, then implementing the checkin-test.py option --compare-to-control-build will allow for some existing failing tests but still allow the push (see TriBITSPub/TriBITS#152). That could be used to allow for perhaps pushing from OSX or Linux but this is not ideal. I also fear that this would allow the number of CI test failures to just creep up with no one doing anything about them. But this will be a topic for another Story Issue. We need to get some rational CI build in place and then go from there.)

mhoemmen commented 7 years ago

I think that if an individual developer can test the packages that they need to test on OS X using the check-in test script, they should be allowed to push from OS X. Why do we need a policy when developer discretion would do?

jwillenbring commented 7 years ago

I agree with @bartlettroscoe that reverting the Linux build to 4.7.3 and getting things 100% passing on Linux is the right next step. We will have to see where to go from there. This seems like a conversation for an upcoming leaders meeting.

@mhoemmen I don't think that Ross meant that he would fix up the build and then a policy prohibiting pushes from OS X would go into effect pending further action. He said things would be set up for option 1, and then we would need to figure out where to go from there.

mhoemmen commented 7 years ago

@jwillenbring wrote:

@mhoemmen I don't think that Ross meant that he would fix up the build and then a policy prohibiting pushes from OS X would go into effect pending further action. He said things would be set up for option 1, and then we would need to figure out where to go from there.

OK, got it. Thanks for clarifying :-)

bartlettroscoe commented 7 years ago

I think that if an individual developer can test the packages that they need to test on OS X using the check-in test script, they should be allowed to push from OS X. Why do we need a policy when developer discretion would do?

@mhoemmen, if you try to test and push from OSX (or at least a machine like gaia), then you will not be able to, because STK and ROL don't even build on OSX right now with the standard CI build. Are you not seeing the failing tests for MueLu and other packages that are shown above?

Anyway, after I push this to develop you can give it a try and see what happens on your machine.

mhoemmen commented 7 years ago

If I add a test or example to Tpetra that doesn't affect downstream packages, I will use --no-enable-fwd-packages with the check-in test script. If I'm a ROL developer and I can get ROL to build and pass tests with my Mac, why shouldn't I use my Mac to test?

I do most of my development on Linux, so this usually does not affect me. However, Tpetra has a new developer, @tjfulle , who uses Mac. If @tjfulle can get builds to pass on his Mac, why should you stop him from using his Mac? The same should apply for Windows, Linux, AIX, or one's operating system of choice.

kddevin commented 7 years ago

Is there a CMAKE macro identifying the architecture? We know ParMETIS uses different random numbers on Mac vs Linux, so Zoltan's Mac answers don't pass their diff. If there is a CMAKE macro that identifies the Mac, we can disable the particular Zoltan test when running on a Mac. Would disabling that test be helpful?

jwillenbring commented 7 years ago

@mhoemmen I would argue there is an element here that goes beyond getting the tests to pass locally on one person's machine. We would like to have some confidence that the tests will also pass for others using common configurations (or "the configuration", perhaps more accurately). That's why Ross was trying to see if using GCC X.Y.Z on both Linux and Mac could give us confidence that everything passing on one machine would lead to an extremely high likelihood of the tests passing on a machine running the other OS. This would be very valuable. Unfortunately, as Ross documented, it seems the SEMS NFS mount isn't going to provide that (not bashing the NFS mount, it just doesn't seem that approach is going to work).

bartlettroscoe commented 7 years ago

First, note that it does not seem we have even a single OSX nightly build for Trilinos currently.

That is bad if you are an OSX developer or customer.

If @tjfulle can get builds to pass on his Mac, why should you stop him from using his Mac? The same should apply for Windows, Linux, AIX, or one's operating system of choice.

A couple problems with that:

First, changing Tpetra requires testing Panzer and ROL. Testing Panzer requires enabling STK. STK and ROL do not build on OSX with the standard CI build. And of the packages that do build, there are 49 test failures. So if anyone tries to run the updated, more complete pre-push PT CI build, it will fail for them and stop their push.

Second, just because you have all passing builds and tests on OSX does not mean that you will have all passing builds and tests on Linux, and vice versa. That is a very bad development and CI model. That makes OSX very different from, say, Windows and AIX. No one is doing a lot of development on Windows or AIX, so we can rely on nightly builds and a slower turn-around time to fix those. But if you do the majority of your development on Linux or OSX, you need very high confidence that when you pull code from 'develop', it will build and pass all (or almost all) of its tests (but we can relax the 100% passing test criteria just a little once we add the --compare-to-control-build option in TriBITSPub/TriBITS#152).

Is there a CMAKE macro identifying the architecture?

Yes, look at the CMake variable CMAKE_HOST_SYSTEM_NAME. For Linux it is Linux and for Mac OSX it is Darwin. You can see that printed for Linux (https://github.com/trilinos/Trilinos/files/587417/configure.out.txt) and for OSX (https://github.com/trilinos/Trilinos/files/587538/configure.out.txt).

We know ParMETIS uses different random numbers on Mac vs Linux, so Zoltan's Mac answers don't pass their diff. If there is a CMAKE macro that identifies the Mac, we can disable the particular Zoltan test when running on a Mac. Would disabling that test be helpful?

Perhaps. The only concern is that by disabling that test, you might allow people to break it on OSX without knowing it. But if Zoltan developers only push from Linux machines, then there is no risk of that happening. Anyway, there was just a single failing Zoltan test, Zoltan_ch_simple_parmetis_parallel, as shown above (https://github.com/trilinos/Trilinos/issues/482#issuecomment-260158990), so that does not appear to be a widespread problem.
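
If we did decide to disable it, a minimal sketch of what that could look like from the configure side is below. This assumes the TriBITS -D\<fullTestName\>_DISABLE=ON convention applies and uses the test name from above; the actual registered ctest name may carry an extra suffix.

```bash
# Sketch only: disable the one Mac-failing Zoltan test at configure time when the
# host OS is Darwin (assumes the TriBITS <fullTestName>_DISABLE=ON convention).
EXTRA_TEST_DISABLES=""
if [ "$(uname -s)" = "Darwin" ]; then
  EXTRA_TEST_DISABLES="-DZoltan_ch_simple_parmetis_parallel_DISABLE=ON"
fi
cmake \
  -D Trilinos_ENABLE_Zoltan=ON \
  -D Trilinos_ENABLE_TESTS=ON \
  ${EXTRA_TEST_DISABLES} \
  ${TRILINOS_DIR:-..}
```

The same check could instead live in Zoltan's own CMakeLists files, keyed off CMAKE_HOST_SYSTEM_NAME as discussed above.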

I am going to create new specific Trilinos Issues for the build and test failures that I am seeing on OSX and then people can refer to those for more details.

The more immediate problem for OSX is that we have zero automated testing to support and protect it. To that end, I am wondering if someone might give us access to an OSX machine where we can run the new OSX CI build to help protect OSX developers and users? Just having a single Nightly build would be good but if I was an OSX developer, I would like to see a CI build set up.

Anyway, I will send out a survey soon to Trilinos developers to see who develops on OSX and who has access to a Linux RHEL 6 machine with the SEMS env. That should help us understand how widespread the OSX problem is.

mhoemmen commented 7 years ago

@bartlettroscoe wrote:

That is bad if you are an OSX developer or customer.

I agree. We do have customers who care about OS X builds. If we don't want to buy an OS X workstation for us, it might make sense for them to chip in, just as customers have done for Windows in the past.

tjfulle commented 7 years ago

@bartlettroscoe and @mhoemmen, I could run testing on my Mac workstation (maybe not nightly, but weekly?). I've got a pretty good environment set up that seems to be behaving well. I will write up some instructions when I'm more satisfied (but the gist is to avoid homebrew and/or SEMS!)

I've had terrible luck with the SEMS for OS X. I've given up using it. One problem is that the 64 bit build of parmetis seems to be broken on Darwin. If I build my own 64 bit version of parmetis, the developer's standalone tests don't pass (as well as many Trilinos tests, as already noted). If I build my own 32 bit version, the parmetis tests pass and the Trilinos tests that failed due to parmetis also pass. SEMS defaults to the 64 bit build.

Another problem I have encountered is that the cmake configure step does not seem to honor the DYLD_LIBRARY_PATH environment variable, nor the explicitly defined library paths (-D<TPL_NAME>_LIBRARY_PATH=...). When I run the configure step, the log file says it can find, e.g., libparmetis.dylib, but when it actually tests parmetis to verify that the minimum version is installed, I get errors that the linker could not find the library. I have had to disable a few configure-time tests to get past this. After that, things build/test as usual.
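
For concreteness, what I have been trying is a configure fragment roughly like the following (the paths are made up, and I am assuming the usual TriBITS TPL_\<TPLNAME\>_INCLUDE_DIRS / TPL_\<TPLNAME\>_LIBRARIES override variables are the right ones to set):

```bash
# Sketch only: explicitly point the Trilinos configure at a particular ParMETIS
# install instead of relying on DYLD_LIBRARY_PATH (paths are illustrative; the
# TPL_* override variable names are assumed from the usual TriBITS conventions).
PARMETIS_ROOT=${HOME}/opt/parmetis-4.0.3
cmake \
  -D TPL_ENABLE_ParMETIS=ON \
  -D TPL_ParMETIS_INCLUDE_DIRS=${PARMETIS_ROOT}/include \
  -D "TPL_ParMETIS_LIBRARIES=${PARMETIS_ROOT}/lib/libparmetis.dylib;${PARMETIS_ROOT}/lib/libmetis.dylib" \
  ${TRILINOS_DIR:-..}
```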

A related problem is that if a library of the same name (as a library needed for the configure-time test) is installed in /usr/local/lib, it will get picked up (even if I told cmake to use a different version). This has led to a few false-positive and false-negative configure results. I now avoid homebrew (and installing anything in /usr/local) because I never know which library is being used by cmake/make. But this necessitates disabling a few configure-time tests so that I can get past the aforementioned cmake issues.

Like I said, I need to document all of this somewhere!

bartlettroscoe commented 7 years ago

@tjfulle,

We are really looking to set up standard testing using SEMS on OSX. We don't want any specialization of the env that can't be easily replicated across all OSX machines. Specialization just makes things worse (from a standardized development and testing perspective).

I've had terrible luck with the SEMS for OS X. I've given up using it. One problem is that the 64 bit build of parmetis seems to be broken on Darwin. If I build my own 64 bit version of parmetis, the developer's standalone tests don't pass (as well as many Trilinos tests, as already noted). If I build my own 32 bit version, the parmetis tests pass and the Trilinos tests that failed due to parmetis also pass. SEMS defaults to the 64 bit build.

We really need 64 bit ParMETIS to test SEACAS (see #158). Which Trilinos tests that depend on ParMETIS are failing for you? Are they the same MueLU tests shown above? We need to work with the SEMS team to fix issues that we find.

Another problem I have encountered is that the cmake configure step does not seem to honor the DYLD_LIBRARY_PATH environment variable, nor the explicitly defined library paths (-D<TPL_NAME>_LIBRARY_PATH=...). When I run the configure step, the log file says it can find, e.g., libparmetis.dylib, but when it actually tests parmetis to verify that the minimum version is installed, I get errors that the linker could not find the library. I have had to disable a few configure-time tests to get past this. After that, things build/test as usual.

The file SEMSDevEnv.cmake takes care of finding the right TPLs. We don't want other specializations if we can avoid it. Once I push all of this to 'develop', it would be great if you could test this out and see what happens on your machine using the stock SEMS env on OSX.
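
To be concrete, the intent is that a configure on a SEMS-mounted machine needs nothing beyond pre-loading that one fragment, roughly like the sketch below (treat the path to SEMSDevEnv.cmake as approximate):

```bash
# Sketch only: pre-load the SEMS-based settings so the configure picks up the
# standard compilers and TPLs (the location of SEMSDevEnv.cmake is approximate).
TRILINOS_DIR=${TRILINOS_DIR:-${HOME}/Trilinos}  # path to the Trilinos source tree
cmake \
  -C ${TRILINOS_DIR}/cmake/std/sems/SEMSDevEnv.cmake \
  -D Trilinos_ENABLE_Tpetra=ON \
  -D Trilinos_ENABLE_TESTS=ON \
  ${TRILINOS_DIR}
```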

A related problem is that if a library of the same name (as a library needed for the configure-time test) is installed in /usr/local/lib, it will get picked up (even if I told cmake to use a different version). This has led to a few false-positive and false-negative configure results. I now avoid homebrew (and installing anything in /usr/local) because I never know which library is being used by cmake/make. But this necessitates disabling a few configure-time tests so that I can get past the aforementioned cmake issues.

We can look at that in more detail once I push the updated PT SEMS-based CI build for OSX.

mhoemmen commented 7 years ago

I'm a little bit frustrated with this one-size-fits-all approach. Why should @tjfulle have to build Seacas in order to add a test to Tpetra that does not affect downstream code? In what way does that improve the quality of Trilinos?

dridzal commented 7 years ago

Ross, are the ROL build failures something trivial, or is this due to all optional dependencies? If these are minor issues, we could take care of them, although I don't know if this would help with the big-picture tasks here.

Denis

bartlettroscoe commented 7 years ago

I'm a little bit frustrated with this one-size-fits-all approach. Why should @tjfulle have to build Seacas in order to add a test to Tpetra that does not affect downstream code? In what way does that improve the quality of Trilinos?

This is all in one repo and it is continuously integrated. Tpetra impacts downstream packages like Panzer, and Panzer needs STK to be fully tested. It is as simple as that. If someone changes Tpetra in a way that breaks Panzer and pushes, and then a Panzer developer pulls the updated Trilinos, they can't work because Panzer is broken. Who does that help? Continuous Integration is Continuous Integration. If you want to do some type of Staged Integration for Trilinos, I would like to know what you propose. CI is super simple and everything else is more complex.

tjfulle commented 7 years ago

@bartlettroscoe and @mhoemmen, I don't mind working towards more unified build/test criteria (though, as an open source project, where does that put non-Sandia developers?). But in this case, it seems the best approach would be to determine a Mac build environment that actually works and then tell SEMS how to duplicate/push it out. I think I'm pretty close...

bartlettroscoe commented 7 years ago

If you are only adding a test and not changing Tpetra library code, of course you don't need to test any downstream packages. That is what the --no-enable-fwd-packages argument is for.
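
For a test-only Tpetra change, that would look something like the sketch below (assuming checkin-test-sems.sh just forwards these standard checkin-test.py options; the CHECKIN directory is the usual convention, not a requirement):

```bash
# Sketch only: test-only change to Tpetra, so skip enabling forward (downstream)
# packages; run from a scratch directory (CHECKIN is the usual convention).
cd Trilinos/CHECKIN
../checkin-test-sems.sh \
  --enable-packages=Tpetra \
  --no-enable-fwd-packages \
  --do-all --push
```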

But if you are changing library code that has any reasonable chance of breaking a CI-integrated downstream package, then it is your duty to test CI-integrated downstream packages before you push. It is all for one and one for all (that is the nature of CI). But assuming that we have it worked out that SEACAS and STK work on the build/test/push platform for the pre-push CI build, then why would someone care if they enable all downstream packages? It just takes a little extra computer time to test them before the push. What is the harm?

There are other workflows that can be used to guarantee that downstream package developers will not get broken code when they pull from 'develop', but they are all more complex in some way or another than just letting the checkin-test script enable all downstream packages. If Trilinos implements the PR workflow that SST uses, for example, then Trilinos developers will have no choice; every downstream package will get tested even if you only change a comment in a test. There will be zero developer discretion about what gets tested.

We need to discuss this process in detail at the next Trilinos Leaders Meeting.

bartlettroscoe commented 7 years ago

I don't mind working towards more unified build/test criteria (though, as an open source project, where does that put non-Sandia developers?).

A Docker container based on an open CentOS 6 or 7 OS will make this 100% available to everyone.

But, in this case, it seems the best approach would be to determine a Mac build environment that actually works and then tell SEMS how to duplicate/push it out. I think I'm pretty close...

That is something we need to work with the SEMS team about.

bartlettroscoe commented 7 years ago

are the ROL build failures something trivial, or is this due to all optional dependencies? If these are minor issues, we could take care of them, although I don't know if this would help with the big-picture tasks here.

@dridzal, the ROL build failures look trivial and easy to fix. It looks like the type uint is not supported by this OSX GCC 5.3.0 compiler. But that is just the point: we don't get consistency between Linux and OSX even if we use the exact same compiler sources. That is the fundamental problem.

bartlettroscoe commented 7 years ago

If I add a completely new feature to Tpetra, in a few header files that no downstream packages use yet, and I add a test to Tpetra that exercises the new header files and new feature, why do I need to test Seacas and STK?

I would still test downstream packages in that case. But if you create a new Tpetra subpackage that no downstream packages are using yet and put the code there, then there would be no downstream packages to test. Again, if you only push once a day (and that is all most people need to push), what is the harm in letting the checkin-test.py script test all downstream packages? Your computer is likely going to be sitting idle most of the day/night anyway. Might as well put it to good use.

bartlettroscoe commented 7 years ago

This new CI build eliminates the difficulty of building any PT Trilinos package on a machine that has the SEMS env mounted. You just run checkin-test-sems.sh and it magically works (as long as the SEMS env is mounted).
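
In other words, for the common case the whole pre-push cycle should reduce to something like the following sketch (option spellings are the standard checkin-test.py ones, assuming the wrapper passes them through):

```bash
# Sketch only: the common pre-push cycle; the script detects the PT packages your
# local commits touch, configures/builds/tests them against the SEMS env, and
# pushes only if everything passes.
cd Trilinos/CHECKIN
../checkin-test-sems.sh --do-all --push
```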

tjfulle commented 7 years ago

I got the failing MueLU tests to pass on Darwin with 64 bit integers in metis. The SEMS version of metis is compiled with 64 bit integers and 64 bit reals. My version is compiled with only 64 bit integers (not reals). It seems the problems stem from the 64 bit reals, though I don't know why. I'm not sure if downstream packages need the 64 bit reals in metis.
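
For reference, by "only 64 bit integers (not reals)" I mean a build roughly like the following sketch (the METIS version and install prefix are just examples; the widths are controlled by IDXTYPEWIDTH and REALTYPEWIDTH in include/metis.h):

```bash
# Sketch only: METIS with 64-bit indices (IDXTYPEWIDTH=64) but 32-bit reals
# (REALTYPEWIDTH left at its default of 32), which is the combination that
# passed for me.
cd metis-5.1.0
sed -i.bak -e 's/#define IDXTYPEWIDTH 32/#define IDXTYPEWIDTH 64/' include/metis.h
make config shared=1 prefix=${HOME}/opt/metis-5.1.0
make install
```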

bartlettroscoe commented 7 years ago

I got the failing MueLU tests to pass on Darwin with 64 bit integers in metis. The SEMS version of metis is compiled with 64 bit integers and 64 bit reals. My version is compiled with only 64 bit integers (not reals). It seems the problems stem from the 64 bit reals, though I don't know why. I'm not sure if downstream packages need the 64 bit reals in metis.

That is interesting. I wonder if 64 bit reals (i.e., double precision) are being used in the Linux builds of METIS? What say the @trilinos/muelu developers?

tjfulle commented 7 years ago

The test suites that ship with metis and parmetis fail to run on Darwin if 64 bit reals are enabled in metis