trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

Remove warnings from Thyra and Stratimikos impacting CUDA #1140

Closed bartlettroscoe closed 7 years ago

bartlettroscoe commented 7 years ago

This will take care of Thyra and Stratimikos warnings reported in #1133.

bartlettroscoe commented 7 years ago

CC: @trilinos/thyra, @trilinos/stratimikos

I am getting to work on this now.

bartlettroscoe commented 7 years ago

I was able to reproduce the Thyra warnings for the NVCC/CUDA build on shiller using the env provided in #1133. I extracted this into the scripts under Trilinos:

cmake/ctest/drivers/ATTB/attb_cuda_config_base.sh
cmake/ctest/drivers/ATTB/attb_cuda_config_for_drekar.sh
cmake/ctest/drivers/ATTB/load_attb_cuda_env.sh

and used these to reproduce warnings in Thyra and fix them. See details below.

I created the new macro TEUCHOS_NONREACHABLE_RETURN(RETURN_VAL) and used it to remove all of the "statement is unreachable" warnings.

Next I will remove all of the warnings for Stratimikos for NVCC/CUDA.


DETAILED NOTES:

(2017/03/16)

@bathmatt provided the pretty complete configure script above. It even has the module loads, so it is self-contained. The only piece of info missing is which machine these builds are done on. My guess is the ATTB machines hansen and shiller.

First, I need to separate this configure script into its three logical parts:

1) The setup of the env and modules

2) The specification of the env passed to cmake

3) The list of Trilinos packages to enable/disable

I broke up this script into the three files:

I pushed this to the branch:

as of commit 1b2f8ce.

I then tried to reproduce the configure and build with:

$ time ./do-configure -DTrilinos_ENABLE_TESTS=ON \
  -DTrilinos_ENABLE_Thyra=ON -DTrilinos_ENABLE_Stratimikos=ON \
  &> configure.out

real    1m56.765s
user    1m5.000s
sys     0m11.434s

$ time make -j10 &> make.out

real    12m31.788s
user    74m36.931s
sys     19m48.565s

That build failed with link failures like:

[100%] Linking CXX executable ThyraCore_DefaultBlockedLinearOpUnitTests.exe
/home/projects/x86-64-haswell/blas/20150602/gcc/4.8.4/lib/libblas.a(xerbla.o): In function `xerbla_':
xerbla.f:(.text+0x52): undefined reference to `_gfortran_st_write'
xerbla.f:(.text+0x5d): undefined reference to `_gfortran_string_len_trim'
xerbla.f:(.text+0x6f): undefined reference to `_gfortran_transfer_character_write'
xerbla.f:(.text+0x7f): undefined reference to `_gfortran_transfer_integer_write'
xerbla.f:(.text+0x87): undefined reference to `_gfortran_st_write_done'
xerbla.f:(.text+0x90): undefined reference to `_gfortran_stop_string'
collect2: error: ld returned 1 exit status
make[3]: *** [packages/thyra/core/test/operator_vector/ThyraCore_DefaultBlockedLinearOpUnitTests.exe] Error 1

It seems the build configuration is missing -lgfortran. I added that, which fixed the link failures, and pushed the change in commit c9c4ef7.

Now configuring and building from scratch:

$ time ./do-configure -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Thyra=ON -DTrilinos_ENABLE_Stratimikos=ON &> configure.out && time make -j16 &> make.out ; ~/mailmsg.py "Finished CUDA build for Thyra and Startimikos on shiller"

real    1m46.340s
user    1m2.969s
sys     0m11.209s

real    6m18.293s
user    52m6.997s
sys     22m53.191s

Now that actually built and linked everything.

Now running the tests:

$ ctest -j10

[ ... ]

97% tests passed, 3 tests failed out of 119

Label Time Summary:
Stratimikos    =  43.38 sec (39 tests)
Thyra          =  69.01 sec (80 tests)

Total Test time (real) =  24.22 sec

The following tests FAILED:
         78 - ThyraTpetraAdapters_TpetraThyraWrappersUnitTests_MPI_4 (Failed)
         79 - ThyraTpetraAdapters_TpetraThyraWrappersUnitTests_serial_MPI_1 (Failed)
         80 - ThyraTpetraAdapters_Simple2DTpetraModelEvaluatorUnitTests_MPI_1 (Failed)
Errors while running CTest

real    0m24.276s
user    0m41.490s
sys     0m50.463s

Looks like you can't actually run tests that need CUDA on the build node. You get, for example:

1. TpetraThyraWrappers_double_createVectorSpace_UnitTest ...

 p=0: *** Caught standard std::exception of type 'std::runtime_error' :

  cudaGetDeviceCount( & m_cudaDevCount ) error( cudaErrorInsufficientDriver): CUDA driver version is insufficient for CUDA runtime version
/home/rabartl/Trilinos.base/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Impl.cpp:207
  Traceback functionality not available

 [FAILED]  (0.000584 sec) TpetraThyraWrappers_double_createVectorSpace_UnitTest
 Location: /home/rabartl/Trilinos.base/Trilinos/packages/thyra/adapters/tpetra/test/TpetraThyraWrappers_UnitTests.cpp:220

So how do I run tests on shiller? Where do I find this documentation? I searched the snl-wiki for "shiller" and found where the documentation on the machine itself lives. After grabbing interactive nodes, I was able to run:

$ ctest -j2 -R ^ThyraTpetraAdapters.*
Test project /home/rabartl/Trilinos.base/BUILDS/CUDA/MPI_RELEASE_CUDA
    Start 80: ThyraTpetraAdapters_Simple2DTpetraModelEvaluatorUnitTests_MPI_1
    Start 79: ThyraTpetraAdapters_TpetraThyraWrappersUnitTests_serial_MPI_1
1/3 Test #80: ThyraTpetraAdapters_Simple2DTpetraModelEvaluatorUnitTests_MPI_1 ...   Passed   15.06 sec
2/3 Test #79: ThyraTpetraAdapters_TpetraThyraWrappersUnitTests_serial_MPI_1 .....   Passed   15.76 sec
    Start 78: ThyraTpetraAdapters_TpetraThyraWrappersUnitTests_MPI_4
3/3 Test #78: ThyraTpetraAdapters_TpetraThyraWrappersUnitTests_MPI_4 ............   Passed   17.52 sec

100% tests passed, 0 tests failed out of 3

Label Time Summary:
Thyra    =  48.34 sec (3 tests)

Total Test time (real) =  33.44 sec

Wow, shiller is super slow!

Anyhow, I can finally see the warnings that were produced when building Thyra and Stratimikos tests.

Looking for unique warnings from Teuchos first:

$ grep warning make.out | grep packages/teuchos | sort | uniq | less
/home/rabartl/Trilinos.base/Trilinos/packages/teuchos/numerics/src/Teuchos_SerialDenseMatrix.hpp:455:11: warning: ISO C++ does not support variable-length array types [-Wvla]
/home/rabartl/Trilinos.base/Trilinos/packages/teuchos/numerics/src/Teuchos_SerialDenseMatrix.hpp:659:11: warning: ISO C++ does not support variable-length array types [-Wvla]
/home/rabartl/Trilinos.base/Trilinos/packages/teuchos/numerics/src/Teuchos_SerialDenseMatrix.hpp:679:11: warning: ISO C++ does not support variable-length array types [-Wvla]
/home/rabartl/Trilinos.base/Trilinos/packages/teuchos/numerics/src/Teuchos_SerialSymDenseMatrix.hpp:488:11: warning: ISO C++ does not support variable-length array types [-Wvla]
/home/rabartl/Trilinos.base/Trilinos/packages/teuchos/numerics/src/Teuchos_SerialTriDiMatrix.hpp:413:11: warning: ISO C++ does not support variable-length array types [-Wvla]
/home/rabartl/Trilinos.base/Trilinos/packages/teuchos/numerics/src/Teuchos_SerialTriDiMatrix.hpp:431:11: warning: ISO C++ does not support variable-length array types [-Wvla]
/home/rabartl/Trilinos.base/Trilinos/packages/teuchos/numerics/src/Teuchos_SerialTriDiMatrix.hpp:456:11: warning: ISO C++ does not support variable-length array types [-Wvla]
/home/rabartl/Trilinos.base/Trilinos/packages/teuchos/numerics/src/Teuchos_SerialTriDiMatrix.hpp:468:11: warning: ISO C++ does not support variable-length array types [-Wvla]
/home/rabartl/Trilinos.base/Trilinos/packages/teuchos/numerics/src/Teuchos_SerialTriDiMatrix.hpp:492:11: warning: ISO C++ does not support variable-length array types [-Wvla]
/home/rabartl/Trilinos.base/Trilinos/packages/teuchos/numerics/src/Teuchos_SerialTriDiMatrix.hpp:571:51: warning: ISO C++ does not support variable-length array types [-Wvla]

I need to create a separate issue for that.
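
For context on what -Wvla is flagging, here is a minimal sketch. It is not the Teuchos code (the lines reported above are not shown in this issue); the function name columnSums and its arguments are made up for illustration. It assumes the warnings come from a stack array whose size is a run-time value, which GCC accepts as an extension, and shows std::vector as one standard-conforming replacement.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not taken from Teuchos: sums each column of a
// row-major numRows x numCols matrix into a scratch buffer whose size
// is only known at run time.
inline std::vector<double> columnSums(const double* values,
                                      std::size_t numRows,
                                      std::size_t numCols)
{
  // A declaration such as `double scratch[numCols];` would be a
  // variable-length array. GCC accepts it as an extension, and -Wvla
  // reports it as not supported by ISO C++.
  // std::vector provides the same run-time sizing in standard C++.
  std::vector<double> scratch(numCols, 0.0);
  for (std::size_t r = 0; r < numRows; ++r) {
    for (std::size_t c = 0; c < numCols; ++c) {
      scratch[c] += values[r * numCols + c];
    }
  }
  return scratch;
}
```

Whether std::vector is the right replacement in the Teuchos headers depends on how hot those code paths are, so that is something to settle in the separate issue.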

Now to look at Thyra warnings:

$ grep warning make.out | grep packages/thyra | sort | uniq  | wc -l
36

Of these, almost all of them are "statement is unreachable" as shown by:

$ grep warning make.out | grep packages/thyra | grep -v "statement is unreachable" | sort | uniq
/home/rabartl/Trilinos.base/Trilinos/packages/thyra/adapters/epetra/src/Thyra_EpetraOperatorViewExtractorStd.cpp(61): warning: variable "eFwdOp" was set but never used

That leaves 35 "statement is unreachable" warnings.

Many of these "statement is unreachable" warnings come from code like:

Thyra::ModelEvaluatorBase::EDerivativeMultiVectorOrientation
Thyra::convert(
  const EpetraExt::ModelEvaluator::EDerivativeMultiVectorOrientation &mvOrientation
  )
{
  switch(mvOrientation) {
    case EpetraExt::ModelEvaluator::DERIV_MV_BY_COL :
      return ModelEvaluatorBase::DERIV_MV_BY_COL;
    case EpetraExt::ModelEvaluator::DERIV_TRANS_MV_BY_ROW :
      return ModelEvaluatorBase::DERIV_TRANS_MV_BY_ROW;
    default:
      TEUCHOS_TEST_FOR_EXCEPT(true);
  }
  return ModelEvaluatorBase::DERIV_MV_BY_COL; // Should never be called!
}

Grepping for that comment shows there are 15 of these:

$ find . -name "*pp" -exec grep -nHi "Should never be called" {} \; | wc -l
15

So that is where I will start.

What I want to do here is to create a macro TEUCHOS_NONREACHABLE_RETURN(RETURN_VAL) and then use this in all such returns.
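
A rough sketch of how such a macro could work is below. This is an assumption about the approach, not the definition that went into Teuchos: the macro name SKETCH_NONREACHABLE_RETURN, the two enums, and convertOrientation are all invented for illustration, and a plain throw stands in for TEUCHOS_TEST_FOR_EXCEPT(true). The idea is to drop the trailing return when compiling with NVCC (detected through the __NVCC__ predefined macro), since NVCC reports it as unreachable, and to keep it for other compilers.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for the macro described above; NOT the Teuchos
// definition. Under NVCC the macro expands to nothing, so there is no
// statement left to flag as "statement is unreachable". Other compilers
// still see an explicit return at the end of the function.
#if defined(__NVCC__)
#  define SKETCH_NONREACHABLE_RETURN(RETURN_VAL)
#else
#  define SKETCH_NONREACHABLE_RETURN(RETURN_VAL) return RETURN_VAL
#endif

// Illustrative enums that mirror the shape of the convert() function above.
enum class SrcOrientation { ByCol, TransByRow };
enum class DstOrientation { ByCol, TransByRow };

inline DstOrientation convertOrientation(SrcOrientation in)
{
  switch (in) {
    case SrcOrientation::ByCol:
      return DstOrientation::ByCol;
    case SrcOrientation::TransByRow:
      return DstOrientation::TransByRow;
    default:
      // Stand-in for TEUCHOS_TEST_FOR_EXCEPT(true): always throws.
      throw std::logic_error("convertOrientation: unknown orientation");
  }
  // Never executed; kept only for compilers that want a return here.
  SKETCH_NONREACHABLE_RETURN(DstOrientation::ByCol);
}
```

With a macro like this, each "Should never be called!" return becomes a one-line change, and the intent is visible at the call site instead of in a comment.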

I removed all of the warnings for Thyra and pushed to the 'develop' branch.

bartlettroscoe commented 7 years ago

I finished cleaning up the warnings from Thyra and Stratimikos for NVCC and pushed to the Trilinos 'develop' branch (see below).

@bathmatt, I am closing this as complete, as I am pretty sure that you will not see any more warnings from NVCC on shiller coming from building Thyra or Stratimikos. There are no warnings for GCC 4.8.3 either.


DETAILED NOTES:

(2017/03/17)

I cleaned up all of the warnings for Stratimikos for NVCC and pushed to the branch 'thyra-stratimikos-cuda-warnings-1140'.

Now I want to verify that all warnings for Thyra and Stratimikos are removed for the Trilinos build for Drekar. For that, on shiller, I do:

$ cd /home/rabartl/Trilinos.base/BUILDS/CUDA/MPI_RELEASE_CUDA
$ ln -s ../../../Trilinos/cmake/ctest/drivers/ATTB/attb_cuda_config_base.sh .
$ ln -s ../../../Trilinos/cmake/ctest/drivers/ATTB/attb_cuda_config_for_drekar.sh .

$ rm -r CMake*

$ time ./attb_cuda_config_for_drekar.sh &> configure.out

real    3m32.358s
user    1m40.306s
sys     0m15.161s

$ time make -j10 &> make.out

real    685m28.870s
user    2382m5.324s
sys     62m22.490s

Wow, almost 11.5 hours of wall-clock time to build these packages on 10 cores! That is fantastic!

Looking for any remaining Thyra or Stratimikos warnings:

$ grep "warning: " make.out | grep "\(packages/stratimikos\|thyra\|Thyra\)" | sort | uniq  | less
/home/rabartl/Trilinos.base/Trilinos/packages/nox/src-loca/src-thyra/LOCA_Thyra_GroupWrapper.H(66): warning: overloaded virtual function "LOCA::Thyra::Group::operator=" is only par
/home/rabartl/Trilinos.base/Trilinos/packages/nox/src-thyra/NOX_Thyra_Group.C(486): warning: statement is unreachable
/home/rabartl/Trilinos.base/Trilinos/packages/nox/src-thyra/NOX_Thyra_Group.C(695): warning: statement is unreachable
/home/rabartl/Trilinos.base/Trilinos/packages/nox/src-thyra/NOX_Thyra_Group.C(859): warning: statement is unreachable
/home/rabartl/Trilinos.base/Trilinos/packages/xpetra/sup/Utils/Xpetra_ThyraUtils.hpp(942): warning: dynamic initialization in unreachable code
/home/rabartl/Trilinos.base/Trilinos/packages/xpetra/sup/Utils/Xpetra_ThyraUtils.hpp(943): warning: statement is unreachable

I could fix the ones in NOX but I think Roger is already taking care of those (see #1139). As for the warnings in Xpetra, those should be fixed by an Xpetra developer.

Looking at the remaining warnings on this branch and the size of the make output:

[rabartl@shiller01 MPI_RELEASE_CUDA (master)]$ grep "warning: " make.out | wc -l
148033

[rabartl@shiller01 MPI_RELEASE_CUDA (master)]$ grep "warning: " make.out | sort | uniq | wc -l
21409

[rabartl@shiller01 MPI_RELEASE_CUDA (master)]$ du -sh make.out 
562M    make.out

Wow, still over 20K unique warnings!

(2017/03/20)

I then went back to crf450 and built and tested Teuchos, Thyra, and Stratimikos with GCC 4.8.3 with:

$ time ./do-configure -DTrilinos_ENABLE_Teuchos=ON \
  -DTrilinos_ENABLE_Thyra=ON  -DTrilinos_ENABLE_Stratimikos=ON \
  &> configure.out

real    0m13.419s
user    0m8.696s
sys     0m2.616s

$ time make -j32 &> make.out

real    9m50.045s
user    94m45.685s
sys     8m1.349s

$ ctest -j32

...

100% tests passed, 0 tests failed out of 243

Label Time Summary:
Stratimikos    =  29.59 sec (39 tests)
Teuchos        =  17.14 sec (124 tests)
Thyra          =  10.07 sec (80 tests)

Total Test time (real) =   5.49 sec

And this gave no warnings:

$ grep "warning" make.out | wc -l
0

(2017/03/21)

I pushed these commits to the Trilinos 'develop' branch with the top commit being:

commit 416fc129b8fed5f8c5083dad272feffc7e059f95
Merge: 9104e93 7d90906
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date:   Tue Mar 21 05:42:29 2017 -0600

    Merge branch 'more-warnings-1133-1140' of github.com:bartlettroscoe/Trilinos into develop

    Build/Test Cases Summary
    Enabled Packages: EpetraExt, Stratimikos, TeuchosComm, TeuchosCore, ThyraCore
    Disabled Packages: PyTrilinos,Claps,TriKota
    Enabled all Forward Packages
    0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=2234,notpassed=0 (100.48 min)
    Other local commits for this build/test group: 7d90906, 8e3a5f45, 48a0a35, 8b6b253, aa8fafa, 148e6b4, 01122bd