trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

ROL_adapters_tpetra_test_vector_SimulatedVectorTpetraBatchManagerInterface build error for Tpetra_INST_INT_INT=OFF in ATDM 'cee-rhel6' builds #5447

Closed bartlettroscoe closed 1 year ago

bartlettroscoe commented 5 years ago

CC: @trilinos/rol, @rppawlo (Trilinos Nonlinear Solvers Product Lead), @bartlettroscoe, @fryeguy52

## Next Action Status

## Description

As shown [here](https://testing-dev.sandia.gov/cdash/index.php?project=Trilinos&date=2019-06-27&filtercount=2&showfilters=1&filtercombine=and&field1=buildname&compare1=61&value1=Trilinos-atdm-cee-rhel6_clang-5.0.1_openmpi-1.10.2_serial_static_opt_no-global-int-exp&field2=subprojects&compare2=93&value2=ROL), the executable:

* `packages/rol/adapters/tpetra/test/vector/CMakeFiles/ROL_adapters_tpetra_test_vector_SimulatedVectorTpetraBatchManagerInterface.dir/test_02.cpp.o`

has a build error when turning off global int instantiation in the build:

* `Trilinos-atdm-cee-rhel6_clang-5.0.1_openmpi-1.10.2_serial_static_opt_no-global-int`.

(NOTE: The motivation for this build is given in [ATDV-174](https://sems-atlassian-srn.sandia.gov/browse/ATDV-174) and #4915.)

It shows the build error:

```
In file included from /scratch/rabartl/Trilinos.base/BUILDS/ATDM/CEE-RHEL6/CTEST_S/Trilinos-atdm-cee-rhel6_clang-5.0.1_openmpi-1.10.2_serial_static_opt_no-global-int/SRC_AND_BUILD/Trilinos/packagesadapters/tpetra/test/vector/test_02.cpp:48:
In file included from /scratch/rabartl/Trilinos.base/BUILDS/ATDM/CEE-RHEL6/CTEST_S/Trilinos-atdm-cee-rhel6_clang-5.0.1_openmpi-1.10.2_serial_static_opt_no-global-int/SRC_AND_BUILD/Trilinos/packagessrc/sol/vector/ROL_SimulatedVector.hpp:44:
In file included from /scratch/rabartl/Trilinos.base/BUILDS/ATDM/CEE-RHEL6/CTEST_S/Trilinos-atdm-cee-rhel6_clang-5.0.1_openmpi-1.10.2_serial_static_opt_no-global-int/SRC_AND_BUILD/Trilinos/packagessrc/vector/ROL_Vector.hpp:54:
In file included from /scratch/rabartl/Trilinos.base/BUILDS/ATDM/CEE-RHEL6/CTEST_S/Trilinos-atdm-cee-rhel6_clang-5.0.1_openmpi-1.10.2_serial_static_opt_no-global-int/SRC_AND_BUILD/Trilinos/packagessrc/elementwise/ROL_Elementwise_Function.hpp:130:
In file included from /scratch/rabartl/Trilinos.base/BUILDS/ATDM/CEE-RHEL6/CTEST_S/Trilinos-atdm-cee-rhel6_clang-5.0.1_openmpi-1.10.2_serial_static_opt_no-global-int/SRC_AND_BUILD/Trilinos/packagessrc/elementwise/ROL_Elementwise_Reduce.hpp:49:
In file included from /scratch/rabartl/Trilinos.base/BUILDS/ATDM/CEE-RHEL6/CTEST_S/Trilinos-atdm-cee-rhel6_clang-5.0.1_openmpi-1.10.2_serial_static_opt_no-global-int/SRC_AND_BUILD/Trilinos/packagessrc/zoo/ROL_Types.hpp:67:
In file included from /scratch/rabartl/Trilinos.base/BUILDS/ATDM/CEE-RHEL6/CTEST_S/Trilinos-atdm-cee-rhel6_clang-5.0.1_openmpi-1.10.2_serial_static_opt_no-global-int/SRC_AND_BUILD/Trilinos/packagessrc/compatibility/teuchos/rcp/ROL_Ptr.hpp:58:
/scratch/rabartl/Trilinos.base/BUILDS/ATDM/CEE-RHEL6/CTEST_S/Trilinos-atdm-cee-rhel6_clang-5.0.1_openmpi-1.10.2_serial_static_opt_no-global-int/SRC_AND_BUILD/Trilinos/packages/teuchos/coreTeuchos_RCP.hpp:288:5: error: cannot initialize a member subobject of type 'const Teuchos::Comm *' with an rvalue of type 'const Teuchos::Comm *'
  : ptr_(r_ptr.get()), // will not compile if T is not base class of T2
    ^    ~~~~~~~~~~~
/scratch/rabartl/Trilinos.base/BUILDS/ATDM/CEE-RHEL6/CTEST_S/Trilinos-atdm-cee-rhel6_clang-5.0.1_openmpi-1.10.2_serial_static_opt_no-global-int/SRC_AND_BUILD/Trilinos/packagessrc/compatibility/teuchos/rcp/ROL_Ptr.hpp:83:30: note: in instantiation of function template specialization 'Teuchos::RCP >::RCP >' requested here
  return Teuchos::rcp( new T(std::forward(args)...) );
                             ^
/scratch/rabartl/Trilinos.base/BUILDS/ATDM/CEE-RHEL6/CTEST_S/Trilinos-atdm-cee-rhel6_clang-5.0.1_openmpi-1.10.2_serial_static_opt_no-global-int/SRC_AND_BUILD/Trilinos/packagesadapters/tpetra/test/vector/test_02.cpp:94:64: note: in instantiation of function template specialization 'ROL::makePtr >, Teuchos::RCP > &>' requested here
  ROL::Ptr > bman = ROL::makePtr>(comm);
```

This results in the single Not Run test:

* [ROL_adapters_tpetra_test_vector_SimulatedVectorTpetraBatchManagerInterface_MPI_4](https://testing-dev.sandia.gov/cdash/testDetails.php?test=70331906&build=4679405)

The reason this build `Trilinos-atdm-cee-rhel6_clang-5.0.1_openmpi-1.10.2_serial_static_opt_no-global-int` shows these extra errors, which are not shown in the build `Trilinos-atdm-sems-rhel6-gnu-7.2.0-openmp-release-debug-no-global-int` (which we have cleaned up), is that the 'cee-rhel6' builds enable extra packages and TPLs used by SPARC.

## Current Status on CDash

The status of these tests/builds for the current testing day can be found at:

* [ROL in Trilinos-atdm-cee-rhel6_clang-5.0.1_openmpi-1.10.2_serial_static_opt_no-global-int over last 5 days](https://testing-dev.sandia.gov/cdash/index.php?project=Trilinos&date=2019-06-27&filtercount=3&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-cee-rhel6_clang-5.0.1_openmpi-1.10.2_serial_static_opt_no-global-int&field2=subprojects&compare2=93&value2=ROL&field3=buildstarttime&compare3=83&value3=5%20days%20ago)

## Steps to Reproduce

One should be able to reproduce this failure on a CEE RHEL6 machine as described in:

* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md

More specifically, the commands given for the system 'cee-rhel6' are provided at:

* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#cee-rhel6-environment

The exact commands to reproduce this build error should be:

```
$ cd /

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh \
  Trilinos-atdm-cee-rhel6_clang-5.0.1_openmpi-1.10.2_serial_static_opt_no-global-int

$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_ROL=ON \
  $TRILINOS_DIR

$ ninja -j16
```

bartlettroscoe commented 5 years ago

CC: @mhoemmen

@trilinos/rol, from the build error, it looks like this might be pretty easy to fix. Do you want to fix this or should we just disable the building and running of this test in the ATDM Trilinos builds?

Please let me know what you want to do about this.

FYI: Once we get this build cleaned up, all of the ATDM Trilinos builds will have Tpetra_INST_INT_INT=OFF going forward. (In fact, Tpetra will not even support instantiating both long long int and int global ordinals in the future.)

mhoemmen commented 5 years ago

@bartlettroscoe This looks like a ROL bug, no? It appears that ROL was incorrectly using GlobalOrdinal as the template argument of Teuchos::Comm.
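
The diagnosis above can be illustrated with a minimal sketch. This is not ROL or Teuchos code: `CommSketch`, `SerialCommSketch`, and `BatchManagerSketch` are hypothetical stand-ins, and `std::shared_ptr` stands in for `Teuchos::RCP`. It assumes the global ordinal type becomes `long long` when `Tpetra_INST_INT_INT=OFF`, while the communicator passed in is still a `Comm<int>`.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>

// Hypothetical stand-in for Teuchos::Comm<Ordinal>; not the real class.
template <class Ordinal>
struct CommSketch {
  virtual ~CommSketch() = default;
  virtual int getRank() const = 0;
};

// Hypothetical serial communicator, derived from the int instantiation only.
struct SerialCommSketch : CommSketch<int> {
  int getRank() const override { return 0; }
};

// Hypothetical batch manager that lets Ordinal pick the communicator type,
// which is the pattern the comment above identifies as the bug.
template <class Real, class Ordinal>
class BatchManagerSketch {
 public:
  explicit BatchManagerSketch(
      const std::shared_ptr<const CommSketch<Ordinal>>& comm)
      : comm_(comm) {}
  int batchID() const { return comm_->getRank(); }

 private:
  std::shared_ptr<const CommSketch<Ordinal>> comm_;
};

using IntCommPtr = std::shared_ptr<const CommSketch<int>>;
using LongCommPtr = std::shared_ptr<const CommSketch<long long>>;

// Two instantiations of the same template are unrelated types, so a
// pointer to one cannot be converted to a pointer to the other.
static_assert(!std::is_convertible<IntCommPtr, LongCommPtr>::value,
              "Comm<int> pointers do not convert to Comm<long long> pointers");

// Ordinal = int matches the communicator and compiles.
static_assert(
    std::is_constructible<BatchManagerSketch<double, int>, IntCommPtr>::value,
    "Ordinal = int accepts a Comm<int>");

// Ordinal = long long cannot accept the same communicator.
static_assert(!std::is_constructible<BatchManagerSketch<double, long long>,
                                     IntCommPtr>::value,
              "Ordinal = long long rejects a Comm<int>");

// Small runtime check of the working (int) case.
inline int demoBatchID() {
  IntCommPtr comm = std::make_shared<SerialCommSketch>();
  assert(comm != nullptr);
  BatchManagerSketch<double, int> bman(comm);
  return bman.batchID();
}
```

With `std::shared_ptr` the mismatch is rejected during overload resolution, which is why it can be checked with `static_assert` here. In the log above, the `Teuchos::RCP` converting constructor appears to fail inside its own member initializer instead, which would explain why the error is reported in `Teuchos_RCP.hpp`.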

bartlettroscoe commented 5 years ago

No response in 12 days so we will disable the build and running of this test going forward in all ATDM Trilinos builds.

dridzal commented 5 years ago

@bartlettroscoe I'm working on this and a few other bug fixes right now. Two PRs are coming, of which the second will address this issue.

bartlettroscoe commented 5 years ago

@dridzal said:

> I'm working on this and a few other bug fixes right now. Two PRs are coming, of which the second will address this issue.

That is fine. The PR posted later can re-enable this test.

dridzal commented 5 years ago

Is there a summary of all disabled tests in special builds (ATDM and other) over the last few years? It would be good to start re-enabling some of them.

bartlettroscoe commented 5 years ago

@dridzal asked:

> Is there a summary of all disabled tests in special builds (ATDM and other) over the last few years? It would be good to start re-enabling some of them.

https://snl-wiki.sandia.gov/display/CoodinatedDevOpsATDM/ATDM+Builds+of+Trilinos#ATDMBuildsofTrilinos-DeterminingListofTeststhatareCurrentlyDisabled

Also, all of the GitHub issues are kept open with the Disabled Tests label set. So for ROL:

Not sure what more we can do. We can tolerate tests failing at runtime for years at a time, but we can't tolerate build errors (tolerating those would require much more complex methods for analyzing CDash results).

bartlettroscoe commented 5 years ago

As shown in the build Trilinos-atdm-cee-rhel6_clang-5.0.1_openmpi-1.10.2_serial_static_opt_no-global-int today, there are no more build errors and this test is no longer listed as Not Run.

Adding the label "Disabled Tests" and leaving open as per policy.

dridzal commented 4 years ago

@mhoemmen , what's the reasoning behind hard-typing Teuchos::Comm to int? Job security for year 2030? Joking aside, it seems like another user-specified type would be necessary for full flexibility.

mhoemmen commented 4 years ago

@dridzal Neither Ross nor myself wrote those classes. One of Mike Heroux's students / postdocs wrote them circa 2004. Code has a way of sticking around.

Search Trilinos GitHub issues for an explanation of the likely intent. The MPI standard is evolving to support 64-bit message counts, but ranks will likely remain int. This makes Comm's template parameter not useful.

dridzal commented 4 years ago

> but ranks will likely remain int

@mhoemmen , thanks for the explanation; just wanted to make sure that this is the consensus, and that we won't have to change the interface anytime soon. My next PR will address the issue.

mhoemmen commented 4 years ago

@dridzal wrote:

> ... just wanted to make sure that this is the consensus, ...

That's up to the MPI Committee. It's likely that message sizes won't be coupled to the maximum number of process ranks.

> ... and that we won't have to change the interface anytime soon.

Teuchos has no owner. It is common property of all its customers, including Trilinos, ROL, Dakota, and other applications and libraries. I don't own Teuchos::Comm any more than any other developer.

Some Trilinos developers have proposed getting rid of Comm's template parameter. That has advantages, but would break backwards compatibility. ROL, as one of the many customers of Teuchos, can express its views about that.

dridzal commented 4 years ago

I would be in favor of removing the template parameter, assuming we can assess and control the impact of this change on dependent code. ROL's use of Teuchos::Comm is frequent yet basic and transparent, and so any potential changes would be quick. Other codes may not have this flexibility.

dridzal commented 4 years ago

@dpkouri , it turns out that this issue is not in the test mentioned above, but in the source code. Both TeuchosBatchManager and TpetraTeuchosBatchManager use an Ordinal template parameter for Teuchos::Comm. As it turns out, only int is supported, so we have to use Teuchos::Comm<int>. The question is whether Ordinal is used as a template parameter for other objects; if it is not, we should probably consider removing it. However, I suspect that it is used beyond Teuchos::Comm in these classes. Please advise.
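
One possible shape of the change described above, as a minimal sketch only: the communicator type is fixed to the `int` instantiation while `Ordinal` stays available for anything else that needs it. `CommSketch`, `FakeCommSketch`, `FixedBatchManagerSketch`, and the `numBatches()` member are hypothetical names used for illustration, and `std::shared_ptr` again stands in for `Teuchos::RCP`; the actual ROL classes may differ.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>

// Hypothetical stand-in for Teuchos::Comm<Ordinal>; not the real class.
template <class Ordinal>
struct CommSketch {
  virtual ~CommSketch() = default;
  virtual int getSize() const = 0;
};

// Hypothetical communicator that reports a fixed number of processes.
struct FakeCommSketch : CommSketch<int> {
  explicit FakeCommSketch(int size) : size_(size) {}
  int getSize() const override { return size_; }

 private:
  int size_;
};

// Sketch of the fix: Ordinal remains a template parameter, but the
// communicator is always CommSketch<int>, regardless of Ordinal.
template <class Real, class Ordinal>
class FixedBatchManagerSketch {
 public:
  explicit FixedBatchManagerSketch(
      const std::shared_ptr<const CommSketch<int>>& comm)
      : comm_(comm) {}

  // Ordinal is still used for quantities other than the communicator type.
  Ordinal numBatches() const { return static_cast<Ordinal>(comm_->getSize()); }

 private:
  std::shared_ptr<const CommSketch<int>> comm_;
};

using IntCommPtr = std::shared_ptr<const CommSketch<int>>;

// Both ordinal choices now accept the same Comm<int> pointer.
static_assert(std::is_constructible<FixedBatchManagerSketch<double, int>,
                                    IntCommPtr>::value,
              "Ordinal = int accepts a Comm<int>");
static_assert(std::is_constructible<FixedBatchManagerSketch<double, long long>,
                                    IntCommPtr>::value,
              "Ordinal = long long accepts a Comm<int>");

// Small runtime check with a 64-bit ordinal and a four-process fake comm.
inline long long demoNumBatches() {
  IntCommPtr comm = std::make_shared<FakeCommSketch>(4);
  assert(comm != nullptr);
  FixedBatchManagerSketch<double, long long> bman(comm);
  return bman.numBatches();
}
```

If `Ordinal` turns out to be used only for the communicator, the same sketch works with the parameter removed entirely, which is the alternative raised in the comment above.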

dridzal commented 4 years ago

@dpkouri , and as it turns out, there are several other instances related to BatchManager. Just run the command

grep "Teuchos::Comm<" * -R | grep -v int

in the rol directory. This returns all use cases of Teuchos::Comm that do not include int on the line.

dridzal commented 4 years ago

@mhoemmen @bartlettroscoe : if you run

grep "Teuchos::Comm<" * -R | grep -v int

(see my previous comment) in the Trilinos/packages directory, you'll see a bunch of uses of Teuchos::Comm with non-int template parameters. Having said that, it is possible or even likely that some of those parameters always end up an int, but it may be worth checking with package developers. Note that the above command may only return a partial list.

mhoemmen commented 4 years ago

@dridzal I'm pretty sure that all uses of Comm<T> with T != int are in Thyra or depend on Thyra. Tpetra and packages that depend on it exclusively use Comm<int>. Thus, if you want to harmonize, you could start with Thyra and its dependencies.

github-actions[bot] commented 2 years ago

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity. If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label. If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE. If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

mhoemmen commented 2 years ago

begone autobot

github-actions[bot] commented 1 year ago

This issue was closed due to inactivity for 395 days.