idaholab / moose

Multiphysics Object Oriented Simulation Environment
https://www.mooseframework.org

DTK Transfer tests failing in parallel #5995

Closed: permcody closed this issue 8 years ago

permcody commented 8 years ago

@rppawlo - We started running our DTK tests in parallel and the interpolation transfers are failing (seg fault). I just fired one of them up and the failure is way down in the Trilinos code base. The DTK code at the MOOSE level is fairly straightforward, so I wanted to touch base before we try to dig into this. All of our other MultiApp and Transfer tests pass in parallel, including the other DTK tests.

Anyway, maybe another set of eyes can help out here.

Here's a stack trace from one of our tests run on two processors: https://github.com/permcody/moose/blob/devel/test/tests/transfers/multiapp_dtk_interpolation_transfer/master.i

(lldb) bt
* thread #1: tid = 0xb93566, 0x0000000108ae1446 libmoose-oprof.0.dylib`DataTransferKit::SharedDomainMap<DataTransferKit::MeshContainer<unsigned long>, DataTransferKit::MeshContainer<unsigned long> >::setup(Teuchos::RCP<DataTransferKit::MeshManager<DataTransferKit::MeshContainer<unsigned long> > > const&, Teuchos::RCP<DataTransferKit::FieldManager<DataTransferKit::MeshContainer<unsigned long> > > const&, double) [inlined] Teuchos::ArrayRCP<Teuchos::RCP<DataTransferKit::MeshContainer<unsigned long> > >::size() const at Teuchos_ArrayRCP.hpp:732, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x20)
  * frame #0: 0x0000000108ae1446 libmoose-oprof.0.dylib`DataTransferKit::SharedDomainMap<DataTransferKit::MeshContainer<unsigned long>, DataTransferKit::MeshContainer<unsigned long> >::setup(Teuchos::RCP<DataTransferKit::MeshManager<DataTransferKit::MeshContainer<unsigned long> > > const&, Teuchos::RCP<DataTransferKit::FieldManager<DataTransferKit::MeshContainer<unsigned long> > > const&, double) [inlined] Teuchos::ArrayRCP<Teuchos::RCP<DataTransferKit::MeshContainer<unsigned long> > >::size() const at Teuchos_ArrayRCP.hpp:732
    frame #1: 0x0000000108ae1446 libmoose-oprof.0.dylib`DataTransferKit::SharedDomainMap<DataTransferKit::MeshContainer<unsigned long>, DataTransferKit::MeshContainer<unsigned long> >::setup(Teuchos::RCP<DataTransferKit::MeshManager<DataTransferKit::MeshContainer<unsigned long> > > const&, Teuchos::RCP<DataTransferKit::FieldManager<DataTransferKit::MeshContainer<unsigned long> > > const&, double) [inlined] DataTransferKit::MeshManager<DataTransferKit::MeshContainer<unsigned long> >::getNumBlocks(this=0x0000000000000000) const at DTK_MeshManager.hpp:111
    frame #2: 0x0000000108ae1446 libmoose-oprof.0.dylib`DataTransferKit::SharedDomainMap<DataTransferKit::MeshContainer<unsigned long>, DataTransferKit::MeshContainer<unsigned long> >::setup(Teuchos::RCP<DataTransferKit::MeshManager<DataTransferKit::MeshContainer<unsigned long> > > const&, Teuchos::RCP<DataTransferKit::FieldManager<DataTransferKit::MeshContainer<unsigned long> > > const&, double) [inlined] DataTransferKit::ClassicMesh<DataTransferKit::MeshContainer<unsigned long> >::getNumBlocks(this=0x00007f99faa00c00) const + 4 at DTK_ClassicMesh.hpp:83
    frame #3: 0x0000000108ae1442 libmoose-oprof.0.dylib`DataTransferKit::SharedDomainMap<DataTransferKit::MeshContainer<unsigned long>, DataTransferKit::MeshContainer<unsigned long> >::setup(this=0x00007f99faa00a20, source_mesh_manager=<unavailable>, target_coord_manager=0x00007fff57906060, tolerance=<unavailable>) + 1650 at DTK_SharedDomainMap_def.hpp:164
    frame #4: 0x0000000108adf652 libmoose-oprof.0.dylib`libMesh::DTKInterpolationHelper::transferWithOffset(this=<unavailable>, from=0, to=0, from_var=0x0000000000000000, to_var=0x00007f99f8db1660, from_offset=0x00007fff57906518, to_offset=<unavailable>, from_mpi_comm=<unavailable>, to_mpi_comm=<unavailable>) + 3842 at DTKInterpolationHelper.C:142
    frame #5: 0x0000000108b4a342 libmoose-oprof.0.dylib`MultiAppDTKInterpolationTransfer::execute(this=0x00007f99f8dab5c0) + 402 at MultiAppDTKInterpolationTransfer.C:74
    frame #6: 0x00000001087864d1 libmoose-oprof.0.dylib`FEProblem::execMultiApps(this=0x00007f99fa01ce00, type=<unavailable>, auto_advance=true) + 1393 at FEProblem.C:2809
    frame #7: 0x00000001088c9edb libmoose-oprof.0.dylib`Transient::solveStep(this=0x00007f99f8dae610, input_dt=<unavailable>) + 187 at Transient.C:378
    frame #8: 0x00000001088c9cfb libmoose-oprof.0.dylib`Transient::takeStep(this=0x00007f99f8dae610, input_dt=-1) + 411 at Transient.C:330
    frame #9: 0x00000001088c99b9 libmoose-oprof.0.dylib`Transient::execute(this=0x00007f99f8dae610) + 89 at Transient.C:250
    frame #10: 0x00000001082fa502 moose_test-oprof`main(argc=<unavailable>, argv=<unavailable>) + 114 at main.C:34
    frame #11: 0x00007fff941445ad libdyld.dylib`start + 1
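
For reference, one way to reproduce this outside the test harness is to run the failing input directly on two ranks. This is a minimal sketch assuming an oprof-method build of the moose_test executable and the standard test-tree layout; the binary name and relative paths are illustrative:

# Reproduce the parallel seg fault directly (binary name and paths are assumptions)
cd test/tests/transfers/multiapp_dtk_interpolation_transfer
mpiexec -n 2 ../../../moose_test-oprof -i master.i
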
rppawlo commented 8 years ago

@permcody I just built the latest github version of moose and the tests all passed (see below). Can you tell me what versions of Trilinos and DTK you are using? Also, did you enable c++11 support?

transfers/multiapp_dtk_userobject_transfer.check_error.................................. skipped (DTK!=False)
transfers/multiapp_dtk_interpolation_transfer.test........................................................ OK
transfers/multiapp_dtk_interpolation_transfer.tosub....................................................... OK
transfers/multiapp_dtk_interpolation_transfer.multilevel.................................................. OK
transfers/multiapp_dtk_userobject_transfer.test........................................................... OK
transfers/3d_to_2d_dtk_interpolation.test................................................................. OK

rppawlo commented 8 years ago

@permcody I set up a new build with votd (version-of-the-day) Trilinos, DTK, and MOOSE, all the very latest from the github repos, and all tests passed, including the dtk ones. I think this is a Trilinos/DTK versioning and/or configuration issue. Attached is my Trilinos configure script for MOOSE builds. I also had to change update_and_rebuild_libmesh.sh slightly to enable c++11; the diff is below.


Ran 1051 tests in 257.6 seconds
1051 passed, 38 skipped, 0 pending, 0 failed

[rppawlo@gge moose]$ git diff
diff --git a/scripts/update_and_rebuild_libmesh.sh b/scripts/update_and_rebuild_libmesh.sh
index f3f14fa..cddf785 100755
--- a/scripts/update_and_rebuild_libmesh.sh
+++ b/scripts/update_and_rebuild_libmesh.sh
@@ -54,9 +54,18 @@ cd build
              --enable-silent-rules \
              --enable-unique-id \
              --disable-warnings \
-             --disable-cxx11 \
+             --enable-cxx11 \
              --enable-unique-ptr \
              --enable-openmp \
+             --enable-trilinos \
+             --enable-static \
+             --disable-pthread \
+             --disable-shared \
+             LDFLAGS="-L/home/rppawlo/tpls/gcc/5.2.0/lib64" \
+             --with-trilinos=$ROGER_TRILINOS_BASE_PATH \
+             CXX=$ROGER_MPICH_BASE_PATH/bin/mpicxx \
+             CC=$ROGER_MPICH_BASE_PATH/bin/mpicc \
              $DISABLE_TIMESTAMPS $*

 # let LIBMESH_JOBS be either MOOSE_JOBS, or 1 if MOOSE_JOBS

build_trilinos_for_libmesh.txt
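
To make the diff concrete, here is a minimal sketch of how the modified script might be invoked; the ROGER_* variables are the ones referenced in the diff, and the install paths shown are purely illustrative:

# Point the libmesh rebuild at an existing Trilinos/DTK install and MPICH build (paths are illustrative)
export ROGER_TRILINOS_BASE_PATH=/home/rppawlo/tpls/trilinos/install
export ROGER_MPICH_BASE_PATH=/home/rppawlo/tpls/mpich/install
MOOSE_JOBS=8 ./scripts/update_and_rebuild_libmesh.sh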

permcody commented 8 years ago

Thanks @rppawlo. I was out all last week and am on travel this week. If this is working for you, I'll double check to make sure I haven't fouled anything up in the configuration. It might be another week...

permcody commented 8 years ago

@rppawlo - I just got back into the office today after being out most of the last two weeks. Did you run these tests in parallel? They always pass in serial for us, just not in parallel:

./run_tests -p 2
rppawlo commented 8 years ago

They do fail in parallel with a seg fault (I just assumed the test suite ran in parallel by default - sorry). This was against the current github repo version of moose. If you want, I can also check the current version of moose in the bison-casl repo that the casl codes use. Our acceptance tests for the CASL codes are passing, and they always run in parallel. Not sure what is happening here.

permcody commented 8 years ago

We'll add a parallel target too. MOOSE has one, but I guess this configuration doesn't. Thanks.

rppawlo commented 8 years ago

CASL maintains its own git forks of trilinos and dtk, so they often lag the corresponding github repos. The current commit hashes for Trilinos and DTK used by the casl parallel builds are: Trilinos 2e4cf4ba7e, DTK e6e920fa342d.

I'm going to back up to these commits and see if the moose dtk tests pass at that point. The CASL repos also have some local changes not pushed back into the github repos, so I may also test directly against the casl repos to make sure it isn't a problem there.
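
For reference, a minimal sketch of pinning local clones to those commits before rebuilding; the clone locations, and DTK living inside the Trilinos source tree, are assumptions about this particular setup:

# Check out the commits used by the casl parallel builds (clone paths are illustrative)
cd ~/src/Trilinos && git checkout 2e4cf4ba7e
cd ~/src/Trilinos/DataTransferKit && git checkout e6e920fa342d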

rppawlo commented 8 years ago

Went back to current working casl versions of trilinos/dtk and we get the following:

restart/pointer_restart_errors.pointer_load_error2.................................. FAILED (NO EXPECTED ERR)
transfers/multiapp_dtk_interpolation_transfer.test............................................ FAILED (CRASH)
transfers/multiapp_dtk_interpolation_transfer.multilevel...................................... FAILED (CRASH)

So the failure reproduces with the casl versions as well; it seems the casl use cases just don't exercise the code path that triggers it. I will try to get more info on the seg fault.

rppawlo commented 8 years ago

@permcody I think we have identified the problem. The moose test uses a DTK map that is not used or tested by casl, so it was not updated for the DTK refactor. Since the old maps are deprecated, we have only been updating them for backwards compatibility when requested. The SharedVolumeMap used in the unit test fell into this category. Stuart is working on a push to DTK today; once it is in, I will test and send an update.

permcody commented 8 years ago

OK - sounds good

rppawlo commented 8 years ago

Hi Cody - I was on travel last week and am just getting back to this. We hope to issue a pull request soon.

permcody commented 8 years ago

Closing all the DTK tickets in favor of #7123. We have several new plans in the works for DTK.