trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

Tpetra: CrsGraph not enough capacity to insert error #6345

Closed stanmoore1 closed 4 years ago

stanmoore1 commented 4 years ago

Bug Report

@trilinos/tpetra

Description

Running EMPIRE on Astra with Trilinos from 11-22-2019, I get an error Tpetra:CrsGraph not enough capacity to insert .... This has the same signature as the so-called "UVM" bug on Sierra. Using an older version of EMPIRE and Trilinos from 11-11-2019 does not fail.

@bathmatt can we get this labeled as a super critical L1 milestone blocker?

rppawlo commented 4 years ago

@kddevin @trilinos/tpetra

stanmoore1 commented 4 years ago

I reproduced this on Stria and got a different error:

*********** Caught Exception: Begin Error Report ***********
../packages/tpetra/core/src/Tpetra_DirectoryImpl_def.hpp:1175:

Throw number = 2

Throw test that evaluated to true: importLen < requiredImportLen

Tpetra::Details::DistributedNoncontiguousDirectory::getEntriesImpl: On Process 15: The 'imports' array must have length at least 1273674, but its actual length is 1202616.  numRecv: 636837, packetSize: 2, numEntries (# GIDs): 636837, numMissing: 0: distor.getTotalReceiveLength(): 601308.
Distributor description: "Tpetra::Distributor": {How initialized: By createFromRecvs, Parameters: {Send type: Send, Barrier between receives and sends: false, Use distinct tags: true, Debug: false}}.
Please report this bug to the Tpetra developers.
************ Caught Exception: End Error Report ***
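The numbers in this report are consistent with a simple buffer-size check. The following is a minimal sketch, assuming the required length is numRecv times packetSize and that the actual 'imports' buffer was sized from distor.getTotalReceiveLength() times packetSize. The helper names are hypothetical and are not Tpetra functions; only the values come from the message above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: minimum length of the 'imports' buffer, assuming each
// received entry occupies packetSize slots.
constexpr std::size_t requiredImportLen(std::size_t numRecv, std::size_t packetSize) {
  return numRecv * packetSize;
}

// Mirrors the throw test from the report (importLen < requiredImportLen):
// true means the buffer is too short and the exception would be thrown.
constexpr bool importsTooShort(std::size_t importLen, std::size_t numRecv,
                               std::size_t packetSize) {
  return importLen < requiredImportLen(numRecv, packetSize);
}

// Values copied from the error report on Process 15.
static_assert(requiredImportLen(636837, 2) == 1273674, "required length in the report");
static_assert(requiredImportLen(601308, 2) == 1202616, "actual length in the report");
static_assert(importsTooShort(1202616, 636837, 2), "the throw test evaluates to true");
```

Both lengths use packetSize = 2, so under these assumptions the mismatch is between numRecv (636837) and the Distributor's total receive length (601308), a difference of 35529 entries.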
stanmoore1 commented 4 years ago

I ran it again on Stria and got yet another different error:

*********** Caught Exception: Begin Error Report ***********
../packages/tpetra/core/src/Tpetra_DirectoryImpl_def.hpp:878:

Throw number = 2

Throw test that evaluated to true: dirMapLid == LINVALID

Tpetra::Details::DistributedNoncontiguousDirectory<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> > constructor: Incoming global index 15295108 does not have a corresponding local index in the Directory Map.  Please report this bug to the Tpetra developers.
************ Caught Exception: End Error Report ************
*********** Caught Exception: Begin Error Report ***********
../packages/tpetra/core/src/Tpetra_DirectoryImpl_def.hpp:878:
mhoemmen commented 4 years ago

@stanmoore1 Could you please retry with the TPETRA_DEBUG environment variable set to 1? That should introduce more barriers and all-reduces that might help with diagnosing the issue.

stanmoore1 commented 4 years ago

I'm swamped with trying to get results for EMPIRE. Can a Tpetra developer help with this? @bathmatt and I can help someone get running on Stria.

stanmoore1 commented 4 years ago

This seems to run fine on GPUs with @crtrott's serial scan hack.

stanmoore1 commented 4 years ago

Possibly related to #6237. I'm thoroughly convinced there is a memory bug somewhere.

csiefer2 commented 4 years ago

Send me your build goodies on stria.

kddevin commented 4 years ago

In addition to @csiefer2's agreement to build on Stria, @kddevin will try the following:

For the very short term, @kddevin will write a serial scan and make it a compile-time option. If that fixes the problem, it will be very interesting.
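A compile-time switch of this kind could look roughly like the sketch below. It is only an illustration: the thread does not say which scan is affected, the macro name EXAMPLE_USE_SERIAL_SCAN and the function name are hypothetical, and std::partial_sum stands in for the parallel scan so that the example compiles without Kokkos.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Exclusive prefix sum of per-row counts:
// offsets[i] = counts[0] + ... + counts[i-1], with offsets.size() == counts.size() + 1.
std::vector<std::size_t> offsetsFromCounts(const std::vector<std::size_t>& counts) {
  std::vector<std::size_t> offsets(counts.size() + 1, 0);
#ifdef EXAMPLE_USE_SERIAL_SCAN  // hypothetical compile-time option
  // Plain serial loop: no threading, so no dependence on the parallel scan.
  for (std::size_t i = 0; i < counts.size(); ++i) {
    offsets[i + 1] = offsets[i] + counts[i];
  }
#else
  // Stand-in for the default (parallel) scan in the real code.
  std::partial_sum(counts.begin(), counts.end(), offsets.begin() + 1);
#endif
  return offsets;
}
```

Both branches must produce identical offsets, so if a failure disappears when the serial branch is compiled in, the parallel scan (or the memory it touches) becomes the prime suspect.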

stanmoore1 commented 4 years ago

From @bathmatt:

Debug build:

[stria-login2:74972:0:74972] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x11006050)
==== backtrace ====
    0  /opt/atse/libs/arm/openucx/1.5.2/lib/libucs.so.0(+0x1e5e4) [0xfffef8d3e5e4]
    1  /opt/atse/libs/arm/openucx/1.5.2/lib/libucs.so.0(+0x1d9cc) [0xfffef8d3d9cc]
    2  [0xffffacad066c]
    3  Trilinos/install-debug/lib/libpercept.so.12(_ZNK3stk4mesh8MetaData8get_partEj+0x28) [0xffff9b20cc68]
    4  Trilinos/install-debug/lib/libstk_mesh_base.so.12(_ZN3stk4mesh18unpack_entity_infoERNS_10CommBufferERKNS0_8BulkDataERNS0_9EntityKeyERiRSt6vectorIPNS0_4PartESaISB_EERS9_INS0_8RelationESaISF_EE+0xe0) [0xffff999f0f60]
    5  Trilinos/install-debug/lib/libstk_mesh_base.so.12(_ZN3stk4mesh8BulkData23unpack_not_owned_verifyERNS_10CommSparseERSo+0x328) [0xffff99795ae8]
    6  Trilinos/install-debug/lib/libstk_mesh_base.so.12(_ZN3stk4mesh8BulkData37comm_mesh_verify_parallel_consistencyERSo+0x100) [0xffff9978eac0]
    7  Trilinos/install-debug/lib/libstk_mesh_base.so.12(_ZN3stk4mesh8BulkData22check_mesh_consistencyEv+0x134) [0xffff9978d4b4]
    8  Trilinos/install-debug/lib/libstk_mesh_base.so.12(_ZN3stk4mesh8BulkData45internal_modification_end_for_entity_creationERKSt6vectorINS_8topology6rank_tESaIS4_EENS0_4impl16MeshModification25modification_optimizationE+0x2e8) [0xffff9978d1e8]
    9  Trilinos/install-debug/lib/libstk_mesh_base.so.12(_ZN3stk4mesh8BulkData36modification_end_for_entity_creationERKSt6vectorINS_8topology6rank_tESaIS4_EENS0_4impl16MeshModification25modification_optimizationE+0x3c) [0xffff9978cebc]
   10  Trilinos/install-debug/lib/libstk_mesh_base.so.12(_ZN3stk4mesh12create_edgesERNS0_8BulkDataERKNS0_8SelectorEPNS0_4PartE+0xb4c) [0xffff9993228c]
   11  Trilinos/install-debug/lib/libstk_mesh_base.so.12(_ZN3stk4mesh12create_edgesERNS0_8BulkDataE+0x54) [0xffff999316d4]
   12  Trilinos/install-debug/lib/libstk_mesh_base.so.12(_ZN3stk4mesh24create_adjacent_entitiesERNS0_8BulkDataERSt6vectorIPNS0_4PartESaIS5_EE+0x19c) [0xffff99930e5c]
   13  Trilinos/install-debug/lib/libpanzer-stk.so.12(_ZN10panzer_stk13STK_Interface13buildSubcellsEv+0x4c) [0xffffa498cdcc]
   14  Trilinos/install-debug/lib/libpanzer-stk.so.12(_ZNK10panzer_stk23STK_ExodusReaderFactory24completeMeshConstructionERNS_13STK_InterfaceEP19ompi_communicator_t+0x834) [0xffffa4972cb4]
===================
csiefer2 commented 4 years ago

I can reproduce @bathmatt's error, though that is an issue with STK and not Tpetra.

stanmoore1 commented 4 years ago

Just to clarify, on Stria if we use a release build it fails in Tpetra, if we use a debug build it fails in STK. On Astra with release build it fails in Tpetra but in a different spot.

mhoemmen commented 4 years ago

@stanmoore1 Does Karen's "serial scan" patch make this pass on Stria or Astra?

stanmoore1 commented 4 years ago

@csiefer2 can you try that out?

stanmoore1 commented 4 years ago

I am seeing crashes on GPUs for a different benchmark. I'm rolling back the Trilinos version to see if that helps.

csiefer2 commented 4 years ago

> @stanmoore1 Does Karen's "serial scan" patch make this pass on Stria or Astra?

> @csiefer2 can you try that out?

On Stria, no (the code crashes in a different way), but I will try Astra.

stanmoore1 commented 4 years ago

I'm seeing crashes on GPUs even with serial scan hack (the first benchmark I ran was fine, but a second benchmark crashes), so I would be surprised if it fixes the issue on Astra.

csiefer2 commented 4 years ago

Can't reproduce that error, but both patched and unpatched code hang on 2 nodes of Astra. This hang appears to be related to #6374.

If I turn off multijagged, the unpatched code runs to completion. Ditto for 8 node problem.

If we disable multijagged, @stanmoore1 can reproduce the behavior I see with my binary. This suggests that either (a) code changes or (b) build options are the big difference.

I can get good behavior out of my rebuild of @stanmoore1's version of Trilinos. The CMakeCache.txt files for Trilinos seem to be similar between the two. The app builds differ primarily in terms of non-Trilinos enabled TPLs.

I can now reproduce this error, but only using Stan's build of Trilinos, not using my build of Stan's source. I suspect the copy of Trilinos I grabbed is not actually the one @stanmoore1 is using.

The 11/11 code outputs a warning from STK saying something about internal sidesets and 'correctness is not guaranteed.' As per Stan, this is SOP for the app.

Verified: 11/22 works and 11/11 does not (deprecated code is off in both cases).

kddevin commented 4 years ago

JFYI: With #6377, Issue #6374 is fixed.

rppawlo commented 4 years ago

> JFYI: With #6377, Issue #6374 is fixed.

Thanks @kddevin !

stanmoore1 commented 4 years ago

@csiefer2 helped me realize that on Astra I had an issue with the build paths, which was causing a bug already fixed in Trilinos to be exposed. When I fixed that issue, the test no longer crashed on Astra. However, I built the latest Trilinos and EMPIRE on Stria and it still crashes (and @bathmatt and @csiefer2 were also able to reproduce the crash on Stria). So I'm not sure why it crashes on Stria but not on Astra.

stanmoore1 commented 4 years ago

Here is the stack trace for the crash I'm seeing on Stria:

[1575915894.295387] [st198:130874:0]      uct_iface.c:57   UCX  WARN  got active message id 0, but no handler installed
[1575915894.295408] [st198:130874:0]      eager_rcv.c:235  UCX  ERROR unexpected sync ack received: tag 30000 ep_ptr 0x8030000000006
[1575915894.295181] [st196:51926:0]      uct_iface.c:57   UCX  WARN  got active message id 0, but no handler installed
[1575915894.295250] [st196:51926:0]      uct_iface.c:57   UCX  WARN  got active message id 0, but no handler installed
[1575915894.295462] [st198:130874:0]      uct_iface.c:57   UCX  WARN  got active message id 13, but no handler installed
[st196:51926:0:51926] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x203000000007c)
==== backtrace ====
[1575915894.295601] [st198:130873:0]      uct_iface.c:57   UCX  WARN  got active message id 0, but no handler installed
    0  /opt/atse/libs/arm/openucx/1.5.2/lib/libucs.so.0(+0x1e5e4) [0x40005cfce5e4]
    1  /opt/atse/libs/arm/openucx/1.5.2/lib/libucs.so.0(+0x1d9cc) [0x40005cfcd9cc]
    2  [0x40000005066c]
    3  /opt/atse/libs/arm/openucx/1.5.2/lib/libucp.so.0(ucp_rndv_atp_handler+0x10) [0x40005ce08ad0]
    4  /opt/atse/libs/arm/openucx/1.5.2/lib/libuct.so.0(+0x36c88) [0x40005ce86c88]
    5  /opt/atse/libs/arm/openucx/1.5.2/lib/libucp.so.0(ucp_worker_progress+0x48) [0x40005cdfa988]
    6  /opt/atse/mpi/openmpi3-arm/3.1.4/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_progress+0x18) [0x40005cd751d8]
    7  /opt/atse/mpi/openmpi3-arm/3.1.4/lib/libopen-pal.so.40(opal_progress+0x4c) [0x40001a74c78c]
    8  /opt/atse/mpi/openmpi3-arm/3.1.4/lib/libmpi.so.40(ompi_request_default_wait_all+0x644) [0x4000139b3484]
    9  /opt/atse/mpi/openmpi3-arm/3.1.4/lib/openmpi/mca_coll_basic.so(mca_coll_basic_neighbor_alltoallv+0x888) [0x40005d137588]
   10  /opt/atse/mpi/openmpi3-arm/3.1.4/lib/libmpi.so.40(MPI_Neighbor_alltoallv+0x230) [0x4000139ee1b0]
   11  lib/libstk_util_parallel.so.12(_ZN3stk13CommNeighbors30perform_neighbor_communicationEP19ompi_communicator_tRKSt6vectorIhSaIhEERKS3_IiSaIiEESB_RS5_RS9_SD_+0x210) [0x40000bf794d0]
   12  lib/libstk_util_parallel.so.12(_ZN3stk13CommNeighbors11communicateEv+0x174) [0x40000bf79674]
   13  lib/libstk_mesh_base.so.12(_ZN3stk4mesh22communicate_field_dataERKNS0_8GhostingERKSt6vectorIPKNS0_9FieldBaseESaIS7_EE+0x10f0) [0x40000bc55b30]
   14  lib/libstk_io.so.12(_ZN3stk2io15StkMeshIoBroker18populate_bulk_dataEv+0x18c) [0x40000ba4488c]
   15  lib/libpanzer-stk.so.12(_ZNK10panzer_stk23STK_ExodusReaderFactory24completeMeshConstructionERNS_13STK_InterfaceEP19ompi_communicator_t+0x38c) [0x400005f4dfcc]
alanw0 commented 4 years ago

You can tell stk not to use the MPI_Neighbor functions by using this configure option: '-DSTK_DISABLE_MPI_NEIGHBOR_COMM:BOOL=OFF'

stanmoore1 commented 4 years ago

I tried '-DSTK_DISABLE_MPI_NEIGHBOR_COMM:BOOL=OFF'

One time it failed with:

Throw test that evaluated to true: curLID == LINVALID

Tpetra::Details::DistributedNoncontiguousDirectory<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> > constructor: Incoming global index 60630416 does not have a corresponding local index in the Directory Map.  Please report this bug to the Tpetra developers.

The other times it failed with:

[1575924384.597495] [st148:53394:0]      uct_iface.c:57   UCX  WARN  got active message id 0, but no handler installed
[1575924384.597506] [st149:124676:0]      uct_iface.c:57   UCX  WARN  got active message id 0, but no handler installed
[1575924384.597573] [st149:124676:0]      uct_iface.c:57   UCX  WARN  got active message id 0, but no handler installed
[1575924384.597598] [st149:124676:0]      eager_rcv.c:235  UCX  ERROR unexpected sync ack received: tag 30000 ep_ptr 0x8030000000006
[1575924384.597659] [st149:124676:0]      uct_iface.c:57   UCX  WARN  got active message id 13, but no handler installed
jhux2 commented 4 years ago

@stanmoore1 Does a backtrace show where the UCX warnings are coming from?

stanmoore1 commented 4 years ago

I didn't get any core dump or backtrace, even though I set "ulimit -c unlimited".

rppawlo commented 4 years ago

Here's the stack trace:

(gdb) bt
#0  __cxxabiv1::__cxa_throw (obj=0x4a69fb00, tinfo=0x8f2f0e0 <typeinfo for std::logic_error>, dest=0x1569930 <_ZNSt11logic_errorD1Ev@plt>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:80
#1  0x0000000006452b3c in Tpetra::Details::DistributedNoncontiguousDirectory<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> >::getEntriesImpl(Tpetra::Map<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> > const&, Teuchos::ArrayView<long long const> const&, Teuchos::ArrayView<int> const&, Teuchos::ArrayView<int> const&, bool) const ()
#2  0x0000000006445318 in Tpetra::Directory<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> >::getDirectoryEntries(Tpetra::Map<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> > const&, Teuchos::ArrayView<long long const> const&, Teuchos::ArrayView<int> const&) const ()
#3  0x0000000005cbe6f4 in Teuchos::RCP<Tpetra::Map<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> > const> Tpetra::createOneToOne<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> >(Teuchos::RCP<Tpetra::Map<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> > const> const&, Tpetra::Details::TieBreak<int, long long> const&) ()
#4  0x0000000003711c90 in panzer::DOFManager::buildGlobalUnknowns_GUN(Tpetra::MultiVector<long long, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> > const&, Tpetra::MultiVector<long long, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> >&) const ()
#5  0x000000000370baf0 in panzer::DOFManager::buildGlobalUnknowns(Teuchos::RCP<panzer::FieldPattern const> const&) ()
#6  0x00000000036f8708 in panzer::BlockedDOFManager::buildGlobalUnknowns(Teuchos::RCP<panzer::GlobalIndexer> const&, Teuchos::RCP<panzer::FieldPattern const> const&) const ()
#7  0x00000000036fc59c in panzer::BlockedDOFManager::buildGlobalUnknowns(Teuchos::RCP<panzer::FieldPattern const> const&) ()
#8  0x00000000036fcddc in panzer::BlockedDOFManager::buildGlobalUnknowns() ()
#9  0x00000000025566a0 in empire::ElectroMagneticSolverInterface::ElectroMagneticSolverInterface(MainParameterLists, empire::MeshContainer, empire::utils::TimeStamp&, bool, Teuchos::RCP<empire::utils::MeshEvaluationBase>) ()
#10 0x00000000017f22a4 in void meshSpecificMain<empire::MeshTraits<shards::Tetrahedron<4u>, 1> >(Teuchos::RCP<Teuchos::StackedTimer>&, Teuchos::RCP<Teuchos::MpiComm<int> const> const&, double, MainPicParameterLists&, bool, empire::MeshContainer&, empire::utils::TimeStamp&) ()
#11 0x00000000015c126c in main ()
(gdb) cont
Continuing.
kddevin commented 4 years ago

This stack trace looks like those from the so-called "UVM" parallel scan problem. But this is on stria. Cool!

jhux2 commented 4 years ago

@stanmoore1 @rppawlo And this is with the serial scan patch?

rppawlo commented 4 years ago

@jhu - no (unless it is in develop). I'm using today's trilinos develop branch. Let me try to pull that in.

kddevin commented 4 years ago

No, we didn't merge it to develop. Note that it works around only the one parallel scan that seemed "magical" on vortex. We can work around others similarly if needed.

rppawlo commented 4 years ago

Might need more of that magic. Merged in tpetra_6345 and got further down into the calculation, but seeing a similar failure. This is another instance of building a DOFManager but for faces. I'll try to get line numbers.

(gdb) bt
#0  __cxxabiv1::__cxa_throw (obj=0x56670820, tinfo=0x8f2f0e0 <typeinfo for std::logic_error>, dest=0x1569930 <_ZNSt11logic_errorD1Ev@plt>)
    at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:80
#1  0x0000000006452b3c in Tpetra::Details::DistributedNoncontiguousDirectory<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> >::getEntriesImpl(Tpetra::Map<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> > const&, Teuchos::ArrayView<long long const> const&, Teuchos::ArrayView<int> const&, Teuchos::ArrayView<int> const&, bool) const
    ()
#2  0x0000000006445318 in Tpetra::Directory<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> >::getDirectoryEntries(Tpetra::Map<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> > const&, Teuchos::ArrayView<long long const> const&, Teuchos::ArrayView<int> const&) const ()
#3  0x0000000005cbe6f4 in Teuchos::RCP<Tpetra::Map<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> > const> Tpetra::createOneToOne<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> >(Teuchos::RCP<Tpetra::Map<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> > const> const&, Tpetra::Details::TieBreak<int, long long> const&) ()
#4  0x0000000003711c90 in panzer::DOFManager::buildGlobalUnknowns_GUN(Tpetra::MultiVector<long long, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> > const&, Tpetra::MultiVector<long long, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> >&) const ()
#5  0x000000000370baf0 in panzer::DOFManager::buildGlobalUnknowns(Teuchos::RCP<panzer::FieldPattern const> const&) ()
#6  0x000000000370afb4 in panzer::DOFManager::buildGlobalUnknowns() ()
#7  0x0000000002276344 in empire::utils::MapContainer::buildFaceMap(int, bool) ()
#8  0x0000000001630938 in empire::utils::MapContainer::getFaceMap(bool) ()
kddevin commented 4 years ago

@rppawlo Is this error seen only when running with empire, or might we be able to reproduce it with a mini-em test?

If it is seen only in empire, would you be interested in trying an experiment to see whether we can make it go away by skipping the use of FixedHashTable in the Directory? The modification is straightforward; we'd be happy to add it as a runtime option in a branch, but it might be faster if you want to just try it.

In Tpetra_DirectoryImpl_def.hpp, there is a flag useHashTables_; hard-coding it to false will avoid using FixedHashTable in any directory, at the cost of runtime and memory efficiency. This should not be a long-term fix, but it might offer a clue about where things go wrong.

diff --git a/packages/tpetra/core/src/Tpetra_DirectoryImpl_def.hpp b/packages/tpetra/core/src/Tpetra_DirectoryImpl_def.hpp
index 4e8150f..1d5f60a 100644
--- a/packages/tpetra/core/src/Tpetra_DirectoryImpl_def.hpp
+++ b/packages/tpetra/core/src/Tpetra_DirectoryImpl_def.hpp
@@ -597,7 +597,8 @@ namespace Tpetra {
       // switch to a hash table - based implementation.
       const size_t inverseSparsityThreshold = 10;
       useHashTables_ =
-        (dir_numMyEntries >= inverseSparsityThreshold * map.getNodeNumElements());
+// KDD        (dir_numMyEntries >= inverseSparsityThreshold * map.getNodeNumElements());
+               false;  // KDD  force to be false for now

Before suggesting this experiment, I checked that with this patch, all Tpetra tests can pass. (One test times out, but I confirmed that, given enough time, it passes.)

If you'd prefer a branch that you can pull in, we'll be happy to do that for you. Let me know.
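The heuristic that the patch above overrides, and the runtime-option variant mentioned earlier, can be sketched as follows. This is a standalone illustration rather than Tpetra code: the function name and the forceOff parameter are hypothetical, and only the threshold of 10 and the comparison come from the diff.

```cpp
#include <cassert>
#include <cstddef>

// Sketch of the choice shown in the diff: switch to the hash-table-based
// implementation when the directory holds at least 10 times as many entries
// as the map has local elements. forceOff mimics the experiment of
// hard-coding the flag to false.
bool chooseHashTables(std::size_t dirNumMyEntries,
                      std::size_t numLocalMapElements,
                      bool forceOff = false) {
  const std::size_t inverseSparsityThreshold = 10;
  if (forceOff) {
    return false;  // experiment: never use FixedHashTable
  }
  return dirNumMyEntries >= inverseSparsityThreshold * numLocalMapElements;
}
```

With forceOff set, the choice no longer depends on the entry counts, which is what makes the experiment useful for isolating FixedHashTable as a possible culprit.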

stanmoore1 commented 4 years ago

@ktpedre said that MPI on Stria is broken which is causing these issues.

rppawlo commented 4 years ago

@kddevin - once the mpi is cleaned up, I'll rerun the tests. If there are still issues, I will try the above patch. I forwarded you the stria mpi details in a separate email.

stanmoore1 commented 4 years ago

The Stria crashes were due to system issues (not related to Trilinos) which have now been resolved. I'm closing this ticket, but will continue to investigate #6389.