Closed: stanmoore1 closed this issue 4 years ago.
@kddevin @trilinos/tpetra
I reproduced this on Stria and got a different error:
*********** Caught Exception: Begin Error Report ***********
../packages/tpetra/core/src/Tpetra_DirectoryImpl_def.hpp:1175:
Throw number = 2
Throw test that evaluated to true: importLen < requiredImportLen
Tpetra::Details::DistributedNoncontiguousDirectory::getEntriesImpl: On Process 15: The 'imports' array must have length at least 1273674, but its actual length is 1202616. numRecv: 636837, packetSize: 2, numEntries (# GIDs): 636837, numMissing: 0: distor.getTotalReceiveLength(): 601308.
Distributor description: "Tpetra::Distributor": {How initialized: By createFromRecvs, Parameters: {Send type: Send, Barrier between receives and sends: false, Use distinct tags: true, Debug: false}}.
Please report this bug to the Tpetra developers.
************ Caught Exception: End Error Report ************
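The figures in that message are internally consistent and point at a disagreement between the directory and the Distributor rather than a simple sizing typo. A small standalone check of the arithmetic, with the values copied verbatim from the report above:

```cpp
#include <cstddef>

// Values copied verbatim from the error report above.
constexpr std::size_t numRecv      = 636837;
constexpr std::size_t packetSize   = 2;
constexpr std::size_t totalRecvLen = 601308;  // distor.getTotalReceiveLength()

// Length the directory requires vs. length actually allocated from the
// Distributor's receive count.
constexpr std::size_t requiredImportLen = numRecv * packetSize;      // 1273674
constexpr std::size_t actualImportLen   = totalRecvLen * packetSize; // 1202616

// The throw fires because actualImportLen < requiredImportLen. The real
// inconsistency is upstream: the Distributor's total receive length (601308)
// disagrees with the directory's numRecv (636837).
```

In other words, the buffer math on each side is self-consistent; the two sides simply disagree about how many entries are arriving.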
I ran it again on Stria and got yet another different error:
*********** Caught Exception: Begin Error Report ***********
../packages/tpetra/core/src/Tpetra_DirectoryImpl_def.hpp:878:
Throw number = 2
Throw test that evaluated to true: dirMapLid == LINVALID
Tpetra::Details::DistributedNoncontiguousDirectory<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> > constructor: Incoming global index 15295108 does not have a corresponding local index in the Directory Map. Please report this bug to the Tpetra developers.
************ Caught Exception: End Error Report ************
@stanmoore1 Could you please retry with the TPETRA_DEBUG environment variable set to 1? That should introduce more barriers and all-reduces that might help with diagnosing the issue.
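For reference, TPETRA_DEBUG is read from the environment at run time, so it can simply be exported before launching the job. A minimal sketch of reading such a flag (illustrative only; the function name is hypothetical and Tpetra's actual parsing may accept other spellings):

```cpp
#include <cstdlib>
#include <cstring>

// Illustrative sketch, not Tpetra's real code: treat TPETRA_DEBUG=1 as
// enabling the extra debug barriers and all-reduces.
bool tpetraDebugEnabled() {
  const char* v = std::getenv("TPETRA_DEBUG");
  return v != nullptr && std::strcmp(v, "1") == 0;
}
```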
I'm swamped with trying to get results for EMPIRE. Can a Tpetra developer help with this? @bathmatt and I can help someone get running on Stria.
This seems to run fine on GPUs with @crtrott's serial scan hack.
Possibly related to #6237. I'm thoroughly convinced there is a memory bug somewhere.
Send me your build goodies on stria.
In addition to @csiefer2's agreement to build on stria, @kddevin will try the following:
For the very short term, @kddevin will write a serial scan and make it a compile-time option. If that fixes the problem, it will be very interesting.
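For context, the workaround replaces a device parallel scan with a plain sequential prefix sum. A hedged sketch of what a serial exclusive scan looks like (names are illustrative, not Tpetra's actual code):

```cpp
#include <cstddef>
#include <vector>

// Illustrative serial exclusive prefix sum of the kind that can stand in
// for a Kokkos::parallel_scan while hunting a suspected memory bug.
// offsets[i] is the sum of counts[0..i-1]; offsets.back() is the total.
std::vector<std::size_t>
serialExclusiveScan(const std::vector<std::size_t>& counts) {
  std::vector<std::size_t> offsets(counts.size() + 1, 0);
  for (std::size_t i = 0; i < counts.size(); ++i) {
    offsets[i + 1] = offsets[i] + counts[i];  // running total so far
  }
  return offsets;
}
```

If the serial version makes the failure disappear, that points at either the parallel scan itself or a latent race that the parallel execution exposes.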
From @bathmatt:
Debug build:
[stria-login2:74972:0:74972] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x11006050)
==== backtrace ====
0 /opt/atse/libs/arm/openucx/1.5.2/lib/libucs.so.0(+0x1e5e4) [0xfffef8d3e5e4]
1 /opt/atse/libs/arm/openucx/1.5.2/lib/libucs.so.0(+0x1d9cc) [0xfffef8d3d9cc]
2 [0xffffacad066c]
3 Trilinos/install-debug/lib/libpercept.so.12(_ZNK3stk4mesh8MetaData8get_partEj+0x28) [0xffff9b20cc68]
4 Trilinos/install-debug/lib/libstk_mesh_base.so.12(_ZN3stk4mesh18unpack_entity_infoERNS_10CommBufferERKNS0_8BulkDataERNS0_9EntityKeyERiRSt6vectorIPNS0_4PartESaISB_EERS9_INS0_8RelationESaISF_EE+0xe0) [0xffff999f0f60]
5 Trilinos/install-debug/lib/libstk_mesh_base.so.12(_ZN3stk4mesh8BulkData23unpack_not_owned_verifyERNS_10CommSparseERSo+0x328) [0xffff99795ae8]
6 Trilinos/install-debug/lib/libstk_mesh_base.so.12(_ZN3stk4mesh8BulkData37comm_mesh_verify_parallel_consistencyERSo+0x100) [0xffff9978eac0]
7 Trilinos/install-debug/lib/libstk_mesh_base.so.12(_ZN3stk4mesh8BulkData22check_mesh_consistencyEv+0x134) [0xffff9978d4b4]
8 Trilinos/install-debug/lib/libstk_mesh_base.so.12(_ZN3stk4mesh8BulkData45internal_modification_end_for_entity_creationERKSt6vectorINS_8topology6rank_tESaIS4_EENS0_4impl16MeshModification25modification_optimizationE+0x2e8) [0xffff9978d1e8]
9 Trilinos/install-debug/lib/libstk_mesh_base.so.12(_ZN3stk4mesh8BulkData36modification_end_for_entity_creationERKSt6vectorINS_8topology6rank_tESaIS4_EENS0_4impl16MeshModification25modification_optimizationE+0x3c) [0xffff9978cebc]
10 Trilinos/install-debug/lib/libstk_mesh_base.so.12(_ZN3stk4mesh12create_edgesERNS0_8BulkDataERKNS0_8SelectorEPNS0_4PartE+0xb4c) [0xffff9993228c]
11 Trilinos/install-debug/lib/libstk_mesh_base.so.12(_ZN3stk4mesh12create_edgesERNS0_8BulkDataE+0x54) [0xffff999316d4]
12 Trilinos/install-debug/lib/libstk_mesh_base.so.12(_ZN3stk4mesh24create_adjacent_entitiesERNS0_8BulkDataERSt6vectorIPNS0_4PartESaIS5_EE+0x19c) [0xffff99930e5c]
13 Trilinos/install-debug/lib/libpanzer-stk.so.12(_ZN10panzer_stk13STK_Interface13buildSubcellsEv+0x4c) [0xffffa498cdcc]
14 Trilinos/install-debug/lib/libpanzer-stk.so.12(_ZNK10panzer_stk23STK_ExodusReaderFactory24completeMeshConstructionERNS_13STK_InterfaceEP19ompi_communicator_t+0x834) [0xffffa4972cb4]
===================
I can reproduce @bathmatt's error, though that is an issue with STK and not Tpetra.
Just to clarify: on Stria, a release build fails in Tpetra, while a debug build fails in STK. On Astra, a release build fails in Tpetra, but in a different spot.
@stanmoore1 Does Karen's "serial scan" patch make this pass on Stria or Astra?
@csiefer2 can you try that out?
I am seeing crashes on GPUs for a different benchmark. I'm rolling back the Trilinos version to see if that helps.
@stanmoore1 Does Karen's "serial scan" patch make this pass on Stria or Astra?
@csiefer2 can you try that out?
On Stria, no (the code crashes in a different way), but I will try Astra.
I'm seeing crashes on GPUs even with serial scan hack (the first benchmark I ran was fine, but a second benchmark crashes), so I would be surprised if it fixes the issue on Astra.
Can't reproduce that error, but both patched and unpatched code hang on 2 nodes of Astra. This hang appears to be related to #6374.
If I turn off multijagged, the unpatched code runs to completion. Ditto for 8 node problem.
If we disable multijagged, @stanmoore1 can reproduce the behavior I see with my binary. This suggests that either (a) code changes or (b) build options are the big difference.
I can get good behavior out of my rebuild of @stanmoore1's version of Trilinos. The CMakeCache.txt files for Trilinos seem to be similar between the two. The app builds differ primarily in terms of non-Trilinos enabled TPLs.
I can now reproduce this error, but only using Stan's build of Trilinos, not using my build of Stan's source. I suspect the copy of Trilinos I grabbed is not actually the one @stanmoore1 is using.
The 11/11 code outputs a warning from STK saying something about internal sidesets and 'correctness is not guaranteed.' As per Stan, this is SOP for the app.
Verified: 11/22 works and 11/11 does not (deprecated code is off in both cases).
JFYI: With #6377, Issue #6374 is fixed.
Thanks @kddevin !
@csiefer2 helped me realize that on Astra I had an issue with the build paths, which was causing a bug already fixed in Trilinos to be exposed. When I fixed that issue, the test no longer crashed on Astra. However, I built the latest Trilinos and EMPIRE on Stria and it still crashes (and @bathmatt and @csiefer2 were also able to reproduce the crash on Stria). So I'm not sure why it crashes on Stria but not on Astra.
Here is the stack trace for the crash I'm seeing on Stria:
[1575915894.295387] [st198:130874:0] uct_iface.c:57 UCX WARN got active message id 0, but no handler installed
[1575915894.295408] [st198:130874:0] eager_rcv.c:235 UCX ERROR unexpected sync ack received: tag 30000 ep_ptr 0x8030000000006
[1575915894.295181] [st196:51926:0] uct_iface.c:57 UCX WARN got active message id 0, but no handler installed
[1575915894.295250] [st196:51926:0] uct_iface.c:57 UCX WARN got active message id 0, but no handler installed
[1575915894.295462] [st198:130874:0] uct_iface.c:57 UCX WARN got active message id 13, but no handler installed
[st196:51926:0:51926] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x203000000007c)
==== backtrace ====
[1575915894.295601] [st198:130873:0] uct_iface.c:57 UCX WARN got active message id 0, but no handler installed
0 /opt/atse/libs/arm/openucx/1.5.2/lib/libucs.so.0(+0x1e5e4) [0x40005cfce5e4]
1 /opt/atse/libs/arm/openucx/1.5.2/lib/libucs.so.0(+0x1d9cc) [0x40005cfcd9cc]
2 [0x40000005066c]
3 /opt/atse/libs/arm/openucx/1.5.2/lib/libucp.so.0(ucp_rndv_atp_handler+0x10) [0x40005ce08ad0]
4 /opt/atse/libs/arm/openucx/1.5.2/lib/libuct.so.0(+0x36c88) [0x40005ce86c88]
5 /opt/atse/libs/arm/openucx/1.5.2/lib/libucp.so.0(ucp_worker_progress+0x48) [0x40005cdfa988]
6 /opt/atse/mpi/openmpi3-arm/3.1.4/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_progress+0x18) [0x40005cd751d8]
7 /opt/atse/mpi/openmpi3-arm/3.1.4/lib/libopen-pal.so.40(opal_progress+0x4c) [0x40001a74c78c]
8 /opt/atse/mpi/openmpi3-arm/3.1.4/lib/libmpi.so.40(ompi_request_default_wait_all+0x644) [0x4000139b3484]
9 /opt/atse/mpi/openmpi3-arm/3.1.4/lib/openmpi/mca_coll_basic.so(mca_coll_basic_neighbor_alltoallv+0x888) [0x40005d137588]
10 /opt/atse/mpi/openmpi3-arm/3.1.4/lib/libmpi.so.40(MPI_Neighbor_alltoallv+0x230) [0x4000139ee1b0]
11 lib/libstk_util_parallel.so.12(_ZN3stk13CommNeighbors30perform_neighbor_communicationEP19ompi_communicator_tRKSt6vectorIhSaIhEERKS3_IiSaIiEESB_RS5_RS9_SD_+0x210) [0x40000bf794d0]
12 lib/libstk_util_parallel.so.12(_ZN3stk13CommNeighbors11communicateEv+0x174) [0x40000bf79674]
13 lib/libstk_mesh_base.so.12(_ZN3stk4mesh22communicate_field_dataERKNS0_8GhostingERKSt6vectorIPKNS0_9FieldBaseESaIS7_EE+0x10f0) [0x40000bc55b30]
14 lib/libstk_io.so.12(_ZN3stk2io15StkMeshIoBroker18populate_bulk_dataEv+0x18c) [0x40000ba4488c]
15 lib/libpanzer-stk.so.12(_ZNK10panzer_stk23STK_ExodusReaderFactory24completeMeshConstructionERNS_13STK_InterfaceEP19ompi_communicator_t+0x38c) [0x400005f4dfcc]
You can tell stk not to use the MPI_Neighbor functions by using this configure option: '-DSTK_DISABLE_MPI_NEIGHBOR_COMM:BOOL=ON'
I tried '-DSTK_DISABLE_MPI_NEIGHBOR_COMM:BOOL=ON'
One time it failed with:
Throw test that evaluated to true: curLID == LINVALID
Tpetra::Details::DistributedNoncontiguousDirectory<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> > constructor: Incoming global index 60630416 does not have a corresponding local index in the Directory Map. Please report this bug to the Tpetra developers.
The other times it failed with:
[1575924384.597495] [st148:53394:0] uct_iface.c:57 UCX WARN got active message id 0, but no handler installed
[1575924384.597506] [st149:124676:0] uct_iface.c:57 UCX WARN got active message id 0, but no handler installed
[1575924384.597573] [st149:124676:0] uct_iface.c:57 UCX WARN got active message id 0, but no handler installed
[1575924384.597598] [st149:124676:0] eager_rcv.c:235 UCX ERROR unexpected sync ack received: tag 30000 ep_ptr 0x8030000000006
[1575924384.597659] [st149:124676:0] uct_iface.c:57 UCX WARN got active message id 13, but no handler installed
@stanmoore1 Does a backtrace show where the UCX warnings are coming from?
I didn't get any core dump or backtrace, even though I set "ulimit -c unlimited".
Here's the stack trace:
(gdb) bt
#0 __cxxabiv1::__cxa_throw (obj=0x4a69fb00, tinfo=0x8f2f0e0 <typeinfo for std::logic_error>, dest=0x1569930 <_ZNSt11logic_errorD1Ev@plt>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:80
#1 0x0000000006452b3c in Tpetra::Details::DistributedNoncontiguousDirectory<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> >::getEntriesImpl(Tpetra::Map<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> > const&, Teuchos::ArrayView<long long const> const&, Teuchos::ArrayView<int> const&, Teuchos::ArrayView<int> const&, bool) const ()
#2 0x0000000006445318 in Tpetra::Directory<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> >::getDirectoryEntries(Tpetra::Map<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> > const&, Teuchos::ArrayView<long long const> const&, Teuchos::ArrayView<int> const&) const ()
#3 0x0000000005cbe6f4 in Teuchos::RCP<Tpetra::Map<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> > const> Tpetra::createOneToOne<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> >(Teuchos::RCP<Tpetra::Map<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> > const> const&, Tpetra::Details::TieBreak<int, long long> const&) ()
#4 0x0000000003711c90 in panzer::DOFManager::buildGlobalUnknowns_GUN(Tpetra::MultiVector<long long, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> > const&, Tpetra::MultiVector<long long, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> >&) const ()
#5 0x000000000370baf0 in panzer::DOFManager::buildGlobalUnknowns(Teuchos::RCP<panzer::FieldPattern const> const&) ()
#6 0x00000000036f8708 in panzer::BlockedDOFManager::buildGlobalUnknowns(Teuchos::RCP<panzer::GlobalIndexer> const&, Teuchos::RCP<panzer::FieldPattern const> const&) const ()
#7 0x00000000036fc59c in panzer::BlockedDOFManager::buildGlobalUnknowns(Teuchos::RCP<panzer::FieldPattern const> const&) ()
#8 0x00000000036fcddc in panzer::BlockedDOFManager::buildGlobalUnknowns() ()
#9 0x00000000025566a0 in empire::ElectroMagneticSolverInterface::ElectroMagneticSolverInterface(MainParameterLists, empire::MeshContainer, empire::utils::TimeStamp&, bool, Teuchos::RCP<empire::utils::MeshEvaluationBase>) ()
#10 0x00000000017f22a4 in void meshSpecificMain<empire::MeshTraits<shards::Tetrahedron<4u>, 1> >(Teuchos::RCP<Teuchos::StackedTimer>&, Teuchos::RCP<Teuchos::MpiComm<int> const> const&, double, MainPicParameterLists&, bool, empire::MeshContainer&, empire::utils::TimeStamp&) ()
#11 0x00000000015c126c in main ()
(gdb) cont
Continuing.
This stack trace looks like those from the so-called "UVM" parallel scan problem. But this is on Stria. Cool!
@stanmoore1 @rppawlo And this is with the serial scan patch?
@jhu - no (unless it is in develop). I'm using today's Trilinos develop branch. Let me try to pull that in.
No, we didn't merge it to develop. Note that it works around only the one parallel scan that seemed "magical" on vortex. We can work around others similarly if needed.
Might need more of that magic. Merged in tpetra_6345 and got further down into the calculation, but seeing a similar failure. This is another instance of building a DOFManager but for faces. I'll try to get line numbers.
(gdb) bt
#0 __cxxabiv1::__cxa_throw (obj=0x56670820, tinfo=0x8f2f0e0 <typeinfo for std::logic_error>, dest=0x1569930 <_ZNSt11logic_errorD1Ev@plt>)
at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:80
#1 0x0000000006452b3c in Tpetra::Details::DistributedNoncontiguousDirectory<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> >::getEntriesImpl(Tpetra::Map<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> > const&, Teuchos::ArrayView<long long const> const&, Teuchos::ArrayView<int> const&, Teuchos::ArrayView<int> const&, bool) const
()
#2 0x0000000006445318 in Tpetra::Directory<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> >::getDirectoryEntries(Tpetra::Map<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> > const&, Teuchos::ArrayView<long long const> const&, Teuchos::ArrayView<int> const&) const ()
#3 0x0000000005cbe6f4 in Teuchos::RCP<Tpetra::Map<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> > const> Tpetra::createOneToOne<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> >(Teuchos::RCP<Tpetra::Map<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> > const> const&, Tpetra::Details::TieBreak<int, long long> const&) ()
#4 0x0000000003711c90 in panzer::DOFManager::buildGlobalUnknowns_GUN(Tpetra::MultiVector<long long, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> > const&, Tpetra::MultiVector<long long, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> >&) const ()
#5 0x000000000370baf0 in panzer::DOFManager::buildGlobalUnknowns(Teuchos::RCP<panzer::FieldPattern const> const&) ()
#6 0x000000000370afb4 in panzer::DOFManager::buildGlobalUnknowns() ()
#7 0x0000000002276344 in empire::utils::MapContainer::buildFaceMap(int, bool) ()
#8 0x0000000001630938 in empire::utils::MapContainer::getFaceMap(bool) ()
@rppawlo Is this error seen only when running with empire, or might we be able to reproduce it with a mini-em test?
If it is seen only in empire, would you be interested in trying an experiment to see whether we can make it go away by skipping the use of FixedHashTable in the Directory? The modification is straightforward; we'd be happy to add it as a runtime option in a branch, but it might be faster if you want to just try it.
In Tpetra_DirectoryImpl_def.hpp, there is a flag useHashTables_; hard-coding it to false will avoid using FixedHashTable in any directory, at the cost of runtime and memory efficiency. This should not be a long-term fix, but it might offer a clue about where things go wrong.
diff --git a/packages/tpetra/core/src/Tpetra_DirectoryImpl_def.hpp b/packages/tpetra/core/src/Tpetra_DirectoryImpl_def.hpp
index 4e8150f..1d5f60a 100644
--- a/packages/tpetra/core/src/Tpetra_DirectoryImpl_def.hpp
+++ b/packages/tpetra/core/src/Tpetra_DirectoryImpl_def.hpp
@@ -597,7 +597,8 @@ namespace Tpetra {
       // switch to a hash table - based implementation.
       const size_t inverseSparsityThreshold = 10;
       useHashTables_ =
-        (dir_numMyEntries >= inverseSparsityThreshold * map.getNodeNumElements());
+// KDD  (dir_numMyEntries >= inverseSparsityThreshold * map.getNodeNumElements());
+        false; // KDD force to be false for now
Before suggesting this experiment, I checked that with this patch, all Tpetra tests can pass. (One test times out, but I confirmed that, given enough time, it passes.)
If you'd prefer a branch that you can pull in, we'll be happy to do that for you. Let me know.
@ktpedre said that MPI on Stria is broken which is causing these issues.
@kddevin - once the mpi is cleaned up, I'll rerun the tests. If there are still issues, I will try the above patch. I forwarded you the stria mpi details in a separate email.
The Stria crashes were due to system issues (not related to Trilinos) which have now been resolved. I'm closing this ticket, but will continue to investigate #6389.
Bug Report
@trilinos/tpetra
Description
Running EMPIRE on Astra with Trilinos from 11-22-2019, I get the error 'Tpetra:CrsGraph not enough capacity to insert ...'. This has the same signature as the so-called "UVM" bug on Sierra. Using an older version of EMPIRE and Trilinos from 11-11-2019 does not fail. @bathmatt can we get this labeled as a super critical L1 milestone blocker?