Open GiudGiud opened 2 months ago
To summarize the above, the issue only occurs in dbg builds on Linux.
The issue seems to be a bug in HDF5 that is hit when creating Exodus files. The function where the FPE occurs is HDF5's H5Eset_auto2. The backtrace is listed below. While stepping through the code in gdb, the following command has been useful for determining whether an FPE has occurred yet:
call (int)fetestexcept(FE_DIVBYZERO | FE_INVALID)
This command returns a nonzero value (the bitmask of the flags that are set) if an FPE has occurred and 0 otherwise.
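The same check can be compiled into a test program when attaching gdb is inconvenient. This is only a minimal sketch: the helper names fpe_flag_raised and clear_fpe_flags are made up for illustration and are not part of MOOSE, libMesh, or HDF5. Only the standard <cfenv> calls are real.

```cpp
#include <cfenv>

// Hypothetical helper: true if a divide-by-zero or invalid-operation flag is
// currently set. fetestexcept returns a bitmask of the raised flags among
// those requested, so the result is compared against 0 rather than 1.
bool fpe_flag_raised()
{
  return std::fetestexcept(FE_DIVBYZERO | FE_INVALID) != 0;
}

// Hypothetical helper: clear the two flags so the next check starts clean.
void clear_fpe_flags()
{
  std::feclearexcept(FE_DIVBYZERO | FE_INVALID);
}
```

Calling clear_fpe_flags() before a suspect call and fpe_flag_raised() right after it narrows down where the flag gets set, in the same way as repeating the gdb command between steps.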
The version of HDF5 in the conda environments I used is 1.14.3. I created a conda environment using moose-dev hdf5=1.12.1, and the input ran without crashing. This suggests the bug was introduced somewhere between HDF5 1.12.1 and 1.14.3. Tagging @milljm so he is aware that there seems to be a bug in the version of HDF5 shipped in our conda packages.
I have not pinpointed the bug within H5Eset_auto2 because I cannot step into it using gdb, since HDF5 was not compiled with debugging symbols. However, @roystgnr and I were able to reproduce the bug in libMesh's unit test suite. Roy will build a more recent version of HDF5 to see if he can further pinpoint the issue. If not, I will build HDF5 myself and point PETSc to it. Then I should be able to get further with the debugger.
Thanks to @roystgnr, @milljm, and @lindsayad for their help!
Backtrace
#0 0x00007fffdf6a2c70 in H5Eset_auto2 () from /home/behnpa/miniforge/envs/moose_with_libmesh_petsc/lib/libhdf5.so.310
#1 0x00007fffe81a5958 in set_auto (func=0x0, client_data=0x0) at ../../../../../contrib/netcdf/netcdf-c-4.6.2/libhdf5/hdf5internal.c:67
#2 0x00007fffe81a596d in nc4_hdf5_initialize () at ../../../../../contrib/netcdf/netcdf-c-4.6.2/libhdf5/hdf5internal.c:78
#3 0x00007fffe81aff4d in NC4_initialize () at ../../../../../contrib/netcdf/netcdf-c-4.6.2/libsrc4/nc4dispatch.c:139
#4 0x00007fffe8145615 in nc_initialize () at ../../../../../contrib/netcdf/netcdf-c-4.6.2/liblib/nc_initialize.c:91
#5 0x00007fffe8149d2d in NC_create (path0=0x7fffffff8d00 "input_out.e", cmode=768, initialsz=0, basepe=0, chunksizehintp=0x0,
useparallel=0, parameters=0x0, ncidp=0x7fffffff890c) at ../../../../../contrib/netcdf/netcdf-c-4.6.2/libdispatch/dfile.c:2036
#6 0x00007fffe81492f2 in nc__create (path=0x7fffffff8d00 "input_out.e", cmode=768, initialsz=0, chunksizehintp=0x0, ncidp=0x7fffffff890c)
at ../../../../../contrib/netcdf/netcdf-c-4.6.2/libdispatch/dfile.c:629
#7 0x00007fffe81492ab in nc_create (path=0x7fffffff8d00 "input_out.e", cmode=768, ncidp=0x7fffffff890c)
at ../../../../../contrib/netcdf/netcdf-c-4.6.2/libdispatch/dfile.c:556
#8 0x00007fffeb60acc1 in ex_create_int (path=0x7fffffff8d00 "input_out.e", cmode=8, comp_ws=0x7fffffff8acc, io_ws=0x7fffffff8ac8,
run_version=811) at ../../../../../contrib/exodusii/v8.11/exodus/src/ex_create.c:155
#9 0x00007fffeab76a3c in libMesh::ExodusII_IO_Helper::create (this=0x555556171a70, filename=...) at ../src/mesh/exodusII_io_helper.C:2183
#10 0x00007fffeab3f032 in libMesh::ExodusII_IO::write_nodal_data_common (this=0x5555560c21e0, fname=..., names=..., continuous=true)
at ../src/mesh/exodusII_io.C:2300
#11 0x00007fffeab3c1e3 in libMesh::ExodusII_IO::write_nodal_data (this=0x5555560c21e0, fname=..., soln=..., names=...)
at ../src/mesh/exodusII_io.C:1824
#12 0x00007fffeae62196 in libMesh::MeshOutput<libMesh::MeshBase>::write_equation_systems (this=0x5555560c2220, fname=..., es=...,
system_names=0x0) at ../src/mesh/mesh_output.C:82
#13 0x00007fffeab3d0df in libMesh::ExodusII_IO::write_timestep (this=0x5555560c21e0, fname=..., es=..., timestep=1, time=0,
system_names=0x0) at ../src/mesh/exodusII_io.C:2000
#14 0x00007ffff5cd9945 in Exodus::outputNodalVariables (this=0x555555fd9730)
at /data/behnpa/projects/moose_libmesh_test/framework/src/outputs/Exodus.C:321
#15 0x00007ffff5cbb726 in AdvancedOutput::output (this=0x555555fd9730)
at /data/behnpa/projects/moose_libmesh_test/framework/src/outputs/AdvancedOutput.C:286
#16 0x00007ffff5cda3c4 in Exodus::output (this=0x555555fd9730)
at /data/behnpa/projects/moose_libmesh_test/framework/src/outputs/Exodus.C:454
#17 0x00007ffff5cea0e3 in OversampleOutput::outputStep (this=0x555555fd9730, type=...)
at /data/behnpa/projects/moose_libmesh_test/framework/src/outputs/OversampleOutput.C:100
#18 0x00007ffff5ce79ba in OutputWarehouse::outputStep (this=0x55555594ac10, type=...)
at /data/behnpa/projects/moose_libmesh_test/framework/src/outputs/OutputWarehouse.C:157
#19 0x00007ffff54af5d6 in FEProblemBase::outputStep (this=0x555555d3c7d0, type=...)
at /data/behnpa/projects/moose_libmesh_test/framework/src/problems/FEProblemBase.C:6291
#20 0x00007ffff431ab95 in Transient::preExecute (this=0x555555da1000)
at /data/behnpa/projects/moose_libmesh_test/framework/src/executioners/Transient.C:254
#21 0x00007ffff431ad08 in Transient::execute (this=0x555555da1000)
at /data/behnpa/projects/moose_libmesh_test/framework/src/executioners/Transient.C:283
#22 0x00007ffff46eaa13 in MooseApp::executeExecutioner (this=0x55555594a370)
at /data/behnpa/projects/moose_libmesh_test/framework/src/base/MooseApp.C:1172
#23 0x00007ffff46f1522 in MooseApp::run (this=0x55555594a370) at /data/behnpa/projects/moose_libmesh_test/framework/src/base/MooseApp.C:1554
#24 0x000055555556350d in Moose::main<SolidMechanicsTestApp> (argc=3, argv=0x7fffffffb8a8)
at /data/behnpa/projects/moose_libmesh_test/framework/build/header_symlinks/MooseMain.h:47
#25 0x0000555555562407 in main (argc=3, argv=0x7fffffffb8a8)
at /data/behnpa/projects/moose_libmesh_test/modules/solid_mechanics/src/main.C:17
Pinning HDF5 to 1.12.1 will be hella fun (not).
IIRC it was a requirement to bump HDF5, when we found it necessary to bump MPICH to 4.x, because we found it necessary to bump MPICH due to wanting to support Python 3.11 =D
Copying the important bits from slack:
1.14.2 appears to be fine, and at https://github.com/HDFGroup/hdf5/issues/4381 I see they fixed the issue with 1.14.3 and the fix made it into 1.14.4. Testing the latest release (1.14.4.3) seems to confirm that for me. So we ought to be able to get away with either a tiny upgrade or a tiny downgrade, no need to back off all the way to 1.12.
If we can't manage any version change, I think I can get a small (the ifdefs and comments will be longer than the code...) workaround in at the libMesh level; just let me know.
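For anyone wondering what a workaround of that kind could look like, here is a rough sketch. It is not the actual libMesh change, and the class name ScopedFPTrapSuspend is hypothetical. The assumption is that the spurious FPE only needs to be masked around the call that first initializes HDF5, and it relies only on the standard feholdexcept and fesetenv calls.

```cpp
#include <cfenv>

// Hypothetical RAII guard (not a libMesh or MOOSE class): suspends floating
// point traps for the lifetime of the object and restores the caller's
// floating point environment on destruction.
class ScopedFPTrapSuspend
{
public:
  // feholdexcept saves the current environment, clears the status flags and
  // installs non-stop mode (if available), so exceptions raised while the
  // guard is alive set flags instead of trapping.
  ScopedFPTrapSuspend() : _saved(std::feholdexcept(&_env) == 0) {}

  // fesetenv restores the saved environment, which also discards any flags
  // that were raised while the guard was alive.
  ~ScopedFPTrapSuspend()
  {
    if (_saved)
      std::fesetenv(&_env);
  }

  ScopedFPTrapSuspend(const ScopedFPTrapSuspend &) = delete;
  ScopedFPTrapSuspend & operator=(const ScopedFPTrapSuspend &) = delete;

private:
  std::fenv_t _env;
  bool _saved;
};
```

The guard would be declared in a small scope around the suspect call, so trapping stays enabled everywhere else. Discarding the flags on exit is a deliberate choice in this sketch; a variant that wants to keep them could use feupdateenv instead, at the cost of re-raising the exception once traps are back on.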
I'll see about getting that in with this PR: https://github.com/idaholab/moose/pull/28399
Bug description
A floating point exception is generated and then caught by the FPExceptionGuard, as intended. However, we don't get the location of the exception even with the necessary breakpoint, which is not intended.
How to reproduce
On a Linux machine, use this input
Impact
Cannot simulate in debug mode, which is a problem for debugging and development
Discussed in https://github.com/idaholab/moose/discussions/28252