libMesh / libmesh

http://libmesh.github.io
GNU Lesser General Public License v2.1

Exodus Can't Write Large Meshes #2065

Open friedmud opened 5 years ago

friedmud commented 5 years ago

Seen this several times now... Exodus simply can't handle meshes over about 200M elements:

MooseMesh::prepare()
 Mesh Information:
  elem_dimensions()={3}
  spatial_dimension()=3
  n_nodes()=259279152
    n_local_nodes()=259279152
  n_elem()=253736832
    n_local_elem()=253736832
    n_active_elem()=253736832
  n_subdomains()=11
  n_partitions()=1
  n_processors()=1
  n_threads()=1
  processor_id()=0
Error writing element blocks.
Stack frames: 13
0: libMesh::print_trace(std::ostream&)
1: libMesh::MacroFunctions::report_error(char const*, int, char const*, char const*)
2: libMesh::ExodusII_IO_Helper::write_elements(libMesh::MeshBase const&, bool)
3: libMesh::ExodusII_IO::write(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
4: MeshOnlyAction::act()
5: Action::timedAct()
6: ActionWarehouse::executeActionsWithAction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
7: ActionWarehouse::executeAllActions()
8: MooseApp::runInputFile()
9: MooseApp::run()
10: /home/gastdr/projects/lemhi_new/moose/test/moose_test-opt() [0x402297]
11: __libc_start_main
12: /home/gastdr/projects/lemhi_new/moose/test/moose_test-opt() [0x40250c]
[0] ../src/mesh/exodusII_io_helper.C, line 1417, compiled Mar  6 2019 at 09:36:38
Error closing Exodus file.

No idea what the problem is - but it's a serious bummer.

Of course, I'm not trying to RUN with these meshes... they are just intermediaries before being split. The workaround for now is to use XDR instead... but writing XDR meshes that large is prohibitively slow... so there aren't really any good solutions!
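
For anyone hitting the same thing: libMesh picks the mesh writer from the file extension, so the XDR workaround is just a different extension on the same write call. A minimal sketch, using a small generated cube and made-up file names in place of the real intermediate mesh:

#include "libmesh/libmesh.h"
#include "libmesh/replicated_mesh.h"
#include "libmesh/mesh_generation.h"

int main (int argc, char ** argv)
{
  libMesh::LibMeshInit init (argc, argv);

  // Build (or read) the large intermediate mesh serially.
  libMesh::ReplicatedMesh mesh (init.comm());
  libMesh::MeshTools::Generation::build_cube (mesh, 100, 100, 100);

  // The extension selects the format:
  mesh.write ("intermediate.xdr"); // XDR workaround - slow for huge meshes, but no 4 GiB dataset limit
  mesh.write ("intermediate.e");   // ExodusII - the path that fails for very large meshes

  return 0;
}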

pbauman commented 5 years ago

Forgive my ignorance here, but I thought Nemesis was the parallel version of Exodus that is meant for large meshes?

friedmud commented 5 years ago

It is - but all of our mesh generation routines are serial only. These Exodus files happen at a certain step in our mesh generation process... then we take them and split them into Nemesis.

It is technically possible to generate serially on each processor - then directly split (without the Exodus intermediary). But the mesh generation itself can take several hours... and I might not know how many different numbers of processors I want to run on - so it's useful to have the Exodus file output so that if I need a new splitting I can easily do that.

jwpeterson commented 5 years ago

One thing that might be relatively easy to try is building libmesh with HDF5 support. Then Exodus will write files in the NetCDF4 format, which should be better at handling larger filesizes.
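
One quick compile-time check for whether a given libMesh build has HDF5 enabled, a minimal sketch that just inspects the configuration header (assuming the LIBMESH_HAVE_HDF5 macro from libmesh_config.h):

#include "libmesh/libmesh_config.h"
#include <iostream>

int main ()
{
#ifdef LIBMESH_HAVE_HDF5
  std::cout << "Built with HDF5: Exodus output can use the NetCDF4 format\n";
#else
  std::cout << "No HDF5: Exodus output falls back to the 64-bit offset (Large Model) format\n";
#endif
  return 0;
}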

If that doesn't work... there may be a bug in the writer itself that is only exposed by really large meshes :cry:

friedmud commented 5 years ago

I'm pretty sure the problem is on our end. Two main things:

  1. We should be passing 64-bit flags to Exodus when the library is configured with dof_id_type = 8 bytes (see the sketch after this list): https://gsjaardema.github.io/seacas/html/index.html#int64

  2. There are TONS of ints running around in exodusII_io_helper! That's not going to help at all! All of those should be turned into dof_id_type....
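
A rough idea of what the first point could look like against the documented Exodus C API (this is only a sketch, not the actual exodusII_io_helper code; the file name is made up, and using LIBMESH_DOF_ID_BYTES as the switch is my assumption about where the decision would live):

#include "libmesh/libmesh_config.h"
#include "exodusII.h"

// Sketch: create an Exodus file with 64-bit integers on disk (DB) and
// through the in-memory API when dof_id_type is 8 bytes.
int create_large_exodus_file (const char * fname)
{
  int comp_ws = 8; // floating-point word size used by the calling code (bytes)
  int io_ws   = 8; // floating-point word size stored in the file (bytes)

  int mode = EX_CLOBBER;
#if LIBMESH_DOF_ID_BYTES == 8
  mode |= EX_ALL_INT64_DB | EX_ALL_INT64_API;
#endif

  return ex_create (fname, mode, &comp_ws, &io_ws); // negative return means failure
}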

jwpeterson commented 5 years ago

Our Exodus files are written in either 64-bit offset mode or NetCDF4 (run ncdump -k to see the type). But you shouldn't be running into int limits unless you have 2 billion nodes... in your last email it was 200 million. I agree that we should update the Exodus writer at some point... we have a much older version (5.22) at the moment, so I don't think it has the 64-bit API that you linked.


friedmud commented 5 years ago

Hmmm - not exactly. Check out the info here:

https://gsjaardema.github.io/seacas/html/exodus_formats.html

In 64-bit offset mode it runs into trouble at 134M elements, when writing the connectivity.

BTW: I tried to pass:

EX_ALL_INT64_DB | EX_ALL_INT64_API

as flags - and it didn't complain during the writing - but it segfaulted on the reading. I guess that even if our Exodus API supported those flags... we would have to change our reading routines to use 64-bit arrays instead of 32-bit...
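
To illustrate that last point, here is a sketch of the kind of change the reading side would need, written against the documented Exodus C calls (the block id, counts, and missing error checking are placeholders; this is not the actual libMesh reader):

#include "exodusII.h"
#include <cstdint>
#include <vector>

// Sketch: read element-block connectivity, honoring the file's int64 status.
void read_block_connectivity (int exoid, ex_entity_id blk_id,
                              int64_t n_elem, int n_nodes_per_elem)
{
  if (ex_int64_status (exoid) & EX_BULK_INT64_API)
    {
      // 64-bit API: Exodus fills int64_t buffers for bulk data.
      std::vector<int64_t> connect (n_elem * n_nodes_per_elem);
      ex_get_conn (exoid, EX_ELEM_BLOCK, blk_id, connect.data(), nullptr, nullptr);
    }
  else
    {
      // 32-bit path: plain int buffers, which no longer match a file that was
      // opened with the 64-bit API flags - hence the segfault.
      std::vector<int> connect (n_elem * n_nodes_per_elem);
      ex_get_conn (exoid, EX_ELEM_BLOCK, blk_id, connect.data(), nullptr, nullptr);
    }
}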

jwpeterson commented 5 years ago

Hmmm - not exactly. Check out the info here: https://gsjaardema.github.io/seacas/html/exodus_formats.html In 64-bit offset mode it runs into trouble at 134M elements: when writing the connectivity.

Hmm, OK, these numbers do make sense given the 4 GiB (2^32 bytes) limit for any single dataset in the file. The 134M number is specific to HEX8s; it gets worse (39.7M elements max) if you are using HEX27s. So, the limiting factor for the "Large Model (64-bit offset)" file format is never going to be numeric_limits<int>::max() (typically 2^31 - 1)... it's going to be the numbers above.

If you write in the "Netcdf-4 Non-Classic" format (which is now our default if HDF5 is available), then numeric_limits<int>::max() is going to be the limiting factor, but the current implementation should still allow you to have up to 2^31 nodes and 2^31 elements. Storing just the connectivity for that many HEX8s would require 2^31 elements * 8 nodes * 4 bytes = 64 GiB!
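
A quick back-of-the-envelope check of those numbers (plain arithmetic, nothing libMesh-specific):

#include <cstdint>
#include <iostream>

int main ()
{
  const std::uint64_t dataset_limit = std::uint64_t(1) << 32; // 4 GiB per dataset in 64-bit offset files
  const std::uint64_t int_bytes = 4;                          // 32-bit connectivity entries

  std::cout << "Max HEX8 elements:  " << dataset_limit / (8  * int_bytes) << "\n"; // ~134.2M
  std::cout << "Max HEX27 elements: " << dataset_limit / (27 * int_bytes) << "\n"; // ~39.7M

  // Connectivity for 2^31 HEX8 elements with 4-byte ints:
  const std::uint64_t n_elem = std::uint64_t(1) << 31;
  std::cout << (n_elem * 8 * int_bytes) / (std::uint64_t(1) << 30) << " GiB\n";    // 64 GiB
  return 0;
}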

luiz-bn commented 1 month ago

I'm wondering if there have been developments regarding issue #2065.

I'm running OpenMC simulations, and they have been crashing when OpenMC asks libMesh to write ExodusII/Nemesis output.

I'm using Ubuntu 22.04.4 LTS with OpenMC 0.15.0 compiled with libMesh support. I've tried pointing OpenMC at the libMesh that ships with MOOSE, and also building libMesh from scratch. Both cases give the same error.

Below is the output from OpenMC built with MOOSE’s libMesh while running this notebook:

(...)
       99/1    0.23078    0.23149 +/- 0.00072
      100/1    0.23166    0.23150 +/- 0.00072
 Creating state point statepoint.100.h5...
 Writing file: tally_1.100.e for unstructured mesh 1
libMesh terminating:
Error creating ExodusII/Nemesis mesh file.
Stack frames: 16
0: libMesh::print_trace(std::ostream&)
1: libMesh::MacroFunctions::report_error(char const*, int, char const*, char const*, std::ostream&)
2: libMesh::ExodusII_IO_Helper::create(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)
3: libMesh::ExodusII_IO::write_nodal_data_common(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, bool)
4: libMesh::ExodusII_IO::write_nodal_data_discontinuous(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<double, std::allocator<double> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&)
5: libMesh::ExodusII_IO::write_discontinuous_exodusII(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, libMesh::EquationSystems const&, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const*)
6: openmc::LibMesh::write(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const
7: openmc::write_unstructured_mesh_results()
8: openmc_statepoint_write
9: openmc::finalize_batch()
10: openmc_next_batch
11: openmc_run
12: main
13: /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f0b2d6b9d90]
14: __libc_start_main
15: openmc(+0xce35) [0x56249e8b9e35]
[0] ../src/mesh/exodusII_io_helper.C, line 2185, compiled Sep 10 2024 at 09:42:37

--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

I’d appreciate any help. Thanks, Luiz

jwpeterson commented 1 month ago

Hello @luiz-bn,

How large is the mesh you are trying to write? Do you have libmesh compiled with HDF5 support?

I'm not sure if the error message you are reporting,

Error creating ExodusII/Nemesis mesh file.

is the same as the original error reported on this Issue, which occurred while writing the element blocks, not while just creating the file.