parthenon-hpc-lab / parthenon

Parthenon AMR infrastructure
https://parthenon-hpc-lab.github.io/parthenon/

Segfault Writing HDF5 with Large Deep AMR #589

Open forrestglines opened 3 years ago

forrestglines commented 3 years ago

I ran into an error running AthenaPK while testing a deep AMR hierarchy on a large grid with HDF5 compression turned on. After some testing I was able to reproduce the error with the advection example, although it required a larger grid. This input file reproduces the segfault with the advection example in Parthenon: parthinput.deepAMR.advection.txt. Note that the advection example takes a while before it starts writing the first HDF5 file.

The code segfaults while writing the HDF5 file, in parthenon_hdf5.cpp on a call to HDF5WriteND. Both HDF5/1.10.7 and HDF5/1.12.0 segfault. The error seems to be related both to the size of the data being written (around 1 GB) and to the deep AMR structure: reducing the number of variables in AthenaPK at a given mesh size avoided the segfault, while a flat mesh that occupied all available memory did not trigger it.

Interestingly, when I run the same problem on 2 nodes instead of just one, the code doesn't segfault, but it hangs after writing only a portion of the dataset (134M), with a lock on the HDF5 file.

brtnfld commented 3 years ago

Am I correct in saying that Parthenon sets a chunk size that is the entire size of a block (nx1 x nx2 x nx3)? Could you be hitting the 4 GB limit on chunk size?
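(Aside, not from the thread: HDF5 caps a single chunk at 4 GiB, so the question above can be settled with simple arithmetic. A minimal sketch, assuming one chunk spans one meshblock of double-precision values:)

```python
# Back-of-the-envelope check against HDF5's 4 GiB-per-chunk limit,
# assuming one chunk covers one meshblock of doubles (8 bytes each).
CHUNK_LIMIT_BYTES = 4 * 1024**3  # HDF5's hard limit on a single chunk

def meshblock_chunk_bytes(nx1, nx2, nx3, nvar=1, itemsize=8):
    """Bytes in a chunk spanning one nx1 x nx2 x nx3 meshblock."""
    return nx1 * nx2 * nx3 * nvar * itemsize

# A 32^3 meshblock of doubles is only 256 KiB, far below the limit.
print(meshblock_chunk_bytes(32, 32, 32))                     # 262144
print(meshblock_chunk_bytes(32, 32, 32) < CHUNK_LIMIT_BYTES) # True
```

For the small meshblocks typical of a deep AMR hierarchy this is nowhere near the limit, consistent with the follow-up comments below.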

Yurlungur commented 3 years ago

I think it is setting that chunk size, though I could be wrong... However, I don't think that could be the issue... a deep AMR hierarchy likely has pretty small meshblock sizes.

forrestglines commented 3 years ago

Sorry for the delay - postdoc application season.

I had some HPC troubles reproducing this error again. I've sorted through those issues and now have a core file. However, the core file is 179GB. I might be able to transfer it directly to an MSU Google Drive to share it that way, but it would take some time to figure it out.

There's also a reproducer in the description of the issue, i.e. try:

srun -n 1 /PATH_TO_PARTHENON_BUILD/example/advection/advection-example -i parthinput.deepAMR.txt

with the input file in the issue, and it should reproduce the segfault. So far I've only tried this on one machine, but with GPUs, CPUs, release and debug builds, and both HDF5/1.10.7 and HDF5/1.12.0.

Currently I get this error writing the first output

```bash
mca_fbtl_posix_pwritev: error in writev:Bad address
mca_fbtl_posix_pwritev: error in writev:Bad address
mca_fbtl_posix_pwritev: error in writev:Bad address
mca_fbtl_posix_pwritev: error in writev:Bad address
mca_fbtl_posix_pwritev: error in writev:Bad address
mca_fbtl_posix_pwritev: error in writev:Bad address
mca_fbtl_posix_pwritev: error in writev:Bad address
mca_fbtl_posix_pwritev: error in writev:Bad address
mca_fbtl_posix_pwritev: error in writev:Bad address
mca_fbtl_posix_pwritev: error in writev:Bad address
mca_fbtl_posix_pwritev: error in writev:Bad address
mca_fbtl_posix_pwritev: error in writev:Bad address
mca_fbtl_posix_pwritev: error in writev:Bad address
mca_fbtl_posix_pwritev: error in writev:Bad address
mca_fbtl_posix_pwritev: error in writev:Bad address
HDF5-DIAG: Error detected in HDF5 (1.10.7) MPI-process 0:
  #000: H5FDint.c line 249 in H5FD_write(): driver write request failed
    major: Virtual File Layer
    minor: Write failed
  #001: H5FDmpio.c line 1679 in H5FD_mpio_write(): file write failed
    major: Low-level I/O
    minor: Write failed
  #002: H5B.c line 594 in H5B_insert(): unable to insert key
    major: B-Tree node
    minor: Unable to initialize object
  #003: H5B.c line 1095 in H5B__insert_helper(): unable to unprotect child
    major: B-Tree node
    minor: Unable to unprotect metadata
  #004: H5AC.c line 1776 in H5AC_unprotect(): Can't run sync point
    major: Object cache
    minor: Unable to flush data from cache
  #005: H5ACmpio.c line 2177 in H5AC__run_sync_point(): H5AC__rsp__dist_md_write__flush_to_min_clean() failed.
    major: Object cache
    minor: Can't get value
  #006: H5ACmpio.c line 1836 in H5AC__rsp__dist_md_write__flush_to_min_clean(): Can't propagate and apply candidate list.
    major: Object cache
    minor: Unable to flush data from cache
  #007: H5ACmpio.c line 1284 in H5AC__propagate_and_apply_candidate_list(): Can't apply candidate list.
    major: Object cache
    minor: Internal error detected
  #008: H5Cmpio.c line 394 in H5C_apply_candidate_list(): flush candidates failed
    major: Object cache
    minor: Unable to flush data from cache
  #009: H5Cmpio.c line 1243 in H5C__flush_candidate_entries(): flush candidates in ring failed
    major: Object cache
    minor: Unable to flush data from cache
  #010: H5Cmpio.c line 1422 in H5C__flush_candidates_in_ring(): can't flush entry
    major: Object cache
    minor: Unable to flush data from cache
  #011: H5C.c line 6972 in H5C__flush_single_entry(): Can't write image to file
    major: Object cache
    minor: Unable to flush data from cache
  #012: H5Fio.c line 161 in H5F_block_write(): write through page buffer failed
    major: Low-level I/O
    minor: Write failed
  #013: H5PB.c line 1032 in H5PB_write(): write through metadata accumulator failed
    major: Page Buffering
    minor: Write failed
  #014: H5Faccum.c line 824 in H5F__accum_write(): file write failed
    major: Low-level I/O
    minor: Write failed
  #015: H5FDint.c line 249 in H5FD_write(): driver write request failed
    major: Virtual File Layer
    minor: Write failed
  #016: H5FDmpio.c line 1679 in H5FD_mpio_write(): file write failed
    major: Low-level I/O
    minor: Write failed
  #017: H5B.c line 994 in H5B__insert_helper(): can't insert subtree
    major: B-Tree node
    minor: Unable to insert object
  #018: H5B.c line 1095 in H5B__insert_helper(): unable to unprotect child
    major: B-Tree node
    minor: Unable to unprotect metadata
  #019: H5AC.c line 1776 in H5AC_unprotect(): Can't run sync point
    major: Object cache
    minor: Unable to flush data from cache
  #020: H5ACmpio.c line 2177 in H5AC__run_sync_point(): H5AC__rsp__dist_md_write__flush_to_min_clean() failed.
    major: Object cache
    minor: Can't get value
  #021: H5ACmpio.c line 1836 in H5AC__rsp__dist_md_write__flush_to_min_clean(): Can't propagate and apply candidate list.
    major: Object cache
    minor: Unable to flush data from cache
  #022: H5ACmpio.c line 1284 in H5AC__propagate_and_apply_candidate_list(): Can't apply candidate list.
    major: Object cache
    minor: Internal error detected
  #023: H5Cmpio.c line 394 in H5C_apply_candidate_list(): flush candidates failed
    major: Object cache
    minor: Unable to flush data from cache
  #024: H5Cmpio.c line 1243 in H5C__flush_candidate_entries(): flush candidates in ring failed
    major: Object cache
    minor: Unable to flush data from cache
  #025: H5Cmpio.c line 1422 in H5C__flush_candidates_in_ring(): can't flush entry
    major: Object cache
    minor: Unable to flush data from cache
  #026: H5C.c line 6972 in H5C__flush_single_entry(): Can't write image to file
    major: Object cache
    minor: Unable to flush data from cache
  #027: H5Fio.c line 161 in H5F_block_write(): write through page buffer failed
    major: Low-level I/O
    minor: Write failed
  #028: H5PB.c line 1032 in H5PB_write(): write through metadata accumulator failed
    major: Page Buffering
    minor: Write failed
  #029: H5Faccum.c line 824 in H5F__accum_write(): file write failed
    major: Low-level I/O
    minor: Write failed
  #030: H5FDint.c line 249 in H5FD_write(): driver write request failed
    major: Virtual File Layer
    minor: Write failed
  #031: H5FDmpio.c line 1679 in H5FD_mpio_write(): file write failed
    major: Low-level I/O
    minor: Write failed
terminate called after throwing an instance of 'std::runtime_error'
  what():  ### PARTHENON ERROR
  Message:     HDF5 failure: `H5Dwrite(gDSet, type, local_space, global_space, plist_xfer, data)`, Code: 0xffffffff
  File:        /mnt/home/glinesfo/code/parthenon/parthenon/src/outputs/parthenon_hdf5.hpp
  Line number: 131
[skl-152:34625] *** Process received signal ***
[skl-152:34625] Signal: Aborted (6)
[skl-152:34625] Signal code:  (-6)
[skl-152:34625] [ 0] /lib64/libpthread.so.0(+0xf630)[0x2adbaa61d630]
```

@brtnfld It is writing all meshblocks at once, but the chunk size is set to one meshblock. Each meshblock here is 32x32x32x1 doubles, and there are 44584 of them. The full data output is 11 GB.
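(A quick sanity check of those numbers, plain arithmetic rather than Parthenon code: one 32x32x32x1 chunk of doubles is 256 KiB, and 44584 of them come to roughly 11 GB, matching the figures above.)

```python
# Sanity-check the sizes quoted above: 44584 meshblocks, each a
# 32x32x32x1 chunk of double-precision (8-byte) values.
block_bytes = 32 * 32 * 32 * 1 * 8   # bytes per chunk
n_blocks = 44584
total_bytes = n_blocks * block_bytes

print(block_bytes // 1024)           # 256 (KiB per chunk)
print(round(total_bytes / 1e9, 1))   # 11.7 (GB total)
```

So the failure regime here is many small chunks and a large aggregate write, not any single oversized chunk.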

forrestglines commented 3 years ago

After some unrelated code changes and fixes to AthenaPK, I'm now getting segfaults while writing HDF5 without compression as well. This is on my forrestglines/cluster_agn_triggering branch of AthenaPK. I haven't reproduced this in just Parthenon, though.

This size of grid works:

cluster.medium.input.txt

This size of grid segfaults when writing.

cluster.large.input.txt

To run it, you also need this file:

schure.cooling.txt

brtnfld commented 2 years ago

I was able to reproduce the problem on skybridge, looking into the issue now.

brtnfld commented 2 years ago

Which type of MPI was this with, mpich or openmpi? I'm getting failures for MPI_Allgatherv in HDF5.

forrestglines commented 2 years ago

This was with OpenMPI/4.0.3 and HDF5/1.12.0

brtnfld commented 2 years ago

Can you try it with 4.1? It worked for me with 4.1, but not 4.0.

brtnfld commented 2 years ago

Actually, that was with a patch in HDF5; with no patch, 4.1 still fails.

epourmal commented 2 years ago

Just to provide an update: we confirmed that this is an issue in the HDF5 library when compression is used in parallel. We are actively working on the fix.

Yurlungur commented 2 years ago

That's good to hear. Thanks for the update, @epourmal !

brtnfld commented 2 years ago

Can you try it with hdf5_1_13_1 (or develop)? It has new optimizations for parallel compression, including memory improvements. With the example, I could get it to run on skybridge, but I still needed at least 64 ranks to avoid memory issues; that was also the case with just chunked datasets (no filters).

brtnfld commented 1 year ago

Do you have any updates on this issue? Do you know if it has been resolved?