forrestglines opened this issue 3 years ago
Am I correct in saying that Parthenon is setting the chunk size to the entire size of a block (nx1 x nx2 x nx3)? Could you be hitting the 4GB limit for chunk size?
I think it is setting that chunk size, though I could be wrong. However, I don't think that's the issue: a deep AMR hierarchy likely has fairly small meshblock sizes.
Sorry for the delay - postdoc application season.
I had some HPC troubles reproducing this error again. I've sorted through those issues and now have a core file. However, the core file is 179GB. I might be able to transfer it directly to an MSU Google Drive to share it that way, but it would take some time to figure it out.
There's also a reproducer in the issue description, i.e. try:

```bash
srun -n 1 /PATH_TO_PARTHENON_BUILD/example/advection/advection-example -i parthinput.deepAMR.txt
```

with the input file from the issue; it should reproduce the segfault. So far I've only tried this on one machine, but the error occurs on GPUs and CPUs, in release and debug builds, and with both HDF5/1.10.7 and HDF5/1.12.0.
```bash
mca_fbtl_posix_pwritev: error in writev:Bad address
mca_fbtl_posix_pwritev: error in writev:Bad address
mca_fbtl_posix_pwritev: error in writev:Bad address
mca_fbtl_posix_pwritev: error in writev:Bad address
mca_fbtl_posix_pwritev: error in writev:Bad address
mca_fbtl_posix_pwritev: error in writev:Bad address
mca_fbtl_posix_pwritev: error in writev:Bad address
mca_fbtl_posix_pwritev: error in writev:Bad address
mca_fbtl_posix_pwritev: error in writev:Bad address
mca_fbtl_posix_pwritev: error in writev:Bad address
mca_fbtl_posix_pwritev: error in writev:Bad address
mca_fbtl_posix_pwritev: error in writev:Bad address
mca_fbtl_posix_pwritev: error in writev:Bad address
mca_fbtl_posix_pwritev: error in writev:Bad address
mca_fbtl_posix_pwritev: error in writev:Bad address
HDF5-DIAG: Error detected in HDF5 (1.10.7) MPI-process 0:
  #000: H5FDint.c line 249 in H5FD_write(): driver write request failed
    major: Virtual File Layer
    minor: Write failed
  #001: H5FDmpio.c line 1679 in H5FD_mpio_write(): file write failed
    major: Low-level I/O
    minor: Write failed
  #002: H5B.c line 594 in H5B_insert(): unable to insert key
    major: B-Tree node
    minor: Unable to initialize object
  #003: H5B.c line 1095 in H5B__insert_helper(): unable to unprotect child
    major: B-Tree node
    minor: Unable to unprotect metadata
  #004: H5AC.c line 1776 in H5AC_unprotect(): Can't run sync point
    major: Object cache
    minor: Unable to flush data from cache
  #005: H5ACmpio.c line 2177 in H5AC__run_sync_point(): H5AC__rsp__dist_md_write__flush_to_min_clean() failed.
    major: Object cache
    minor: Can't get value
  #006: H5ACmpio.c line 1836 in H5AC__rsp__dist_md_write__flush_to_min_clean(): Can't propagate and apply candidate list.
    major: Object cache
    minor: Unable to flush data from cache
  #007: H5ACmpio.c line 1284 in H5AC__propagate_and_apply_candidate_list(): Can't apply candidate list.
    major: Object cache
    minor: Internal error detected
  #008: H5Cmpio.c line 394 in H5C_apply_candidate_list(): flush candidates failed
    major: Object cache
    minor: Unable to flush data from cache
  #009: H5Cmpio.c line 1243 in H5C__flush_candidate_entries(): flush candidates in ring failed
    major: Object cache
    minor: Unable to flush data from cache
  #010: H5Cmpio.c line 1422 in H5C__flush_candidates_in_ring(): can't flush entry
    major: Object cache
    minor: Unable to flush data from cache
  #011: H5C.c line 6972 in H5C__flush_single_entry(): Can't write image to file
    major: Object cache
    minor: Unable to flush data from cache
  #012: H5Fio.c line 161 in H5F_block_write(): write through page buffer failed
    major: Low-level I/O
    minor: Write failed
  #013: H5PB.c line 1032 in H5PB_write(): write through metadata accumulator failed
    major: Page Buffering
    minor: Write failed
  #014: H5Faccum.c line 824 in H5F__accum_write(): file write failed
    major: Low-level I/O
    minor: Write failed
  #015: H5FDint.c line 249 in H5FD_write(): driver write request failed
    major: Virtual File Layer
    minor: Write failed
  #016: H5FDmpio.c line 1679 in H5FD_mpio_write(): file write failed
    major: Low-level I/O
    minor: Write failed
  #017: H5B.c line 994 in H5B__insert_helper(): can't insert subtree
    major: B-Tree node
    minor: Unable to insert object
  #018: H5B.c line 1095 in H5B__insert_helper(): unable to unprotect child
    major: B-Tree node
    minor: Unable to unprotect metadata
  #019: H5AC.c line 1776 in H5AC_unprotect(): Can't run sync point
    major: Object cache
    minor: Unable to flush data from cache
  #020: H5ACmpio.c line 2177 in H5AC__run_sync_point(): H5AC__rsp__dist_md_write__flush_to_min_clean() failed.
    major: Object cache
    minor: Can't get value
  #021: H5ACmpio.c line 1836 in H5AC__rsp__dist_md_write__flush_to_min_clean(): Can't propagate and apply candidate list.
    major: Object cache
    minor: Unable to flush data from cache
  #022: H5ACmpio.c line 1284 in H5AC__propagate_and_apply_candidate_list(): Can't apply candidate list.
    major: Object cache
    minor: Internal error detected
  #023: H5Cmpio.c line 394 in H5C_apply_candidate_list(): flush candidates failed
    major: Object cache
    minor: Unable to flush data from cache
  #024: H5Cmpio.c line 1243 in H5C__flush_candidate_entries(): flush candidates in ring failed
    major: Object cache
    minor: Unable to flush data from cache
  #025: H5Cmpio.c line 1422 in H5C__flush_candidates_in_ring(): can't flush entry
    major: Object cache
    minor: Unable to flush data from cache
  #026: H5C.c line 6972 in H5C__flush_single_entry(): Can't write image to file
    major: Object cache
    minor: Unable to flush data from cache
  #027: H5Fio.c line 161 in H5F_block_write(): write through page buffer failed
    major: Low-level I/O
    minor: Write failed
  #028: H5PB.c line 1032 in H5PB_write(): write through metadata accumulator failed
    major: Page Buffering
    minor: Write failed
  #029: H5Faccum.c line 824 in H5F__accum_write(): file write failed
    major: Low-level I/O
    minor: Write failed
  #030: H5FDint.c line 249 in H5FD_write(): driver write request failed
    major: Virtual File Layer
    minor: Write failed
  #031: H5FDmpio.c line 1679 in H5FD_mpio_write(): file write failed
    major: Low-level I/O
    minor: Write failed
terminate called after throwing an instance of 'std::runtime_error'
  what():  ### PARTHENON ERROR
  Message:     HDF5 failure: `H5Dwrite(gDSet, type, local_space, global_space, plist_xfer, data)`, Code: 0xffffffff
  File:        /mnt/home/glinesfo/code/parthenon/parthenon/src/outputs/parthenon_hdf5.hpp
  Line number: 131
[skl-152:34625] *** Process received signal ***
[skl-152:34625] Signal: Aborted (6)
[skl-152:34625] Signal code:  (-6)
[skl-152:34625] [ 0] /lib64/libpthread.so.0(+0xf630)[0x2adbaa61d630]
```
@brtnfld It is writing all meshblocks at once, but the chunk size is set to one meshblock. Each meshblock here is 32x32x32x1 doubles, but there are 44584 of them. The full data output is 11 GB.
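A quick back-of-the-envelope check of the numbers above (chunk = one 32x32x32x1 meshblock of doubles, 44584 meshblocks) confirms each chunk is nowhere near the 4 GiB chunk limit, while the total output matches the ~11 GB figure:

```python
# Sizes assumed from the numbers quoted in this thread.
nx1 = nx2 = nx3 = 32
nvar = 1
n_blocks = 44584
bytes_per_double = 8

chunk_bytes = nx1 * nx2 * nx3 * nvar * bytes_per_double
total_bytes = chunk_bytes * n_blocks

print(f"chunk size: {chunk_bytes / 2**10:.0f} KiB")        # 256 KiB per chunk
print(f"total size: {total_bytes / 2**30:.1f} GiB")        # ~10.9 GiB
print(f"under 4 GiB chunk limit: {chunk_bytes < 4 * 2**30}")
```

So the per-chunk size can't be the problem; only the total write is large.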
After some unrelated code changes and fixes to AthenaPK, I'm now getting segfaults while writing HDF5 without compression as well. This is on my forrestglines/cluster_agn_triggering branch of AthenaPK. I haven't reproduced this in just Parthenon, though.
This size of grid works:
This size of grid segfaults when writing.
To run, you need this file
I was able to reproduce the problem on skybridge, looking into the issue now.
Which type of MPI was this with, mpich or openmpi? I'm getting failures for MPI_Allgatherv in HDF5.
This was with OpenMPI/4.0.3 and HDF5/1.12.0
Can you try it with 4.1? It worked for me with 4.1, but not 4.0.
Actually, that was with a patch in HDF5; without the patch, 4.1 still fails.
Just to provide an update: we confirmed that this is an issue in the HDF5 library when compression is used in parallel. We are actively working on the fix.
That's good to hear. Thanks for the update, @epourmal !
Can you try it with hdf5_1_13_1 (or develop)? It has new optimizations for parallel compression, including memory improvements. With the example, I could get it to run on skybridge, but I still needed at least 64 ranks to avoid memory issues; that was also the case with just chunked datasets (no filters), though.
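For a rough sense of why more ranks help with memory, here is a hypothetical per-rank estimate assuming the ~44584 meshblocks of 32^3 doubles quoted earlier in the thread are distributed evenly (ignoring compression buffers and metadata overhead):

```python
# Hypothetical even split of the raw meshblock data across MPI ranks.
n_blocks = 44584
block_bytes = 32 ** 3 * 8  # one 32x32x32 meshblock of doubles = 256 KiB

for ranks in (1, 16, 64):
    per_rank = (n_blocks // ranks) * block_bytes
    print(f"{ranks:3d} ranks: ~{per_rank / 2**20:.0f} MiB of raw block data per rank")
```

At 64 ranks each rank holds only on the order of 170 MiB of raw data, which is consistent with 64 being roughly the point where memory stops being a problem.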
Do you have any updates on this issue? Do you know if it has been resolved?
I ran into an error running AthenaPK while testing a deep AMR hierarchy on a large grid with HDF5 compression turned on. After some testing I was able to reproduce the error with the advection example, although it required a larger grid. This input file will reproduce the segfault with the `advection-example` in Parthenon: parthinput.deepAMR.advection.txt. Note that the advection example takes a while before it starts writing the first HDF5 file.

The code segfaults while writing the HDF5 file in `parthenon_hdf5.cpp`, on a call to `HDF5WriteND`. HDF5/1.10.7 and HDF5/1.12.0 both segfault. The error seems to be related to both the size of the data to be written (around 1GB) and the deep AMR structure, since reducing the number of variables in AthenaPK with a certain mesh size could avoid the segfault, and using a flat mesh that occupied all available memory did not lead to the segfault.

Interestingly, when I run this same problem on 2 nodes instead of just one, the code doesn't segfault, but it does hang after writing only a portion of the dataset (134M), with a lock on the HDF5 file.