openPMD / openPMD-api

:floppy_disk: C++ & Python API for Scientific I/O
https://openpmd-api.readthedocs.io
GNU Lesser General Public License v3.0

HDF5 files are confusing #1468

Open eschnett opened 1 year ago

eschnett commented 1 year ago

We looked at an HDF5 file that was generated by openPMD and were very confused that there were so many zeros in the file (see https://github.com/eschnett/CarpetX/issues/152), suspecting a bug in our code. The "solution" was that openPMD extends all datasets to cover the whole domain and fills in the undefined regions with zeros.

I don't think that using the value 0 to mark undefined points is a good choice. It makes the resulting HDF5 files confusing at best. I recommend using H5Pset_fill_value() to fill undefined points with a different value (NaN?).
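The effect being suggested can be sketched with h5py, whose `fillvalue=` keyword maps to `H5Pset_fill_value()` underneath (file and dataset names here are made up for illustration, not taken from openPMD output):

```python
import numpy as np
import h5py

with h5py.File("demo_fill.h5", "w") as f:
    # Allocate the full domain, but mark undefined points with NaN instead of 0.
    ds = f.create_dataset("E/x", shape=(8, 8), dtype="f8", fillvalue=np.nan)
    ds[2:4, 2:4] = 1.0  # only this patch is actually written

with h5py.File("demo_fill.h5", "r") as f:
    data = f["E/x"][...]
    # Unwritten points read back as NaN, clearly distinguishable from real zeros.
    print(int(np.isnan(data).sum()))  # 60 of the 64 points are undefined
```

With a NaN fill value, a reader can immediately tell undefined regions from legitimately zero-valued data.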

eschnett commented 1 year ago

Note that this was a dataset with mesh refinement data so that refined levels contain a lot of zeros. However, even on the coarse grid (that covers the whole domain) we did not write boundary data, and those undefined points were turned into zeros.

franzpoeschel commented 1 year ago

Reference to the mesh refinement PR of the openPMD standard: https://github.com/openPMD/openPMD-standard/pull/252/files

If the implemented file format supports sparse datasets, e.g. through efficient chunking of patches, the refined level must cover the previous level in extent and store multiple patches through its chunking mechanism.

File formats that do not support efficient storage of a sparsely populated refinement level can store contiguous patches on the same level with an additional suffix _<P>, where <P> is the number of the (hyperrectangular) patch in the refinement level.

The implementation of mesh refinement is more difficult in file backends whose data representations have no good native support for sparse datasets. Unfortunately, HDF5 is one of them.

openPMD itself does not even specify a fill value (in fact, we set H5Pset_fill_time(datasetCreationProperty, H5D_FILL_TIME_NEVER)); the zero filling is something that HDF5 does on its own.
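That property-list call can be reproduced through h5py's low-level bindings, which wrap the same HDF5 C functions; this is a sketch of the setting, not openPMD-api's actual code (file and dataset names are made up):

```python
import h5py

with h5py.File("demo_never.h5", "w") as f:
    # Dataset creation property list with the fill time disabled,
    # i.e. the h5py equivalent of
    # H5Pset_fill_time(datasetCreationProperty, H5D_FILL_TIME_NEVER).
    dcpl = h5py.h5p.create(h5py.h5p.DATASET_CREATE)
    dcpl.set_fill_time(h5py.h5d.FILL_TIME_NEVER)
    space = h5py.h5s.create_simple((4, 4))
    h5py.h5d.create(f.id, b"data", h5py.h5t.NATIVE_DOUBLE, space, dcpl)
```

With FILL_TIME_NEVER, HDF5 never writes a fill value at all; unwritten regions simply contain whatever bytes the allocated storage happens to hold, which in practice often reads back as zeros.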

It should not be too difficult to add an option that lets users specify a custom fill value. However, I doubt that it would really help your use case, as you would still get n refinement levels that each use disk space corresponding to the full domain. (EDIT: I think that chunking can avoid this?) Also, HDF5 has no straightforward way to recover which regions were actually written.
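The chunking remark in the EDIT can be checked directly: for a chunked dataset, HDF5 only allocates storage for chunks that are actually written, so a sparsely populated level need not consume disk space for the whole domain. A small h5py sketch (names and sizes invented for illustration):

```python
import numpy as np
import h5py

N = 1024
with h5py.File("demo_chunks.h5", "w") as f:
    # Chunked dataset covering the full domain, with a NaN fill value.
    ds = f.create_dataset("level1", shape=(N, N), dtype="f8",
                          chunks=(64, 64), fillvalue=np.nan)
    ds[0:64, 0:64] = 1.0  # write exactly one 64x64 chunk

with h5py.File("demo_chunks.h5", "r") as f:
    ds = f["level1"]
    stored = ds.id.get_storage_size()  # bytes actually allocated on disk
    full = N * N * 8                   # bytes a fully allocated domain would need
    print(stored, full)
```

Here `stored` is far smaller than `full`, because only the one written chunk occupies disk space. Recovering *which* regions were written, however, still requires iterating over allocated chunks, which is the part HDF5 makes awkward.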