Open mabruzzo opened 6 months ago
@brittonsmith, When you have time this is now ready for review.
Sorry this got so large.
I think this could create some minor conflicts with #177, but I'm happy to help resolve those (it should be very easy)
@mabruzzo, I will try to review this when I am back from holiday in a couple weeks. If you could resolve the current merger conflicts before then, that would be great.
I haven't quite gotten around to this yet (hopefully tomorrow or the day after)! There are a number of structural improvements that could be made. In particular #209 will simplify the API a lot!
To make this PR easier to review, I split off some of the changes into #205. That should be reviewed first.
Description
This PR introduces "tooling" to help with Diagnosing and Debugging Grackle errors.
Motivation
One of Grackle's greatest weaknesses is that it's really hard to diagnose and debug errors:
This PR introduces "tooling" to help address this challenge. We primarily introduce a C function to dump the internal state of the hdf5 file. Then we introduced convenience python functions (that we need anyway to test this functionality) to reproduce grackle's state from the hdf5 file and make it easy to try to use that to reproduce the issue.
This makes it easy for users to communicate enough information to get help and for the people who are actually trying to debug the problem.
Detailed Description
In this section I will provide some more details about the changes. See the website docs for a more detailed description of what they do and how to use them.
Essentially this PR seeks to introduce a new function to assist with debugging:
The idea here is simple: when you are working on reproducing a grackle-error, you can use this to dump grackle's state to an hdf5 file just before the error occurs. Specifically, the data is dumped to a newly created HDF5 (at the path specified by
fname
) OR within an existing hdf5 file (at a location specified bydest_hid
).To properly dump the state of
grackle_field_data
, we need people to initialize unused fields toNULL
. This is accomplished by a new function called:The idea is to have users immediately call this function right after they initialize a new
grackle_field_data
instance, analogous to the way thatlocal_initialize_chemistry_parameters
is used.[^1]I also added some python functions to help load in the dumps and use them for debugging
Stability Concerns
I'm very confident about the internal changes to Grackle. They may seem a little complicated, but I think they are robust and provide a lot of flexibility for changing things in the future.
With that said, I have some concerns about stabilizing 2 aspects of this PR. In particular I'm worried about:
grunstable_h5dump_state
.code_units
state) and could definitely be improved. For example, maybe we have people create a dump right after initialization and then dump again when they think the problem occurs. Or, if we introduce on the changes suggested in #198, this could be simplified a lot more.If we needed to guarantee stability, I'm not sure I would ever feel confident proposing these changes. Furthermore, I think these changes would be useful to people, right now, even without a stability guarantee (people don't need to call
grunstable_h5dump_state
in the mainline version of their simulation code -- there is no need to create these dumps outside of debugging purposes).Thus, my proposed compromise is to integrate these things into Grackle as unstable features. I've tried to communicate that in the docs.
To make it extra clear that
grunstable_h5dump_state
isn't stable, I have:grackle_unstable.h
public-header (a downstream application would need to explicitly include this header in addition to the regulargrackle.h
header189 proposes that we adopt a standard convention that all new functions in the stable public API share a common prefix (
gr_
orgrackle_
)grunstable_
under the assumption that we'll start prefixing all public components of the stable-public API withgr_
grunstable_
prefix would need to changeOther changes
This includes a few other changes:
chemistry_data
,code_units
, andgrackle_version
structs to stdout (when running grackle in VERBOSE mode), have been refactored to use similar functionality to the hdf5-dumping machineryFuture Ideas
It might be nice to ship Grackle with a simple C/C++ program to help with diagnosing issues. More details are provided below the fold.
More about a C/C++ diagnostic tool
In the simplest form, the program - would load in these dumps and then executes one of the grackle-calculation functions - would need to provide a way to override the `grackle_data_file` variable - if we were willing to give this program access to Grackle's internals, only a few minor tweaks would be necessary to easily reconstruct Grackle's state from the hdf5 dumps (this approach would be extremely maintainable as we add new fields and grackle-parameters) Currently, the only way to load in the internal state is with the new python functions. This C/C++ program would be a useful addition because: - it would make reproducing errors more convenient - If you successfully compile grackle, then you basically get this for free - In comparison, it can be a little tricky to install pygrackle (hopefully it gets easier). Plus, a user may not want to install pygrackle on a cluster-environment - we could more easily reproduce all the conditions where the error first occurred - currently pygrackle doesn't support all features of regular grackle (hopefully this gets better) - if we start distributing pygrackle on PYPI or conda-forge, there is a much greater likelihood that people will build their simulation and pygrackle with different grackle versions - as we start adding gpu support, I imagine that getting python to play nicely with the extended libraries could be tricky... - it would be easier to invoke a debugger session with something like `gdb` There are also some fairly cool things you could extend this tool to do: - you could add in the ability to load in a small slice of the data-dump - you could support dumping what was loaded to a new file (this could help with producing more rigorous tests of the debugging functionality AND could help cut out field data where there isn't any problem) - you could support a mode that searches index-by-index for the problematic data - you could also use this thing for performance benchmarking (since we could easily feed it arbitrary data and don't have to worry about of a python interpreter)[^1]: The particular change can be separately reviewed as part of PR #205. [^2]: In particular, if people serialize just the
chemistry_data
contents in a standardized way, we could support some cool things withyt
analysis (building on existing functionality in pygrackle, or even adding some basic things directly toyt
without needingpygrackle
) [^3]: If I'm feeling motivated, I might introduce a separate pull-request to try to introduce some of this functionality (but I probably won't get to it).