grackle-project / grackle

The Grackle chemistry and cooling library for astrophysical simulations and models.
Other
26 stars 50 forks source link

State-Dumping Debug Tool #197

Open mabruzzo opened 6 months ago

mabruzzo commented 6 months ago

To make this PR easier to review, I split off some of the changes into #205. That should be reviewed first.

Description

This PR introduces "tooling" to help with Diagnosing and Debugging Grackle errors.

Motivation

One of Grackle's greatest weaknesses is that it's really hard to diagnose and debug errors:

This PR introduces "tooling" to help address this challenge. We primarily introduce a C function to dump the internal state of the hdf5 file. Then we introduced convenience python functions (that we need anyway to test this functionality) to reproduce grackle's state from the hdf5 file and make it easy to try to use that to reproduce the issue.

This makes it easy for users to communicate enough information to get help and for the people who are actually trying to debug the problem.

Detailed Description

In this section I will provide some more details about the changes. See the website docs for a more detailed description of what they do and how to use them.

Essentially this PR seeks to introduce a new function to assist with debugging:

int grunstable_h5dump_state(const char* fname, long long dest_hid,
                            const chemistry_data* chemistry_data,
                            const code_units* initial_code_units,
                            const code_units* current_code_units,
                            const grackle_field_data* my_fields);

The idea here is simple: when you are working on reproducing a grackle-error, you can use this to dump grackle's state to an hdf5 file just before the error occurs. Specifically, the data is dumped to a newly created HDF5 (at the path specified by fname) OR within an existing hdf5 file (at a location specified by dest_hid).

To properly dump the state of grackle_field_data, we need people to initialize unused fields to NULL. This is accomplished by a new function called:

int gr_initialize_field_data(grackle_field_data *my_fields);

The idea is to have users immediately call this function right after they initialize a new grackle_field_data instance, analogous to the way that local_initialize_chemistry_parameters is used.[^1]

I also added some python functions to help load in the dumps and use them for debugging

Stability Concerns

I'm very confident about the internal changes to Grackle. They may seem a little complicated, but I think they are robust and provide a lot of flexibility for changing things in the future.

With that said, I have some concerns about stabilizing 2 aspects of this PR. In particular I'm worried about:

If we needed to guarantee stability, I'm not sure I would ever feel confident proposing these changes. Furthermore, I think these changes would be useful to people, right now, even without a stability guarantee (people don't need to call grunstable_h5dump_state in the mainline version of their simulation code -- there is no need to create these dumps outside of debugging purposes).

Thus, my proposed compromise is to integrate these things into Grackle as unstable features. I've tried to communicate that in the docs.

To make it extra clear that grunstable_h5dump_state isn't stable, I have:

Other changes

This includes a few other changes:

Future Ideas

It might be nice to ship Grackle with a simple C/C++ program to help with diagnosing issues. More details are provided below the fold.

More about a C/C++ diagnostic tool In the simplest form, the program - would load in these dumps and then executes one of the grackle-calculation functions - would need to provide a way to override the `grackle_data_file` variable - if we were willing to give this program access to Grackle's internals, only a few minor tweaks would be necessary to easily reconstruct Grackle's state from the hdf5 dumps (this approach would be extremely maintainable as we add new fields and grackle-parameters) Currently, the only way to load in the internal state is with the new python functions. This C/C++ program would be a useful addition because: - it would make reproducing errors more convenient - If you successfully compile grackle, then you basically get this for free - In comparison, it can be a little tricky to install pygrackle (hopefully it gets easier). Plus, a user may not want to install pygrackle on a cluster-environment - we could more easily reproduce all the conditions where the error first occurred - currently pygrackle doesn't support all features of regular grackle (hopefully this gets better) - if we start distributing pygrackle on PYPI or conda-forge, there is a much greater likelihood that people will build their simulation and pygrackle with different grackle versions - as we start adding gpu support, I imagine that getting python to play nicely with the extended libraries could be tricky... - it would be easier to invoke a debugger session with something like `gdb` There are also some fairly cool things you could extend this tool to do: - you could add in the ability to load in a small slice of the data-dump - you could support dumping what was loaded to a new file (this could help with producing more rigorous tests of the debugging functionality AND could help cut out field data where there isn't any problem) - you could support a mode that searches index-by-index for the problematic data - you could also use this thing for performance benchmarking (since we could easily feed it arbitrary data and don't have to worry about of a python interpreter)

[^1]: The particular change can be separately reviewed as part of PR #205. [^2]: In particular, if people serialize just the chemistry_data contents in a standardized way, we could support some cool things with yt analysis (building on existing functionality in pygrackle, or even adding some basic things directly to yt without needing pygrackle) [^3]: If I'm feeling motivated, I might introduce a separate pull-request to try to introduce some of this functionality (but I probably won't get to it).

mabruzzo commented 5 months ago

@brittonsmith, When you have time this is now ready for review.

Sorry this got so large.

I think this could create some minor conflicts with #177, but I'm happy to help resolve those (it should be very easy)

brittonsmith commented 3 months ago

@mabruzzo, I will try to review this when I am back from holiday in a couple weeks. If you could resolve the current merger conflicts before then, that would be great.

mabruzzo commented 3 months ago

I haven't quite gotten around to this yet (hopefully tomorrow or the day after)! There are a number of structural improvements that could be made. In particular #209 will simplify the API a lot!