Auto File Management Part 2: Allow Grackle to search for automatically managed data files

This is a followup to PR #235 (it includes all the changes from #235 and should be reviewed afterwards)

Overview

PR #235 introduced a command-line tool, integrated into pygrackle, that manages Grackle's data files in a standardized location (in a way that is compatible with having multiple Grackle versions installed).

This PR makes it possible for the Grackle library to automatically look up a data file in this standard location, without specifying the full path. At the moment, this functionality is most useful when used with pygrackle. (A followup PR will make it easier to use this new functionality when pygrackle isn't installed.) The feature is controlled by the new grackle_data_file_options parameter and is disabled by default.

How this automatic lookup works

In this section, we discuss how this feature works when it is enabled (more on how to do that in the next section).

When this feature is enabled, the value of grackle_data_file is not treated as a path. Instead it should exactly specify the name of one of the data files shipped with Grackle (e.g. "CloudyData_UVB=FG2011.h5", "CloudyData_UVB=HM2012.h5", "cloudy_metals_2008_3D.h5").

When you invoke initialize_chemistry_data (and this functionality is enabled) grackle invokes the following search procedure:

First, it checks whether the name of the datafile exactly matches one of the standard data files shipped with the current version of Grackle.
- The list of filenames is automatically encoded in the C library at compile time, based on the list specified in the file_registry.txt file that was introduced in #235.
- If the string specified by grackle_data_file does not EXACTLY match a known file, then an error is reported. For safety reasons, if the user specifies a path to a data file, we reject it (e.g. "CloudyData_UVB=FG2011.h5" is ok but "path/to/CloudyData_UVB=FG2011.h5" is NOT).
Next, we determine the standard location where the datafiles should be stored. The C function that determines this location encodes the same logic as the corresponding python function that is used to manage the datafiles.
Finally, we construct the path to the file and ensure that the file has the expected contents.
- my big fear while implementing this was that I would make a mistake in some logic (either the python logic that manages the datafiles or the logic for finding the datafiles) and we would have grackle silently use the wrong datafile (invalidating users' results).
- as insurance, we validate the file against its known checksum. Earlier we mentioned that we encode the known filenames directly into the C library. At the same time, we also encode each file's known checksum (which is also listed in the aforementioned file_registry.txt file).
- to actually compute the checksum we use the functions provided by the open-source picohash C library. Since this "library" is just a single header file, we ship it as a part of Grackle.[^1] Whether the CMake build system or the classic build system is used, the functionality is included in grackle without any extra steps.

How to enable automatic lookup

To enable/disable this feature, you need to assign grackle_data_file_options a constant-value encoded by one of the following macros:

GR_DFOPT_FULLPATH_NO_CKSUM: In this case we assume that grackle_data_file encodes the full path to a file, and no checksum validation is performed. When no value is provided, we default to this case. This is the classic behavior.
GR_DFOPT_MANAGED: this unlocks the new functionality described in this PR. In the unlikely event that different grackle versions ship different versions of a datafile, we will always load the standard datafile contemporaneous with the current version of grackle.
GR_DFOPT_MANAGED_NO_CKSUM: does the same thing as GR_DFOPT_MANAGED, but skips the checksum calculation and validation. This is provided in case the user is working on a "fragile" parallel filesystem (like the one on Frontera) and wants to minimize the file system operations for some of their MPI processes.[^2]
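
As a compact summary of the three modes, here is a small Python sketch. The enum `DataFileOption` and the helper `describe` are hypothetical and exist only for this illustration; the enum values are placeholders and are not the values of the actual macros.

```python
from enum import Enum


class DataFileOption(Enum):
    # Member names mirror the macros; the values are placeholders, not Grackle's.
    FULLPATH_NO_CKSUM = "fullpath_no_cksum"
    MANAGED = "managed"
    MANAGED_NO_CKSUM = "managed_no_cksum"


def describe(option):
    """Return (name_is_full_path, checksum_is_verified) for an option."""
    if option is DataFileOption.FULLPATH_NO_CKSUM:
        return (True, False)   # classic behavior: a path, no validation
    if option is DataFileOption.MANAGED:
        return (False, True)   # registered name, checksum validated
    if option is DataFileOption.MANAGED_NO_CKSUM:
        return (False, False)  # registered name, validation skipped
    raise ValueError(f"unknown option: {option!r}")
```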

In pygrackle, these values are accessed through the new constants object. For example,

[^1]: I'm somewhat tempted to use this alternative library. To do that we would need to change the checksums from SHA-1 to SHA-256, but I think that would be fine. We also discuss making this change, for separate reasons, in #235.

[^2]: If the user decides to use this on all MPI ranks in place of GR_DFOPT_MANAGED, then they are accepting any risk associated with (hypothetical) bugs that could lead to reading the wrong file. (This is unlikely, but in this scenario, the blame lies entirely with the user.)
