grackle-project / grackle

The Grackle chemistry and cooling library for astrophysical simulations and models.

Auto File Management Part 1: Introducing a Datafile Management Tool #235

Open mabruzzo opened 2 weeks ago

mabruzzo commented 2 weeks ago

This PR introduces a datafile management tool. A lot of this information is also stated in the documentation.

Brief Description

Currently, this management tool is only accessible through pygrackle. To invoke it you would use

python -m pygrackle COMMAND [ARGS...]

where COMMAND is usually one of the following:

The ultimate goal is to make it possible for the Grackle library itself to directly access these files.[^1] I'm happy to talk a little more about what this might look like down below.

Motivation

Why does this tool exist? Datafiles are required by ANY non-trivial program (e.g. a simulation code or Python script) that invokes Grackle.

It is instructive to consider the historic experience of an end-user of one of these programs. To build Grackle, they would typically clone the git repository for Grackle (including the data files). To invoke their program, they would manually specify the path to the downloaded data file. Frankly, this doesn't seem so bad; the manual intervention is a minor inconvenience at worst. While it would be nice to eliminate the manual intervention, it doesn't seem to warrant development of a special tool.

Indeed, this is all true. Users who like this workflow can continue using it. However, this manual management of datafiles becomes problematic in any use-case that is marginally more complex. There are 3 considerations worth highlighting:

  1. Portability: Currently, there is no out-of-the-box way to take a program that uses Grackle and is configured for one machine and run it on another machine without manual intervention.

    • If there are differences in how the machines are set up (e.g. where the data files are placed), the paths to the Grackle data file(s) need to be updated. This is relevant if you want to use a Pygrackle script on a different machine or if you want to use a configuration script to rerun a simulation (involving Grackle) on a different machine.

    • This is particularly noteworthy when it comes to automated testing!

      • For example, Pygrackle currently assumes that it was installed as an editable installation in order to run some of the test suite. After this PR, that assumption is no longer necessary. The introduction of this tool also makes it easier to run the Python examples.[^2]

      • The test suite of Enzo-E is another example: extra bookkeeping is required for all test problems that invoke Grackle.

  2. If the Grackle repository isn't present:

    • This includes the case where a user deletes the repository after installing Grackle (we don't really care about this case).

    • It is more important to consider the case where users install programs that use Grackle without downloading the repository (or where the repository is downloaded without the user's knowledge). This will become increasingly common as we make Pygrackle easier to install[^3]. This is also plausible for CMake builds of downstream projects that embed Grackle compilation as part of their build.

  3. Having multiple Grackle versions installed: This is going to be increasingly common as we make Pygrackle easier to install. Users have 2 existing options in this case: (i) they maintain separate repositories of data files for each version or (ii) they assume that they can just use the newest version of the data-file repository. The latter assumption has historically held (and will probably continue to hold). But it could conceivably lead to people unintentionally using a data file created for a newer version of Grackle. (While this likely won't be a problem, users should probably be explicitly aware that they are doing this on the off-chance that problems do arise.)

This tool is a first step toward addressing these cases.

Currently the tool just works for Pygrackle. But, as I noted before, my plan is to introduce functionality to let the Grackle C layer take advantage of how this tool organizes data files.

How it works

Fundamentally, the data management system manages a data store. We will return to that in a moment.

Protocol Version

This internal logic has an associated protocol version (you can query this via the --version-protocol flag). The logic may change between protocol versions. The protocol version will change very rarely (if it ever changes at all).

Data Directory

This is simply the directory that holds all Grackle data. Its path is given by the GRACKLE_DATA_DIR environment variable, if that variable is set. Otherwise it defaults to the operating system's recommended location for user-site data.

This contains several entries including the:

Outside of the user-data directory, users should not modify/create/delete any files within the data directory (unless the tool instructs them to).
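The lookup rule for the data directory can be sketched as follows. This is a minimal illustration, not the code in this PR: the function name `resolve_data_dir`, the `grackle` subdirectory name, and the per-platform fallback locations (the XDG convention on Linux, `~/Library/Application Support` on macOS) are all assumptions made for the example, and Windows is not handled.

```python
import os
import sys
from pathlib import Path


def resolve_data_dir(environ=None, platform=None, home=None):
    """Return the data directory: GRACKLE_DATA_DIR if set, else a per-user default.

    Hypothetical sketch. The fallback locations and the "grackle"
    subdirectory name are assumptions, not the tool's actual behavior.
    """
    environ = os.environ if environ is None else environ
    platform = sys.platform if platform is None else platform
    home = Path.home() if home is None else Path(home)

    # The environment variable always takes precedence when it is set.
    override = environ.get("GRACKLE_DATA_DIR")
    if override:
        return Path(override)

    if platform == "darwin":
        base = home / "Library" / "Application Support"
    else:
        # XDG convention used on Linux and most other Unix-like systems.
        xdg = environ.get("XDG_DATA_HOME")
        base = Path(xdg) if xdg else home / ".local" / "share"
    return base / "grackle"
```

Passing `environ`, `platform`, and `home` explicitly is only there to make the rule easy to exercise in tests; calling `resolve_data_dir()` with no arguments uses the real environment.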

Data Store

This is where we track the data files managed by this system. This holds a directory called object-store and 1 or more "version-directories".

The primary-representation of each file is tracked within the object-store subdirectory.

Each version-directory is named after a Grackle version (NOT a Pygrackle version).

When a program outside of this tool accesses a data file, it will ONLY access the references in the version-directory that shares its name with the version of Grackle that the program is linked against.
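As a rough illustration of that rule, a consumer could locate a file like this. The helper name `find_datafile`, the version string, and the file name in the usage note are hypothetical; the only thing taken from the description above is that each version-directory sits directly inside the data store and is named after a Grackle version.

```python
from pathlib import Path


def find_datafile(data_store, grackle_version, filename):
    """Return the path of `filename` for one Grackle version, or raise.

    Hypothetical helper: only the version-directory matching
    `grackle_version` is consulted; other versions are never searched.
    """
    candidate = Path(data_store) / grackle_version / filename
    if not candidate.is_file():
        raise FileNotFoundError(
            f"{filename!r} is not installed for Grackle {grackle_version}"
        )
    return candidate
```

For example, `find_datafile(store, "3.4.0", "example.h5")` would fail even if `example.h5` exists under a `3.3.0` directory, which is what keeps a program from silently picking up files meant for a different version.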

This tool makes use of references and the object-store to effectively deduplicate data. Whenever this tool deletes a "data-file" reference, it will also delete the corresponding file from the object-store if it has no other references. I chose to implement references as "hard links" in order to make it easy to determine when a file in object-store has no remaining references.
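The hard-link approach can be sketched as below. This is an illustration under stated assumptions rather than the tool's code: the function names are hypothetical, and naming each object-store entry after its SHA-1 checksum is an assumption made for the example. The key idea is that the filesystem's link count (`st_nlink`) tells us whether any reference still points at a stored object.

```python
import hashlib
import os
import shutil
from pathlib import Path


def add_reference(src, data_store, version, name):
    """Store `src` once in object-store and hard-link it into a version directory."""
    data_store = Path(data_store)
    digest = hashlib.sha1(Path(src).read_bytes()).hexdigest()
    # Assumption: one object per checksum, so identical files share storage.
    obj = data_store / "object-store" / digest
    obj.parent.mkdir(parents=True, exist_ok=True)
    if not obj.exists():
        shutil.copyfile(src, obj)
    ref = data_store / version / name
    ref.parent.mkdir(parents=True, exist_ok=True)
    os.link(obj, ref)  # both names now refer to the same file on disk
    return ref


def remove_reference(ref, data_store):
    """Delete one reference; drop the stored object if nothing else links to it."""
    ref = Path(ref)
    object_dir = Path(data_store) / "object-store"
    # Find the object that is the same underlying file as this reference.
    obj = next(p for p in object_dir.iterdir() if os.path.samefile(p, ref))
    ref.unlink()
    if obj.stat().st_nlink == 1:  # only the object-store entry itself remains
        obj.unlink()
```

With this layout, adding the same file under two version-directories consumes disk space only once, and the stored object disappears only when the last reference is removed.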

Closing Thoughts

I'm totally open to any feedback!

A lot of the complexity here comes from the fact that I made this tool deduplicate files. For reference, we currently ship 25 MB of datafiles with Grackle. That starts to add up if you have 3 or 4 copies of the files floating around (presumably, the number of datafiles that we ship will only increase in the future). If you don't think it's necessary, we can drop this functionality.

One thing to think about is the choice of hash algorithm. Currently we use SHA-1. But maybe we should use SHA-256 or something else instead? (The Git developers plan to transition Git to SHA-256 because SHA-1 is no longer considered collision-resistant.)
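For what it's worth, switching algorithms is cheap if the checksum code takes the algorithm name as a parameter. The sketch below uses Python's standard `hashlib.new`; the function name `file_checksum` and its defaults are hypothetical and not taken from this PR.

```python
import hashlib


def file_checksum(path, algorithm="sha1", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks.

    Hypothetical helper: `algorithm` is any name accepted by hashlib.new,
    e.g. "sha1" or "sha256".
    """
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read fixed-size chunks so large datafiles never sit fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

The main practical difference is digest length (40 hex characters for SHA-1, 64 for SHA-256), which matters if the checksum is used in file names or stored in a registry.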

One relevant point here is the functionality I plan to add to the grackle library:

[^1]: I actually got quite far and plan to introduce that as a followup PR, but I underestimated just how many lines of code that would take (specifically, writing a C version of the 15-line _get_data_dir function turned into a few hundred lines).

[^2]: In the future, I hope to also make it easier to run the code examples.

[^3]: For example, once GH-#208 is merged, you will be able to instruct pip to install pygrackle by just specifying the URL of the GitHub repository.

[^4]: It was only 10 or 12 years after Git was created that the developers started worrying about collisions (and they are primarily concerned with intentional collisions from malicious actors).