This PR introduces a datafile management tool. A lot of this information is also stated in the documentation.
Brief Description
Currently, this management tool is only accessible through `pygrackle`. To invoke it, you would use `python -m pygrackle COMMAND [ARGS...]`, where `COMMAND` is usually one of the following:
- `fetch`: fetch all data files
- `ls-versions`: list the Grackle versions for which we have local copies of the data files
- `rm`: remove the data files associated with a version (or all data files)
- `getpath`: list the path to the directory containing all of the Grackle data
- `calcreg`: calculate an updated registry of file names and checksums to be used in future versions of this tool (basically, you would call it any time the data files change)
The ultimate goal is to make it possible for the Grackle library itself to directly access these files.[^1] I'm happy to talk a little more about what this might look like down below.
Motivation
Why does this tool exist? Datafiles are required by ANY non-trivial program (e.g. a simulation code or Python script) that invokes Grackle.
It is instructive to consider the historic experience of an end-user of one of these programs. To build Grackle, they would typically clone the git repository for Grackle (including the data files). To invoke their program, they would manually specify the path to the downloaded data file. Frankly, this doesn't seem so bad; the manual intervention is a minor inconvenience, at worst. While it would be nice to eliminate the manual intervention, it doesn't seem to warrant development of a special tool.
Indeed, this is all true. Users who like this workflow can continue using it. However, manual management of datafiles becomes problematic in any use case that is marginally more complex. There are 3 considerations worth highlighting:
Portability: Currently, there is no out-of-the-box way for a program using Grackle that was configured to run on one computer to run on another machine without manual intervention.
If there are differences in how the machines are set up (e.g. where the data files are placed), the paths to the Grackle data file(s) need to be updated. This is relevant if you want to use a Pygrackle script on a different machine or if you want to use a configuration script to rerun a simulation (involving Grackle) on a different machine.
This is particularly noteworthy when it comes to automated testing!
For example, right now Pygrackle assumes that it was installed as an editable installation in order to run some of the test suite. After this PR, this assumption is no longer necessary. The introduction of this tool also makes it easier to run the python examples.[^2]
The Enzo-E test suite is another example, where extra book-keeping is required for all test problems that invoke Grackle.
If the Grackle repository isn't present:
This includes the case where a user deletes the repository after installing Grackle. (we don't really care about this case)
It is more important to consider the case where users are installing programs that use Grackle without downloading the repository (or, even if the repository is downloaded, it is done so without the user's knowledge). This will become increasingly common as we make Pygrackle easier to install[^3]. This is also plausible for cmake-builds of downstream projects that embed Grackle compilation as part of their build.
Having multiple Grackle versions installed: This is going to be increasingly common as we make Pygrackle easier to install. Users have 2 existing options in this case: (i) maintain separate repositories of data files for each version, or (ii) assume that they can just use the newest version of the data-file repository. The latter option has historically worked (and will probably continue to work). But it could conceivably lead to cases where people unintentionally use a data file created for a newer version of Grackle. (While this likely won't be a problem, users should probably be explicitly aware that they are doing this, on the off-chance that problems do arise.)
This tool is a first step to addressing these cases.
Currently the tool just works for Pygrackle. But, as I noted before, my plan is to introduce functionality to let the Grackle C layer take advantage of how this tool organizes data files.
How it works
Fundamentally, the data management system manages a data store. We will return to that in a moment.
Protocol Version
The tool's internal logic has an associated protocol version (you can query it via the `--version-protocol` flag). The logic may change between protocol versions, but the protocol version will change very rarely (if it ever changes at all).
Data Directory
This is the directory that holds all Grackle data. Its path is given by the `GRACKLE_DATA_DIR` environment variable, if it is set. Otherwise, it defaults to the operating system's recommended location for user site data.
This directory contains several entries, including:
a user-data directory. This directory isn't used yet, but it is reserved for users to put custom data files in the future.
a tmp directory (used by the data-management tool)
sometimes, a lockfile (used to ensure that multiple instances of this tool aren't running at once)
the data-store directory(ies). Each is named `data-store-v<PROTOCOL-VERSION>` so that earlier versions of this tool will continue to function if we ever change the protocol. (Each of these directories is completely independent of the others.)
Outside of the user-data directory, users should not modify/create/delete any files within the data directory (unless the tool instructs them to).
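The lookup order described above can be sketched in a few lines of Python. This mirrors the spirit of the tool's `_get_data_dir` logic but is not its actual implementation; the real per-OS defaults may differ in their details:

```python
import os
import pathlib
import sys

def get_data_dir():
    """Resolve the Grackle data directory (a sketch, not the real tool).

    The GRACKLE_DATA_DIR environment variable always wins; otherwise we
    fall back to a per-OS guess at the user-site-data location.
    """
    env_path = os.environ.get("GRACKLE_DATA_DIR")
    if env_path:
        return pathlib.Path(env_path)
    home = pathlib.Path.home()
    if sys.platform == "darwin":
        return home / "Library" / "Application Support" / "grackle"
    elif sys.platform.startswith("win"):
        base = os.environ.get("APPDATA", home / "AppData" / "Roaming")
        return pathlib.Path(base) / "grackle"
    # Linux and others: follow the XDG base-directory convention
    base = os.environ.get("XDG_DATA_HOME", home / ".local" / "share")
    return pathlib.Path(base) / "grackle"
```

In practice a library like `platformdirs` handles these per-OS conventions more carefully; the point is just that the environment variable overrides the default.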
Data Store
This is where we track the data files managed by this system. It holds a directory called `object-store` and 1 or more "version directories".
The primary representation of each file is tracked within the `object-store` subdirectory.
The name of each item in this directory is a unique key: the file's SHA-1 checksum.
Git internally tracks objects in a very similar way (they have historically used SHA-1 checksums as unique keys). The chance of an accidental collision in the checksum in a large Git repository is extremely tiny.[^4]
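Concretely, the content-addressing could look like the following sketch. The function name is hypothetical, and the tool's real key derivation may differ in details (e.g. whether a header is hashed along with the raw bytes, as git does):

```python
import hashlib

def object_store_key(path):
    """Derive an object-store key from a file's contents (a sketch).

    The key is the SHA-1 digest of the file's raw bytes, read in
    chunks so large data files don't need to fit in memory.
    """
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()
```

Two files with identical contents always map to the same key, which is exactly what makes deduplication possible.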
Each version-directory is named after a Grackle version (NOT a Pygrackle version).
A given version directory holds data-file references.
Each reference carries the name that the corresponding data file had when it shipped with the Grackle version matching the directory's name.
Each reference is linked to the corresponding file in the `object-store`.
When a program outside of this tool accesses a data file, it will ONLY access the references in the version directory whose name matches the version of Grackle that the program is linked against.
This tool uses references and the `object-store` to deduplicate data. Whenever this tool deletes a data-file reference, it also deletes the corresponding file from the `object-store` if that file has no other references. I chose to implement references as hard links in order to make it easy to determine when a file in the `object-store` has no remaining references.
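The hard-link bookkeeping can be demonstrated directly with the standard library. All directory and file names below are hypothetical stand-ins, not the tool's real layout:

```python
import os
import pathlib
import tempfile

# Build a toy data store: an object-store plus one version directory.
root = pathlib.Path(tempfile.mkdtemp())
object_store = root / "object-store"
version_dir = root / "3.3.0"  # a hypothetical Grackle version
object_store.mkdir()
version_dir.mkdir()

# Store an object (named by a stand-in for its SHA-1 key), then create
# a version-directory reference as a hard link to it.
obj = object_store / "aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d"
obj.write_bytes(b"stand-in for a data file's contents")
ref = version_dir / "example_rates.h5"  # hypothetical data-file name
os.link(obj, ref)
assert os.stat(obj).st_nlink == 2  # the object plus one reference

# Removing a reference: if the object-store file is then the only
# remaining link (count back to 1), garbage-collect it too.
os.unlink(ref)
if os.stat(obj).st_nlink == 1:
    os.unlink(obj)
```

Because hard links share one inode, the filesystem's link count (`st_nlink`) gives the reference count for free; no separate bookkeeping file is needed.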
Closing Thoughts
I'm totally open to any feedback!
A lot of the complexity here comes from the fact that I made this tool deduplicate files. For reference, we currently ship 25 MB of datafiles with Grackle. That starts to add up if you have 3 or 4 copies of the files floating around (and presumably, the number of datafiles that we ship will only increase in the future). If you don't think it's necessary, we can drop this functionality.
One thing to think about is the choice of hash algorithm. Currently we use SHA-1, but maybe we should use SHA-256 or something else? (The git developers plan to transition git to SHA-256, since it is cryptographically secure.)
One relevant point here is the functionality I plan to add to the grackle library:
I plan to add a new parameter called data_file_handling.
The default value will have Grackle interpret grackle_data_file as it always has.
A different value (maybe 1) will instruct Grackle to search for the file specified by grackle_data_file within the directories managed by this grdata tool.
I toyed around with this idea for a while, and I was always worried that I could make some mistake that could break somebody's simulations.
The compromise that I came to is that Grackle will internally track the checksums of the datafiles that it ships with (we will automatically embed this info), and we will only allow people to use the specified file in this special manner if it matches the checksum.
To compute the checksum, we would probably use picohash, a tiny, portable, public-domain library. Using SHA-256 might be a little slower than SHA-1.
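The gating logic could look something like the following sketch. The function and registry names are hypothetical, and the real check would happen in the C layer (e.g. on top of picohash), but the control flow is the same:

```python
import hashlib

def matches_registry(name, path, registry):
    """Check a file against an embedded checksum registry (a sketch).

    Returns True only if the file at `path` has the SHA-256 digest
    recorded for `name` in `registry`; anything else (unknown name or
    mismatched contents) is rejected, so the special file-search mode
    can never silently pick up a tampered or wrong-version file.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return registry.get(name) == digest.hexdigest()
```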
[^1]: I actually got quite far and plan to introduce that as a followup PR, but I underestimated just how many lines of code that would take (specifically, writing a C version of the 15 line _get_data_dir function turned into a few hundred lines).
[^2]: In the future, I hope to also make it easier to run the code examples.
[^3]: For example, once GH-#208 is merged, you will be able to instruct pip to install pygrackle by just specifying the URL of the GitHub repository
[^4]: It was only 10 or 12 years after Git was created that the developers started worrying about collisions (and they are primarily concerned with intentional collisions from malicious actors).