Content addressable storage (CAS)

wlandau commented 2 months ago

c.f. #1232

Content addressable storage (CAS) is a type of storage system more amenable to portability and collaboration than what targets currently uses. In CAS, the name of each object is its hash, and there is a mapping from human-friendly target names to these hashes. CAS would allow the actual data to be stored centrally rather than locally, and it would let multiple pipelines leverage each other’s results.

Posit Conf gave me much think about regarding CAS. Many users (many more than I originally thought) would benefit from better native support to write custom third-party CAS systems. I have realized that the approach in https://github.com/ropensci/targets/discussions/1232#discussioncomment-10277319 is difficult for users to implement.

I don’t know exactly how to go forward with this at the moment. However, I can state a few goals of a heavy-handed CAS:

Users should be able to declare how to upload an object, download an object, and check if an object exists. Only the hash of the object should be needed as input.
{targets} should be able to turn this input info into a CAS without needing the “pointer files” in _targets/objects mentioned in #1232.
There should be some kind optional “list” step at the beginning to make existence checking fast (e.g. with a LIST request in the case of AWS S3).

I am not sure the above would actually fit well enough into the design of targets.

For a lightweight CAS, a custom tar_format() could be the vehicle, but with some support that avoids the need for users to micromanage key files and their hashes.

I have not decided on a direction yet.

wlandau commented 2 months ago

FYI @noamross, @jaredlander

wlandau commented 2 months ago

Conceptually, CAS fits better with the “repository” setting than the “format” setting. So maybe there needs to be a new tar_repository() function as an entry point for “repository” just as tar_format() is an entry point for “format”.

noamross commented 2 months ago

This is a bit of a brain dump of ideas in this area. These are far-reaching wishes but they are things that address real issues my team ran into, so I hope a design would allow for these extensions:

One could point to multiple CAS repositories for project, with the system checking for the right hash across them before building locally. For instance, a project might be published with a read-only CAS repository, which anyone can draw from, but as one modifies a forked repository you write stuff to a local or team repository. The repository itself could be an S3 bucket, files on a static site, attached to a GitHub release or in a scientific repository, so one might want to allow different plug-ins for the actual repository type for different transfer protocols, and to designate as read-only.
- It seems that one would need to define read, write, and LIST for each back-end plugin. Maybe delete, too (see below)
Local caching: It would be very useful to be able to run CAS locally and asyncronously or periodically upload to a shared CAS repo, so that uploading isn't a bottleneck. This might be just another multi-repo configuration. We often dealt with a trade-off between long compute and long transfer of large files.
One configuration, which would have been very useful for us at some points, would be to use Dropbox or a similar synced folder as the shared cache. This might be a two-folder multi-repo configuration with one being the "local cache", but it might work with a single folder.
I can see how to do most of the above, but it all becomes more challenging if you are attempting to store hashes of pointer files within the CAS, as suggested at https://github.com/ropensci/targets/discussions/1232#discussioncomment-10277319 . If the storage folder is only hash-named files/blobs, working with any storage back-end is simpler.
Cache clearing: A shared repo can grow indefinitely, so one would want some smart approaches to clearing out old stuff according to rules. For instance, you might set up a rule, "Delete all targets older than X date, unless they are in the meta file of any git branch tip." This is almost definitely a side quest that should be in another package, or part of plugins for storage back-ends. {relic} is where I've started messing with git-history related tasks, and I might put something there.

I'm interested in putting in an ISC proposal to work on this in the fall. Maybe we could use the proposal as a way to work out some design concepts?

wlandau commented 2 months ago

This is a bit of a brain dump of ideas in this area. These are far-reaching wishes but they are things that address real issues my team ran into, so I hope a design would allow for these extensions:

The more I think about it, the more it seems like a potential tar_repository() could enable what you describe. And I think it will fit nicely with the design of targets. It is similar enough to the vision of tar_format(), and there is room to extend repository = "..." the same way.

One could point to multiple CAS repositories for project,

This might be a less common use case, but for what I have in mind, each target would be able to choose a different CAS.

with the system checking for the right hash across them before building locally.

The underlying mechanism of repository = "aws" already works this way, and I think I could borrow it for tar_repository().

For instance, a project might be published with a read-only CAS repository, which anyone can draw from, but as one modifies a forked repository you write stuff to a local or team repository.

tar_make() would always need read access to run without errors, but tar_read() would not.

The repository itself could be an S3 bucket, files on a static site, attached to a GitHub release or in a scientific repository, so one might want to allow different plug-ins for the actual repository type for different transfer protocols, and to designate as read-only. It seems that one would need to define read, write, and LIST for each back-end plugin. Maybe delete, too (see below)

I agree.

Local caching: It would be very useful to be able to run CAS locally and asyncronously or periodically upload to a shared CAS repo, so that uploading isn't a bottleneck. This might be just another multi-repo configuration. We often dealt with a trade-off between long compute and long transfer of large files.

This second upload/sync stage sounds possible, and it would sit completely outside targets. tar_make() could write to the stage 1 local repo, and then something like a cron job could do periodic uploads/syncs.

One configuration, which would have been very useful for us at some points, would be to use Dropbox or a similar synced folder as the shared cache. This might be a two-folder multi-repo configuration with one being the "local cache", but it might work with a single folder. I can see how to do most of the above, but it all becomes more challenging if you are attempting to store hashes of pointer files within the CAS, as suggested at [general] New cloud hashing approach and collaborative workflows #1232 (reply in thread) . If the storage folder is only hash-named files/blobs, working with any storage back-end is simpler.

Good point, I remember struggling with this using storr for drake because its default local cache had the pointer file design. I no longer think we need pointer files. tar_repository() will take more work, but I came away from Posit Conf feeling that a truly first-class pluggable CAS system is exactly what so many users need, and it can take targets to the next level.

Cache clearing: A shared repo can grow indefinitely, so one would want some smart approaches to clearing out old stuff according to rules. For instance, you might set up a rule, "Delete all targets older than X date, unless they are in the meta file of any git branch tip." This is almost definitely a side quest that should be in another package, or part of plugins for storage back-ends. {relic} is where I've started messing with git-history related tasks, and I might put something there.

At a high level, I feel like functions tar_delete(), tar_prune(), and tar_destroy() assume you are working on your own personal project. And for CAS, there's so much historical data that these functions in targets wouldn't really clear out most of the garbage anyway. So it seems like the CAS itself is a good place to handle this, rather than a pluggable DELETE method in tar_repository().

For instance, you might set up a rule, "Delete all targets older than X date, unless they are in the meta file of any git branch tip."

What about by access date? I think this answers the question, "which are the data objects that nobody is using anymore?"

I'm interested in putting in an ISC proposal to work on this in the fall. Maybe we could use the proposal as a way to work out some design concepts?

Happy to take a look and comment on the proposal. I think it would help me make sure I'm not missing anything.

In the meantime, when I next get the chance, I plan to start prototyping tar_repository() and write a tutorial with an example simple local CAS.

noamross commented 2 months ago

What about by access date? I think this answers the question, "which are the data objects that nobody is using anymore?"

A good idea! The availability of this metadata will vary by back-end, which is why your suggestion that the cache-clearing stuff should be in storage plugin and/or version-control extensions rather than part of tar_repository().

noamross commented 2 months ago

tar_make() would always need read access to run without errors, but tar_read() would not.

Did you mean write access here?

When I meant multiple repositories, I was thinking not of different repositories per target (also useful!), but layers of repositories. LIST would to determine if the target was already built across all repositories, and the appropriate read function would be used to fetch from the most convenient one. Writing would occur in the local or single prioritized repository, and a separate/async process would upload to other repositories if appropriate. I think this would pretty much live entirely live in the CAS plugin and provide a single set of list/read/write functions to tar_repository()

A different question is how to handle targets of type "file". We've had some significant challenges when targets were large files or collections of files and every run required moving them in and out of cloud storage. Breaking out of tar_format() gives more potential flexibility here, and once again it can be up to the CAS plugin to some extent. One option is to have the plugin provide read_file and write_file functions, where one could put caching logic.

wlandau commented 2 months ago

Did you mean write access here?

Yes, I meant write access.

When I meant multiple repositories, I was thinking not of different repositories per target (also useful!), but layers of repositories. LIST would to determine if the target was already built across all repositories, and the appropriate read function would be used to fetch from the most convenient one. Writing would occur in the local or single prioritized repository, and a separate/async process would upload to other repositories if appropriate. I think this would pretty much live entirely live in the CAS plugin and provide a single set of list/read/write functions to tar_repository()

Got it. Yeah, tar_repository() would just need to know about the most immediate/on-demand layer, and any subsequent layers that sync on a schedule or some other way (i.e. Dropbox) would run separately from targets.

A different question is how to handle targets of type "file". We've had some significant challenges when targets were large files or collections of files and every run required moving them in and out of cloud storage. Breaking out of tar_format() gives more potential flexibility here, and once again it can be up to the CAS plugin to some extent. One option is to have the plugin provide read_file and write_file functions, where one could put caching logic.

For repository = tar_respoitory("..."), I was thinking to restrict format = "file" to single files and single directories. That way it is easier to upload to the same place as non-"file" targets and predictably restore the files on download. (For arbitrarily loose collections of files, would be harder to anticipate and control the edge cases.) For directories, it might be useful to automatically create a zip archive before passing it off to the user-defined upload method. That way users would not have to think about this special case.

noamross commented 2 months ago

tar_repository() would just need to know about the most immediate/on-demand layer, and any subsequent layers that sync on a schedule or some other way (i.e. Dropbox) would run separately from targets.

I think it would need to know about all the repositories on read, but only write to the immediate layer. The plug-in's LIST logic could return values across all repositories, and READ would pull from an upstream source (maybe copying to the immediate layer), but WRITE would only go to the immediate layer. But this logic can live within the user/back-end defined functions. One could leave some kind of meta-programming to assemble pre-defined repository layers as a future task for an add-on package.

For directories, it might be useful to automatically create a zip archive before passing it off to the user-defined upload method. That way users would not have to think about this special case.

I think restricting to either single files or single directories is helpful. I have to think through it a bit, but maybe there should be the ability to have the back-end define how it aggregates directories. This is the area where we had the most trouble and a repository layering approach might help, though its tricky. Would the immediate/local representation of files or directories be their un-aggregated values at their regular paths, so that no fetching or decompressing is required?

wlandau commented 2 months ago

maybe there should be the ability to have the back-end define how it aggregates directories.

Makes sense. On targets' end, it could be as simple as supplying the directory path to WRITE, same as a file path.

wlandau commented 2 months ago

There should be some kind optional “list” step at the beginning to make existence checking fast (e.g. with a LIST request in the case of AWS S3).

Instead of LIST, I'm actually thinking tar_repository() should have EXISTS. A user's implementation of EXISTS can call LIST the first time it is used and cache the results in an in-memory environment for later invocations of EXISTS.

wlandau commented 2 months ago

@noamross and @jaredlander, I just merged #1322 to add customizable content-addressable storage to targets. tar_repository_cas() is the fully general interface, and tar_repository_cas_local() gives you a local file-based CAS system.

ropensci / targets

Content addressable storage (CAS) #1314