wlandau closed 11 months ago
For this to work, I think I will need to switch to using ETags as hashes instead of the targets custom hash in the metadata. I think the reason I didn't do this initially was because I didn't know that S3 was strongly read-after-write consistent.
Roadmap for AWS:
aws_s3_list() in the utils. Remember pagination.
store_aws_hash() to use a cache. This function should only be called locally in the central controlling R session. I could put guardrails to make sure that stays the case.
Unfortunately list_objects_v2() does not return version IDs, and list_object_versions() returns too much information (never just the most current objects). So it looks like this caching will not be version-aware and will have to fall back on HEAD requests if you git reset your way back to historical metadata.
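A minimal sketch of the paginated listing mentioned in the roadmap, in base R. The function name aws_s3_list_etags() and the list_page callback are hypothetical; in practice the callback might wrap a call to list_objects_v2(), whose response carries Contents, IsTruncated, and NextContinuationToken. A fake two-page listing stands in for a real bucket so the example runs offline.

```r
# Collect the ETag of every current object under a prefix, following
# continuation tokens until the listing is exhausted.
# `list_page` is a hypothetical callback: function(prefix, token) that returns
# a list shaped like one page of an S3 ListObjectsV2 response.
aws_s3_list_etags <- function(list_page, prefix = "") {
  etags <- character(0)
  token <- NULL
  repeat {
    page <- list_page(prefix = prefix, token = token)
    for (object in page$Contents) {
      # S3 wraps ETags in literal double quotes, so strip them.
      etags[[object$Key]] <- gsub("\"", "", object$ETag, fixed = TRUE)
    }
    if (!isTRUE(page$IsTruncated)) {
      break
    }
    token <- page$NextContinuationToken
  }
  etags
}

# Fake two-page listing (illustrative keys and ETags only).
fake_pages <- list(
  list(
    Contents = list(
      list(Key = "prefix/x", ETag = "\"abc\""),
      list(Key = "prefix/y", ETag = "\"def\"")
    ),
    IsTruncated = TRUE,
    NextContinuationToken = "page2"
  ),
  list(
    Contents = list(list(Key = "prefix/z", ETag = "\"ghi\"")),
    IsTruncated = FALSE
  )
)
fake_list_page <- function(prefix, token) {
  if (is.null(token)) fake_pages[[1]] else fake_pages[[2]]
}

cache <- aws_s3_list_etags(fake_list_page, prefix = "prefix/")
print(cache)
```

The result is a named character vector keyed by object key, which is cheap to look up but carries no version IDs, matching the limitation described above.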
For GCS, it might be good to just switch to ETags for the next release, then wait for https://github.com/cloudyr/googleCloudStorageR/issues/179.
Hmm.... I don't think we need to switch to ETags for hashes. We could just store the ETag as part of the metadata and use ETags instead of versions to corroborate objects.
I thought this through a bit more, and unfortunately this batched caching feature no longer seems feasible.
As I said before, list_object_versions() is not feasible because it lists all the versions of all the objects, without any kind of guardrail to list e.g. only the most recent versions. Any given object could have thousands of versions, and so listing all the versions of all the objects is way too much.
On the other hand, neither list_objects() nor list_objects_v2() lists version IDs at all, so it is impossible to confirm that the version listed in the metadata actually exists or is current. For example, suppose you revert to a historical copy of the metadata, and you see version ABC and ETag XYZ for target x. The bucket's current version could have ETag XYZ, but version ABC may no longer exist. (For example, it might have been automatically deleted by the object retention policy.)
These and similar problems are impossible to reconcile unless:
targets sends a HEAD request for each individual object, as it currently does, or
(2) seems impossible, so I think we have to stick with (1).
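To make the version ABC / ETag XYZ scenario concrete, here is a small simulation in base R. Everything in it is illustrative: the bucket is an in-memory list, and list_current_etags() and head_version_exists() are hypothetical stand-ins for a version-unaware listing and a per-object HEAD request with a version ID.

```r
# Simulated versioned bucket: each key maps to its surviving versions,
# newest last. Version "ABC" of target x has already been deleted, and the
# current version "DEF" happens to have the same ETag.
bucket <- list(
  x = list(
    list(version = "DEF", etag = "XYZ")
  )
)

# What a version-unaware listing can report: the current ETag of each key.
list_current_etags <- function(bucket) {
  vapply(
    bucket,
    function(versions) versions[[length(versions)]]$etag,
    character(1)
  )
}

# What a HEAD request with a version ID can report: whether that exact
# version still exists.
head_version_exists <- function(bucket, key, version) {
  versions <- bucket[[key]]
  any(vapply(versions, function(v) identical(v$version, version), logical(1)))
}

# Historical metadata recovered for target x.
metadata <- list(key = "x", version = "ABC", etag = "XYZ")

etag_matches <- identical(
  unname(list_current_etags(bucket)[metadata$key]),
  metadata$etag
)
version_exists <- head_version_exists(bucket, metadata$key, metadata$version)

print(c(etag_matches = etag_matches, version_exists = version_exists))
```

The listing alone says the target looks fine, while the per-version check shows the recorded version is gone, which is why the listing cannot replace the HEAD requests when versions matter.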
Tried to send a feature request on their feedback form, but it's glitchy today:
I am writing an R package which needs to check the existence of a specific version of each AWS S3 object in its data store. The version of a given object is the version ID recorded in the local metadata, and the recorded version may or may not be the most current version in the bucket. Currently, the package accomplishes this by sending a HEAD request for each relevant object-version pair.
I would like a more efficient/batched way to do this for each version/object pair. list_object_versions() returns every version of every object of interest, which is way too many versions to download efficiently, and neither list_objects() nor list_objects_v2() returns any version IDs at all. It would be great to have something like delete_objects(), but instead of deleting the objects, accept the supplied key-version pairs and return the ETag and custom metadata of each one that exists.
cf. https://repost.aws/questions/QUe-yNsIr0Td2aq2oA1RAQdQ/hudi-and-s3-object-versions
Note to self: if it ever becomes possible to revisit this issue, I will probably need to switch targets to use AWS/GCS ETags when available instead of custom local file hashes. The switch is as simple as this:
In store_upload_object_aws(), remove the targets-hash custom metadata: https://github.com/ropensci/targets/blob/13470eff47d1d4a87abe2ee398257fcb27b580ec/R/class_aws.R#L227
In store_upload_object_aws(), write store$file$hash <- digest_chr64(head$ETag) just above the following line: https://github.com/ropensci/targets/blob/13470eff47d1d4a87abe2ee398257fcb27b580ec/R/class_aws.R#L249
In store_aws_hash(), return digest_chr64(head$ETag) instead of head$Metadata[["targets-hash"]].
store_aws_hash() to assert that up-to-date targets are indeed up to date.
Taking a step back: this is actually feasible if targets can ignore version IDs. There could be a tar_option_set()-level option to either check or ignore version IDs. Things to consider:
tar_option_set() and not tar_target()? At first glance, I think so because caching happens in bulk. Maybe the level of tar_resources_aws() could technically work, but those options are all implicitly target-level, which would be counterintuitive even with good documentation.
Taking another step back: targets should:
(1) ensures behavior is clear, consistent, compatible, and version-aware. (2) ensures a target reruns if it is not the current object in the bucket. (2) also makes this issue so much easier to implement. And it lets us avoid adding a new version argument to tar_resources_aws(). The outcomes will be:
Under the default settings for cloud storage, targets checks each and every target hash with its own AWS API call, which is extremely time-consuming. This is why https://books.ropensci.org/targets/cloud-storage.html recommends tar_cue(file = FALSE) for large pipelines on the cloud. This is fine if you're not manually modifying objects in the bucket, but it is not ideal. It would be better to find a safer way to speed up targets when it checks that cloud objects are up to date.
Previously I posted https://github.com/ropensci/targets/issues/1131. Versioning might not be a problem if we assume most of the objects are in their current version most of the time. However, list_objects_v2() operates on whole prefixes, which might slow us down because it operates on more objects than we really need. And then there's pagination to contend with. This functionality is worth revisiting, but the ideas I have so far range from painful to infeasible.
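One way to picture the bulk check discussed in this thread is the base R sketch below. It is only an illustration under stated assumptions: targets_up_to_date(), its check_versions argument, and the head_version_exists callback are hypothetical names, the metadata is a plain data frame, and the cache is a named vector of ETags such as a prefix listing might produce (including keys the pipeline does not need).

```r
# Compare recorded ETags against a bulk listing. Optionally confirm the
# recorded version ID of each matching target with one HEAD-style call.
# `head_version_exists` is a hypothetical callback:
# function(key, version) returning TRUE or FALSE.
targets_up_to_date <- function(meta, cache, check_versions = FALSE,
                               head_version_exists = NULL) {
  listed <- unname(cache[meta$key])  # NA for keys absent from the listing
  ok <- !is.na(listed) & listed == meta$etag
  if (check_versions) {
    stopifnot(is.function(head_version_exists))
    # Only targets that passed the cheap ETag check need the expensive call.
    for (i in which(ok)) {
      ok[i] <- isTRUE(head_version_exists(meta$key[i], meta$version[i]))
    }
  }
  stats::setNames(ok, meta$key)
}

# Illustrative metadata and listing. Key "w" is in the prefix but not needed,
# and key "z" is missing from the bucket.
meta <- data.frame(
  key = c("x", "y", "z"),
  etag = c("e1", "e2", "e3"),
  version = c("v1", "v2", "v3"),
  stringsAsFactors = FALSE
)
cache <- c(x = "e1", y = "changed", w = "e9")

# Version-unaware check: no per-object requests at all.
fast <- targets_up_to_date(meta, cache)

# Version-aware check: here the fake callback reports every version as gone.
strict <- targets_up_to_date(
  meta,
  cache,
  check_versions = TRUE,
  head_version_exists = function(key, version) FALSE
)

print(fast)
print(strict)
```

Ignoring version IDs keeps the check to a single listing, at the cost of trusting that a matching ETag means the recorded object is the current one.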