hubverse-org / hubverse-cloud

Test hub for S3 data submission and storage
MIT License

Switch sync utility used in hubverse-aws-upload workflow #36

Closed bsweger closed 6 months ago

bsweger commented 6 months ago

As determined in #34, the native AWS s3 sync command is not a good fit for GitHub Actions because it relies on a file's timestamp to determine whether the file has changed since the last sync operation. When the GitHub action checks out code on the runner's virtual machine, all files get the current timestamp, so s3 sync considers them all updated.
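To make the failure mode concrete, here is a minimal Python sketch. It is a simplified model of the two comparison strategies, not the actual logic of the AWS CLI or rclone, and the helper names (`changed_by_mtime`, `changed_by_checksum`, `demo`) are made up for illustration. The "remote" object is simulated with a local file whose modification time is set one hour in the past.

```python
import hashlib
import os
import shutil
import tempfile
from pathlib import Path


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a file's contents."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def changed_by_mtime(local: Path, remote: Path) -> bool:
    """Timestamp-style check: flag the file when the local copy is newer."""
    return local.stat().st_mtime > remote.stat().st_mtime


def changed_by_checksum(local: Path, remote: Path) -> bool:
    """Checksum-style check: flag the file only when the contents differ."""
    return md5_of(local) != md5_of(remote)


def demo() -> tuple[bool, bool]:
    """Simulate a fresh checkout of a file that was already synced."""
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in for the previously synced object (assumption: a local file
        # plays the role of the S3 object here).
        remote = Path(tmp) / "remote.csv"
        remote.write_text("a,b\n1,2\n")
        an_hour_ago = remote.stat().st_mtime - 3600
        os.utime(remote, (an_hour_ago, an_hour_ago))

        # A fresh checkout: identical bytes, but the file is stamped "now".
        local = Path(tmp) / "checkout.csv"
        shutil.copyfile(remote, local)

        return changed_by_mtime(local, remote), changed_by_checksum(local, remote)


if __name__ == "__main__":
    by_mtime, by_checksum = demo()
    print(f"changed by mtime: {by_mtime}, changed by checksum: {by_checksum}")
```

In this model the timestamp check flags the unchanged file as updated, while the checksum check does not, which is the behavior a checksum-based sync is meant to provide.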

Others in our situation have recommended rclone as an alternative. A quick local test is promising:

I recreated the conditions of the problem by cloning the hubverse-cloud repo to another local directory and running rclone as follows. Despite the more recent timestamps on the newly cloned files, rclone didn't sync them.

rclone sync model-output/ s3-test:hubverse-cloud/raw/model-output/ --checksum --verbose
2024/03/07 10:46:54 INFO  : There was nothing to transfer
2024/03/07 10:46:54 INFO  :
Transferred:              0 B / 0 B, -, 0 B/s, ETA -
Checks:                 7 / 7, 100%
Elapsed time:         0.2s

bsweger commented 6 months ago

Incantation that supplies the S3 connection as part of the command (rather than needing to first create a config file):

rclone sync model-output/ :s3,provider=AWS,env_auth:hubverse-cloud/raw/model-output/ --checksum --verbose --stats-one-line

Sample output:

2024/03/07 12:21:43 INFO  : voyager-borg1/2022-10-15-voyager-borg1.csv: Copied (replaced existing)
2024/03/07 12:21:43 INFO  : hub-baseline/2022-10-08-hub-baseline.csv: Copied (replaced existing)
2024/03/07 12:21:43 INFO  :    30.184 KiB / 30.184 KiB, 100%, 0 B/s, ETA -

^ would it make sense to explore feeding this output into whatever function is responsible for transforming data?
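One way this could look, as a rough sketch only: a small Python helper that pulls the copied file paths out of rclone's verbose log text. The function name `copied_files` and the regular expression are assumptions based solely on the two sample "Copied (replaced existing)" lines above, not on a documented rclone log format, so the pattern may need adjusting for other message variants.

```python
import re

# Matches log lines shaped like the sample output above, e.g.
# 2024/03/07 12:21:43 INFO  : hub-baseline/2022-10-08-hub-baseline.csv: Copied (replaced existing)
# Assumption: the path sits between "INFO  : " and ": Copied (...)".
COPIED_LINE = re.compile(
    r"^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} INFO\s+: "
    r"(?P<path>.+?): Copied \((?P<detail>[^)]*)\)\s*$"
)


def copied_files(log_text: str) -> list[str]:
    """Return the paths that the log reports as copied, in log order."""
    paths = []
    for line in log_text.splitlines():
        match = COPIED_LINE.match(line)
        if match:
            paths.append(match.group("path"))
    return paths


if __name__ == "__main__":
    sample = (
        "2024/03/07 12:21:43 INFO  : voyager-borg1/2022-10-15-voyager-borg1.csv: Copied (replaced existing)\n"
        "2024/03/07 12:21:43 INFO  : hub-baseline/2022-10-08-hub-baseline.csv: Copied (replaced existing)\n"
        "2024/03/07 12:21:43 INFO  :    30.184 KiB / 30.184 KiB, 100%, 0 B/s, ETA -\n"
    )
    for path in copied_files(sample):
        print(path)
```

The resulting list of paths could then be handed to whatever function transforms the data. The stats line is ignored because it contains no "Copied" marker.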