Closed bsweger closed 8 months ago
Incantation that supports supplying an S3 connection as part of the command (rather than needing to first create a config file):
rclone sync model-output/ :s3,provider=AWS,env_auth:hubverse-cloud/raw/model-output/ --checksum --verbose --stats-one-line
Sample output:
2024/03/07 12:21:43 INFO : voyager-borg1/2022-10-15-voyager-borg1.csv: Copied (replaced existing)
2024/03/07 12:21:43 INFO : hub-baseline/2022-10-08-hub-baseline.csv: Copied (replaced existing)
2024/03/07 12:21:43 INFO : 30.184 KiB / 30.184 KiB, 100%, 0 B/s, ETA -
^ would it make sense to explore feeding this output into whatever function is responsible for transforming data?
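One possible approach (a sketch, not tested against the hub's actual pipeline): rclone has a `--use-json-log` flag that emits structured log lines, which a downstream step could parse to learn exactly which files were copied. The field names below match rclone's JSON log format as I understand it, but should be verified against real output; any transform function that consumes the result is hypothetical.

```python
import json

def copied_files(log_lines):
    """Extract object names from rclone --use-json-log output.

    Per-object log records carry an "object" field and a "msg" such as
    "Copied (new)" or "Copied (replaced existing)". Stats lines and other
    non-object records are skipped.
    """
    copied = []
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any non-JSON lines
        if record.get("msg", "").startswith("Copied") and "object" in record:
            copied.append(record["object"])
    return copied

# Abbreviated log lines in the shape rclone's JSON logging produces
# (real records carry more keys, e.g. level, time, source):
sample = [
    '{"level":"info","msg":"Copied (replaced existing)","object":"voyager-borg1/2022-10-15-voyager-borg1.csv"}',
    '{"level":"info","msg":"Copied (replaced existing)","object":"hub-baseline/2022-10-08-hub-baseline.csv"}',
    '{"level":"info","msg":"30.184 KiB / 30.184 KiB, 100%, 0 B/s, ETA -"}',
]
print(copied_files(sample))
```

The copied-file list could then be handed to whatever function performs the data transformation, avoiding a full re-scan of the bucket.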
As determined in #34, the native AWS `s3 sync` command is not a good fit for GitHub Actions, since it relies on a file's timestamp to determine whether or not it has changed since the last sync operation (when the GitHub action checks out code on the runner's virtual machine, all files get the current timestamp, so `s3 sync` considers them all updated).

Others in our situation have recommended `rclone` as an alternative. A quick local test is promising: rclone has a `--checksum` option that uses a hash function to determine file changes (instead of relying on file modified dates).

I recreated the problem by checking out the `hubverse-cloud` repo to another local directory and running rclone as follows. Despite the more recent timestamps on the newly-cloned files, rclone didn't sync them.
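The behavior described above follows from comparing content hashes rather than modification times. A minimal, local-only illustration of the principle (this stands in for the remote comparison rclone performs against S3; it is not rclone's actual code path):

```python
import hashlib
import os
import tempfile
import time

def md5(path):
    """MD5 of a file's contents, the hash rclone compares for plain S3 objects."""
    with open(path, "rb") as fh:
        return hashlib.md5(fh.read()).hexdigest()

# Write a small file and record its hash and mtime.
with tempfile.NamedTemporaryFile(mode="w", suffix=".csv", delete=False) as f:
    f.write("date,value\n2022-10-08,1.0\n")
    path = f.name

hash_before = md5(path)
mtime_before = os.path.getmtime(path)

# Simulate a fresh git checkout: identical content, brand-new timestamp.
time.sleep(0.01)
os.utime(path)  # bump mtime to "now" without touching content

# A timestamp-based sync (aws s3 sync) would flag this file as changed;
# a checksum-based sync (rclone --checksum) sees identical hashes and skips it.
assert os.path.getmtime(path) > mtime_before
assert md5(path) == hash_before
print("mtime changed, checksum identical -> checksum-based sync skips the file")
os.remove(path)
```

This is why the freshly cloned copy of the repo, despite its newer timestamps, produced no transfers.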