bsweger closed this 8 months ago
For context, this 2020 comment states that sync uses a file's timestamp and size to determine whether it has changed and is eligible for the operation.
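To make the timestamp-and-size rule concrete, here is a minimal sketch of the check as a plain function. It is a simplified model and not the AWS CLI's code: the function name and parameters are hypothetical, and it assumes an upload happens when the object is missing, the sizes differ, or the local file is newer than the S3 object. The sample values come from the local test in this comment.

```python
from datetime import datetime


def needs_upload(local_size, local_mtime, remote_size, remote_mtime, size_only=False):
    """Simplified model of the size/timestamp check (hypothetical helper)."""
    if remote_size is None:
        return True  # the object is not in the bucket yet
    if local_size != remote_size:
        return True  # a size difference always counts as a change
    if size_only:
        return False  # with size-only comparison, timestamps are ignored
    return local_mtime > remote_mtime  # a newer local file counts as a change


# The CSV was last modified locally at 15:01 and uploaded to S3 at 15:08:12.
local_mtime = datetime(2024, 3, 6, 15, 1)
s3_mtime = datetime(2024, 3, 6, 15, 8, 12)

print(needs_upload(307, local_mtime, 307, s3_mtime))  # False
print(needs_upload(306, datetime(2024, 3, 6, 15, 17), 307, s3_mtime))  # True
```

Under this model, an unchanged file whose local timestamp is older than the S3 timestamp is skipped, which matches the no-op run in the local test.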
Created a test bucket (s3://bsweger-sync-test) to do a few checks via the AWS CLI.
TL;DR: everything works as expected when running s3 sync from a local machine.
First sync
Expected behavior: all data in the testy folder is synced to S3
Result: files synced as expected (S3 timestamps reflect the time of sync, not the file timestamps)
➜ ls -la testy/
total 16
drwxr-xr-x@ 4 rsweger 128 Mar 6 15:07 .
drwxr-xr-x@ 22 rsweger 704 Mar 6 15:00 ..
-rw-r--r-- 1 rsweger 307 Mar 6 15:01 sync-when.csv
-rw-r--r-- 1 rsweger 30 Mar 6 15:07 sync-when.txt
➜ aws s3 sync testy/ s3://bsweger-sync-test/ --delete
upload: testy/sync-when.txt to s3://bsweger-sync-test/sync-when.txt
upload: testy/sync-when.csv to s3://bsweger-sync-test/sync-when.csv
➜ aws s3 ls bsweger-sync-test/ --summarize --human-readable --recursive
2024-03-06 15:08:12 307 Bytes sync-when.csv
2024-03-06 15:08:12 30 Bytes sync-when.txt
Total Objects: 2
Total Size: 337 Bytes
Delete a local file
Expected behavior: the file deleted locally should also be removed from S3 (because we use the --delete flag)
Result: works as expected
➜ rm testy/sync-when.txt && ls -la testy/
total 8
drwxr-xr-x@ 3 rsweger 96 Mar 6 15:13 .
drwxr-xr-x@ 22 rsweger 704 Mar 6 15:00 ..
-rw-r--r-- 1 rsweger 307 Mar 6 15:01 sync-when.csv
➜ aws s3 sync testy/ s3://bsweger-sync-test/ --delete
delete: s3://bsweger-sync-test/sync-when.txt
➜ aws s3 ls bsweger-sync-test/ --summarize --human-readable --recursive
2024-03-06 15:08:12 307 Bytes sync-when.csv
Total Objects: 1
Total Size: 307 Bytes
Run sync w/o changing anything
Expected behavior: noop
Result: works as expected
Note: unlike the GitHub action, this sync command produced no output, a clear indication that there was nothing to do
➜ aws s3 sync testy/ s3://bsweger-sync-test/ --delete
➜ aws s3 ls bsweger-sync-test/ --summarize --human-readable --recursive
2024-03-06 15:08:12 307 Bytes sync-when.csv
Total Objects: 1
Total Size: 307 Bytes
Run sync on file change
Expected behavior: updated file syncs
Result: works as expected
➜ ls -la testy/
total 8
drwxr-xr-x@ 3 rsweger 96 Mar 6 15:13 .
drwxr-xr-x@ 22 rsweger 704 Mar 6 15:00 ..
-rw-r--r-- 1 rsweger 306 Mar 6 15:17 sync-when.csv
➜ aws s3 sync testy/ s3://bsweger-sync-test/ --delete
upload: testy/sync-when.csv to s3://bsweger-sync-test/sync-when.csv
➜ aws s3 ls bsweger-sync-test/ --summarize --human-readable --recursive
2024-03-06 15:18:16 306 Bytes sync-when.csv
Total Objects: 1
Total Size: 306 Bytes
Run some tests using the s3 sync workflow in the hubverse-cloud repo (limit sync to the /testy folder for simplicity)
TL;DR: When s3 sync runs in GitHub Actions, it re-uploads everything to S3 (working theory: something in the GitHub process alters the file timestamps, so the sync treats every file as changed)
First sync of new /testy folder
Expected: files move to S3
Result: files synced as expected (S3 timestamps reflect the time of sync, not the file timestamps)
Completed 306 Bytes/306 Bytes (3.6 KiB/s) with 1 file(s) remaining
upload: testy/sync-when.csv to s3://hubverse-cloud/testy/sync-when.csv
➜ aws s3 ls hubverse-cloud/testy/ --summarize --human-readable --recursive
2024-03-06 15:27:16 306 Bytes testy/sync-when.csv
Total Objects: 1
Total Size: 306 Bytes
Re-run the GitHub action w/o another merge or file change
Expected: noop
Result: file synced again, and the version in S3 has an updated timestamp, even though the file's contents are unchanged
Completed 306 Bytes/306 Bytes (2.8 KiB/s) with 1 file(s) remaining
upload: testy/sync-when.csv to s3://hubverse-cloud/testy/sync-when.csv
➜ aws s3 ls hubverse-cloud/testy/ --summarize --human-readable --recursive
2024-03-06 15:36:08 306 Bytes testy/sync-when.csv
Total Objects: 1
Total Size: 306 Bytes
In hindsight, what's happening is clear: in the context of a workflow, the timestamps of all repo files are set to the time the repo was checked out to the runner VM.
That explains why the action syncs every file every time.
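The effect of a fresh checkout can be imitated locally. The sketch below is only a stand-in: it uses `shutil.copy` (which writes a new file without preserving the modification time) to play the role of the checkout step, and the file names and one-hour offset are made up for illustration.

```python
import os
import shutil
import tempfile
import time

with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "sync-when.csv")
    with open(original, "w") as f:
        f.write("a,b\n1,2\n")

    # Pretend the file was last edited an hour ago.
    an_hour_ago = time.time() - 3600
    os.utime(original, (an_hour_ago, an_hour_ago))

    # Stand-in for a fresh checkout: same bytes, but the mtime is "now".
    checkout = os.path.join(tmp, "checkout.csv")
    shutil.copy(original, checkout)

    same_size = os.path.getsize(original) == os.path.getsize(checkout)
    newer = os.path.getmtime(checkout) > os.path.getmtime(original)

print(same_size, newer)  # True True
```

The copy has identical contents and size, yet it looks newer than the original, so a timestamp-based comparison would treat it as updated.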
Two solutions come to mind immediately, though there may be others:
Use the --size-only flag with the sync command. This tells AWS to consider only the file size (not the timestamp) when determining whether a file has been updated. Don't love this: what if a file has changed but is still the same size?
Gonna close this one out because we have, in fact, "investigated the actual behavior of S3 sync."
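The worry about --size-only mentioned above is easy to reproduce. In this small illustration the two byte strings are invented sample data, and the hash comparison is only there to show that the contents differ; it is not something the sync command is claimed to do.

```python
import hashlib

# Two hypothetical versions of a file: one digit changed, same length.
old = b"location,value\nUS,10\n"
new = b"location,value\nUS,11\n"

same_size = len(old) == len(new)
same_content = hashlib.sha256(old).hexdigest() == hashlib.sha256(new).hexdigest()

print(same_size, same_content)  # True False
```

A size-only comparison would consider these two versions identical and skip the upload.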
According to the docs, the AWS s3 sync command recursively copies new and updated files from the source directory to the destination.
The "new and updated files" bit sounds like what we want when pushing data to S3 when a hub PR is merged.
However, the GitHub workflow doing the sync has some confusing outputs when it runs against data that hasn't been modified by a PR. For example, some output from this run:
Before onboarding large hubs, we should confirm that the sync command is, in fact, only operating on the deltas.
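One way to check for deltas is to count the operations reported in the sync output. The helper below is a rough sketch with a hypothetical name; it assumes operation lines start with `upload:`, `delete:`, `copy:`, or `download:` (optionally prefixed with `(dryrun) `) and ignores progress lines. The sample log is the output from the re-run in the comment above.

```python
import re
from collections import Counter

# Operation lines look like "upload: a to s3://b/a" or "delete: s3://b/a".
# The "(dryrun) " prefix and the copy/download verbs are assumptions about
# the CLI's output format.
OPERATION = re.compile(r"^(?:\(dryrun\) )?(upload|delete|copy|download): ")


def count_operations(output):
    """Count sync operations by type, skipping progress and blank lines."""
    counts = Counter()
    for line in output.splitlines():
        match = OPERATION.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


log = """Completed 306 Bytes/306 Bytes (2.8 KiB/s) with 1 file(s) remaining
upload: testy/sync-when.csv to s3://hubverse-cloud/testy/sync-when.csv"""

print(dict(count_operations(log)))  # {'upload': 1}
```

If a run against unmodified data reports any uploads, the sync is not limited to deltas.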