hubverse-org / hubverse-cloud

Test hub for S3 data submission and storage
MIT License

Investigate the actual behavior of S3 sync #34

Closed bsweger closed 8 months ago

bsweger commented 8 months ago

According to the docs, the AWS s3 sync command:

Syncs directories and S3 prefixes. Recursively copies new and updated files from the source directory to the destination. Only creates folders in the destination if they contain one or more files.

The "new and updated files" bit sounds like what we want when pushing data to S3 after a hub PR is merged.

However, the GitHub workflow doing the sync produces some confusing output when it runs against data that hasn't been modified by a PR. For example, some output from this run:

Completed 3.4 KiB/53.5 KiB (66.7 KiB/s) with 7 file(s) remaining
upload: model-output/hub-baseline/2022-10-15-hub-baseline.parquet to s3://hubverse-cloud/model-output/hub-baseline/2022-10-15-hub-baseline.parquet
Completed 3.4 KiB/53.5 KiB (66.7 KiB/s) with 6 file(s) remaining
Completed 4.6 KiB/53.5 KiB (36.7 KiB/s) with 6 file(s) remaining

Before onboarding large hubs, we should confirm that the sync command is, in fact, only operating on the deltas.
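One quick way to check this from a workflow log is to count the upload messages. The sketch below is only an illustration: `count_uploads` is a hypothetical helper, and it assumes the log has one message per line, as in the excerpt above (the sample text is copied from that excerpt).

```shell
# Hypothetical helper: count "upload:" messages in sync output read from stdin.
# Also matches the "(dryrun) upload:" prefix that the CLI prints with --dryrun.
count_uploads() {
  # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true"
  grep -c -E '^(\(dryrun\) )?upload: ' || true
}

# Sample log lines copied from the workflow output quoted above
sample_log='Completed 3.4 KiB/53.5 KiB (66.7 KiB/s) with 7 file(s) remaining
upload: model-output/hub-baseline/2022-10-15-hub-baseline.parquet to s3://hubverse-cloud/model-output/hub-baseline/2022-10-15-hub-baseline.parquet
Completed 3.4 KiB/53.5 KiB (66.7 KiB/s) with 6 file(s) remaining
Completed 4.6 KiB/53.5 KiB (36.7 KiB/s) with 6 file(s) remaining'

uploads=$(printf '%s\n' "$sample_log" | count_uploads)
echo "files uploaded: $uploads"
```

If sync really only operates on deltas, a run against unmodified data should report zero uploads.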

bsweger commented 8 months ago

For context, this 2020 comment states that sync uses a file's timestamp and size to determine whether or not it has changed and is eligible for the operation.
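To make that rule concrete, here is a rough sketch of the decision as I understand it from the comment. `needs_upload` is a hypothetical function, not AWS CLI code, and the direction of the timestamp comparison (upload when the local file is newer than the S3 object) is my assumption.

```shell
# Hypothetical model of the sync decision, not the AWS CLI implementation.
# Usage: needs_upload LOCAL_SIZE LOCAL_MTIME REMOTE_SIZE REMOTE_MTIME
# Sizes are in bytes, times are epoch seconds. Prints "upload" or "skip".
needs_upload() {
  if [ "$1" -ne "$3" ]; then
    # Different size: the file has changed
    echo upload
  elif [ "$2" -gt "$4" ]; then
    # Same size, but the local copy is newer than the S3 object (assumed rule)
    echo upload
  else
    echo skip
  fi
}

# Same size, local file older than the S3 object: nothing to do
needs_upload 306 1709737260 306 1709738296
# Same size, local file newer than the S3 object: treated as changed
needs_upload 306 1709738900 306 1709738296
```

Under this model, a file whose contents never changed still gets uploaded if something bumps its local timestamp.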

bsweger commented 8 months ago

Created a test bucket (s3://bsweger-sync-test) to do a few checks via the AWS CLI.
TL;DR: everything works as expected when running aws s3 sync from a local machine.

First sync
Expected behavior: all data in the testy folder is synced to S3
Result: files synced as expected (S3 timestamps reflect the time of sync, not the file timestamps)

➜ ls -la testy/
total 16
drwxr-xr-x@  4 rsweger  128 Mar  6 15:07 .
drwxr-xr-x@ 22 rsweger  704 Mar  6 15:00 ..
-rw-r--r--   1 rsweger  307 Mar  6 15:01 sync-when.csv
-rw-r--r--   1 rsweger  30 Mar  6 15:07 sync-when.txt

➜ aws s3 sync testy/ s3://bsweger-sync-test/ --delete
upload: testy/sync-when.txt to s3://bsweger-sync-test/sync-when.txt
upload: testy/sync-when.csv to s3://bsweger-sync-test/sync-when.csv

➜ aws s3 ls bsweger-sync-test/ --summarize --human-readable --recursive
2024-03-06 15:08:12  307 Bytes sync-when.csv
2024-03-06 15:08:12   30 Bytes sync-when.txt

Total Objects: 2
   Total Size: 337 Bytes

Delete a local file
Expected behavior: the file deleted locally should also be removed from S3 (because we use the --delete flag)
Result: works as expected

➜ rm testy/sync-when.txt && ls -la testy/
total 8
drwxr-xr-x@  3 rsweger  96 Mar  6 15:13 .
drwxr-xr-x@ 22 rsweger  704 Mar  6 15:00 ..
-rw-r--r--   1 rsweger  307 Mar  6 15:01 sync-when.csv

➜ aws s3 sync testy/ s3://bsweger-sync-test/ --delete
delete: s3://bsweger-sync-test/sync-when.txt

➜ aws s3 ls bsweger-sync-test/ --summarize --human-readable --recursive
2024-03-06 15:08:12  307 Bytes sync-when.csv

Total Objects: 1
   Total Size: 307 Bytes

Run sync w/o changing anything
Expected behavior: noop
Result: works as expected
Note: unlike the GitHub action, this sync command produced no output, a clear indication that there was nothing to do

➜ aws s3 sync testy/ s3://bsweger-sync-test/ --delete

➜ aws s3 ls bsweger-sync-test/ --summarize --human-readable --recursive
2024-03-06 15:08:12  307 Bytes sync-when.csv

Total Objects: 1
   Total Size: 307 Bytes

Run sync on file change
Expected behavior: updated file syncs
Result: works as expected

➜ ls -la testy/
total 8
drwxr-xr-x@  3 rsweger  96 Mar  6 15:13 .
drwxr-xr-x@ 22 rsweger  704 Mar  6 15:00 ..
-rw-r--r--   1 rsweger  306 Mar  6 15:17 sync-when.csv

➜ aws s3 sync testy/ s3://bsweger-sync-test/ --delete
upload: testy/sync-when.csv to s3://bsweger-sync-test/sync-when.csv

➜ aws s3 ls bsweger-sync-test/ --summarize --human-readable --recursive
2024-03-06 15:18:16  306 Bytes sync-when.csv

Total Objects: 1
   Total Size: 306 Bytes
bsweger commented 8 months ago

Ran some tests using the s3 sync workflow in the hubverse-cloud repo (limiting the sync to the /testy folder for simplicity).
TL;DR: when aws s3 sync runs in GitHub Actions, it re-uploads everything to S3 (working theory: something in the GitHub process alters the file timestamps, so the sync thinks each file has changed).

First sync of new /testy folder
Expected: files move to S3
Result: files synced as expected (S3 timestamps reflect the time of sync, not the file timestamps)

Completed 306 Bytes/306 Bytes (3.6 KiB/s) with 1 file(s) remaining
upload: testy/sync-when.csv to s3://hubverse-cloud/testy/sync-when.csv

➜ aws s3 ls hubverse-cloud/testy/ --summarize --human-readable --recursive
2024-03-06 15:27:16  306 Bytes testy/sync-when.csv

Total Objects: 1
   Total Size: 306 Bytes

Re-run the GitHub action w/o another merge or file change
Expected: noop
Result: file synced again, and the version in S3 has an updated timestamp, even though the file's contents are unchanged

Completed 306 Bytes/306 Bytes (2.8 KiB/s) with 1 file(s) remaining
upload: testy/sync-when.csv to s3://hubverse-cloud/testy/sync-when.csv

➜ aws s3 ls hubverse-cloud/testy/ --summarize --human-readable --recursive
2024-03-06 15:36:08  306 Bytes testy/sync-when.csv

Total Objects: 1
   Total Size: 306 Bytes
bsweger commented 8 months ago

In hindsight, what's happening is clear: in the context of a workflow, the timestamps of all repo files are set to the time the repo was checked out to the runner VM.

That explains why the action syncs every file every time.
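A local analogy, for anyone who wants to see the effect without a runner: the sketch below uses a plain `cp` as a stand-in for the checkout (that substitution is an assumption on my part, as are the file contents). The copy has identical contents but a fresh modification time, which is all a timestamp-based comparison gets to see.

```shell
# Stand-in for a workflow checkout: cp without -p gives the copy a new
# modification time, the way freshly checked-out files get the checkout time.
workdir=$(mktemp -d)
mkdir "$workdir/checkout"

# Made-up file contents, backdated to look like a previously synced file
printf 'id,value\n1,2\n' > "$workdir/sync-when.csv"
touch -t 202403061501 "$workdir/sync-when.csv"

cp "$workdir/sync-when.csv" "$workdir/checkout/sync-when.csv"

# Contents are byte-for-byte identical...
if cmp -s "$workdir/sync-when.csv" "$workdir/checkout/sync-when.csv"; then
  same_content=yes
else
  same_content=no
fi

# ...but the copy looks newer, so a timestamp check flags it as changed
if [ "$workdir/checkout/sync-when.csv" -nt "$workdir/sync-when.csv" ]; then
  newer=yes
else
  newer=no
fi

echo "same content: $same_content, newer timestamp: $newer"
```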

Two solutions come to mind immediately (there may be others):

  1. Use the --size-only flag with the sync command. This tells AWS to consider only the file size (not the timestamp) when determining whether or not a file has been updated. Don't love this: what if a file has changed but is still the same size?
  2. Use another sync product (rclone was recommended by others in the same situation)
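To illustrate the worry about option 1, here is a small local sketch (no AWS calls; the file names and contents are made up). Two versions of a file differ by one character, so a size-only comparison would skip the upload while a content checksum catches the change.

```shell
workdir=$(mktemp -d)

# Two hypothetical versions of the same file: one value changes, size does not
printf 'id,value\n1,2\n' > "$workdir/before.csv"
printf 'id,value\n1,3\n' > "$workdir/after.csv"

# Byte counts ($(( )) strips the padding some wc implementations add)
size_before=$(( $(wc -c < "$workdir/before.csv") ))
size_after=$(( $(wc -c < "$workdir/after.csv") ))

# Content checksums via POSIX cksum
sum_before=$(cksum < "$workdir/before.csv")
sum_after=$(cksum < "$workdir/after.csv")

# What a size-only comparison would conclude
if [ "$size_before" -eq "$size_after" ]; then size_verdict=skip; else size_verdict=upload; fi

# What a checksum comparison would conclude
if [ "$sum_before" = "$sum_after" ]; then checksum_verdict=skip; else checksum_verdict=upload; fi

echo "size-only: $size_verdict, checksum: $checksum_verdict"
```

This is part of why a checksum-based tool is appealing. As far as I know, rclone has a --checksum flag that compares checksum and size instead of modification time and size, but I haven't tested it against this bucket.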

Gonna close this one out because we have, in fact, "investigated the actual behavior of S3 sync."