jeanbmar / s3-sync-client

AWS CLI s3 sync command for Node.js
MIT License

Objects show as "updated" and are redownloaded even when unchanged #53

Open noelforte opened 1 year ago

noelforte commented 1 year ago

Environment: Node.js v18.16.0, macOS 13.4 Ventura

Steps to reproduce: Sample code for how I'm invoking s3-sync-client (with sensitive values stripped):

```js
// Initialize env
import 'dotenv/config';

// Load internal modules
import path from 'node:path';
import { env, exit } from 'node:process';

// Load external modules
import { S3SyncClient } from 's3-sync-client';

// Initialize client
const { sync } = new S3SyncClient({
  region: ***,
  endpoint: ***,
  forcePathStyle: false,
  credentials: {
    accessKeyId: ***,
    secretAccessKey: ***,
  },
});

const results = await sync(
  `s3://my-bucket/path/to/directory`,
  'output',
  {
    del: true,
  }
);

console.log(results);
```

Expected result: Items that are unchanged between the remote and the local system should not be recopied.

Actual result: Even after an initial successful sync to the local filesystem, s3-sync-client continues to redownload files that haven't changed, incurring additional bandwidth charges.

Here's a screen capture of the network requests going across:

![Screen Recording 2023-06-01 at 8 14 17 PM](https://github.com/jeanbmar/s3-sync-client/assets/2560683/e55e7d6b-d5bd-49e4-9fe9-1425a46b3e6d)

And the resulting output:

```js
{
  created: [],
  updated: [
    BucketObject {
      id: 'dim-gunger-UO2hOHLq9Y0-unsplash.jpg',
      size: 1667804,
      lastModified: 1685662029901,
      isExcluded: false,
      bucket: '***',
      key: 'test/dim-gunger-UO2hOHLq9Y0-unsplash.jpg'
    },
    BucketObject {
      id: 'luka-verc-D-ChPtXJhXA-unsplash.jpg',
      size: 2448935,
      lastModified: 1685662029901,
      isExcluded: false,
      bucket: '***',
      key: 'test/luka-verc-D-ChPtXJhXA-unsplash.jpg'
    },
    BucketObject {
      id: 'planet-volumes-6tI9Fk5p4bo-unsplash.jpg',
      size: 385869,
      lastModified: 1685662029923,
      isExcluded: false,
      bucket: '***',
      key: 'test/planet-volumes-6tI9Fk5p4bo-unsplash.jpg'
    },
    BucketObject {
      id: 'the-cleveland-museum-of-art-AiD3Pkwmtt0-unsplash.jpg',
      size: 3881833,
      lastModified: 1685662030480,
      isExcluded: false,
      bucket: '***',
      key: 'test/the-cleveland-museum-of-art-AiD3Pkwmtt0-unsplash.jpg'
    },
    BucketObject {
      id: 'yannick-apollon-rYXkqDZxfaw-unsplash.jpg',
      size: 13356953,
      lastModified: 1685662030513,
      isExcluded: false,
      bucket: '***',
      key: 'test/yannick-apollon-rYXkqDZxfaw-unsplash.jpg'
    }
  ],
  deleted: []
}
```


Happy to provide any other relevant details!

jeanbmar commented 1 year ago

Thank you for the very detailed report.

Can you please run the following code and paste the outputs of the two calls here:

import { S3SyncClient, ListLocalObjectsCommand, ListBucketObjectsCommand } from 's3-sync-client';
const client = new S3SyncClient({ /* your config */ });

console.log(
  await client.send(
    new ListLocalObjectsCommand({
      directory: 'output',
    })
  )
);

console.log(
  await client.send(
    new ListBucketObjectsCommand({
      bucket: 'my-bucket',
      prefix: 'path/to/directory',
    })
  )
);

The diff code for updates is pretty simple:

if (
  sourceObject.size !== targetObject.size ||
  (options?.sizeOnly !== true &&
    sourceObject.lastModified > targetObject.lastModified)
) {
  updated.push(sourceObject);
}

Let's see whether the issue comes from the values themselves or perhaps from their types.

noelforte commented 1 year ago

Sure thing, here's the local object output, truncated for brevity:

[
  LocalObject {
    id: 'test-obj-a.jpg',
    size: 1667804,
    lastModified: 1685735624000,
    isExcluded: false,
    path: 'output/test-obj-a.jpg'
  },
  LocalObject {
    id: 'test-obj.b.jpg',
    size: 385869,
    lastModified: 1685735634000,
    isExcluded: false,
    path: 'output/test-obj.b.jpg'
  }
]

and the bucket object output:

[
  BucketObject {
    id: 'test/test-obj-a.jpg',
    size: 1667804,
    lastModified: 1685735624935,
    isExcluded: false,
    bucket: '...',
    key: 'test/test-obj-a.jpg'
  },
  BucketObject {
    id: 'test/test-obj-b.jpg',
    size: 385869,
    lastModified: 1685735766762,
    isExcluded: false,
    bucket: '...',
    key: 'test/test-obj-b.jpg'
  }
]

Looks like the lastModified values of the local files are returned rounded down to whole seconds (i.e. to the nearest 1000 milliseconds).
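That truncation would trip the diff check quoted earlier: a remote timestamp that keeps its milliseconds always compares strictly newer than the same instant stored at whole-second precision. A minimal illustration using the values from the two outputs above (sample objects only, not s3-sync-client internals):

```javascript
// Remote object keeps millisecond precision; the local copy's mtime
// has been truncated to whole seconds.
const remote = { size: 1667804, lastModified: 1685735624935 };
const local = { size: 1667804, lastModified: 1685735624000 };

// Same shape as the library's diff check: flag an update when sizes
// differ or the source timestamp is strictly newer than the target's.
const needsUpdate =
  remote.size !== local.size || remote.lastModified > local.lastModified;

console.log(needsUpdate); // true, so the unchanged file is redownloaded
```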

jeanbmar commented 1 year ago

I've run tests on S3, and it seems AWS doesn't store milliseconds in the LastModified field.

Ref: https://github.com/aws/aws-cli/issues/5369

My test with the official AWS SDK commands:

await s3Client.send(
  new PutObjectCommand({
    Bucket: BUCKET_2,
    Key: 'def/jkl/xmoj',
    Body: Buffer.from('0x1234', 'hex'),
  })
);

console.log(
  (
    await s3Client.send(
      new ListObjectsV2Command({
        Bucket: BUCKET_2,
        Prefix: 'def/jkl/xmoj',
      })
    )
  ).Contents.map(({ LastModified }) => LastModified.getTime())
);

// => [ 1685740748000 ]

console.log(
  (
    await s3Client.send(
      new GetObjectCommand({
        Bucket: BUCKET_2,
        Key: 'def/jkl/xmoj',
      })
    )
  ).LastModified.getTime()
);

// => 1685740748000

Can you run the last two commands on test/test-obj-a.jpg? Here, s3Client is an S3Client instance from the official SDK. I have a feeling your provider (or the official AWS SDK) might return inconsistent timestamps between ListObjectsV2Command and GetObjectCommand, which would explain the issue.

noelforte commented 1 year ago

You are correct, that is the case. Here's the output:

console.log(
  (
    await clientS3.send(
      new ListObjectsV2Command({
        Bucket: env.S3_BUCKET,
        Prefix: 'test/test-obj-a.jpg',
      })
    )
  ).Contents.map(({ LastModified }) => LastModified.getTime())
);

// => [ 1685735624935 ]

console.log(
  (
    await clientS3.send(
      new GetObjectCommand({
        Bucket: env.S3_BUCKET,
        Key: 'test/test-obj-a.jpg',
      })
    )
  ).LastModified.getTime()
);

// => 1685735624000

Is there anything that can be done to work around that by disregarding the milliseconds if they are returned?
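One possible direction, sketched here purely for illustration (toSeconds and isUpdated are hypothetical helpers, not part of the s3-sync-client API): truncate both timestamps to whole seconds before comparing, so millisecond-only differences no longer count as updates.

```javascript
// Hypothetical comparison that discards milliseconds on both sides
// before applying the same update rule as the diff code quoted earlier.
const toSeconds = (ms) => Math.floor(ms / 1000);

function isUpdated(sourceObject, targetObject, options = {}) {
  return (
    sourceObject.size !== targetObject.size ||
    (options.sizeOnly !== true &&
      toSeconds(sourceObject.lastModified) >
        toSeconds(targetObject.lastModified))
  );
}

// With the timestamps from the outputs above, the object is no
// longer flagged as updated:
console.log(
  isUpdated(
    { size: 1667804, lastModified: 1685735624935 },
    { size: 1667804, lastModified: 1685735624000 }
  )
); // false
```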

jeanbmar commented 1 year ago

I'm not sure we can safely round or truncate values. If you look at test-obj.b.jpg in https://github.com/jeanbmar/s3-sync-client/issues/53#issuecomment-1574246333, we get lastModified values of 1685735634000 and 1685735766762 while the sizes are the same.

I would suggest opening a ticket with the provider and, in the meantime, using the sizeOnly: true option when syncing. Size comparison should be good enough.
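For reference, with sizeOnly: true the lastModified comparison is skipped entirely, so second-truncated local mtimes can no longer trigger spurious updates; in the sync call from the original report this would be `sync(source, target, { del: true, sizeOnly: true })`. A minimal sketch of that branch of the diff logic (the function name is illustrative, not the library's internal API):

```javascript
// Same comparison as the diff code quoted earlier, factored into a
// function so both modes can be shown side by side.
function isUpdated(src, tgt, options = {}) {
  return (
    src.size !== tgt.size ||
    (options.sizeOnly !== true && src.lastModified > tgt.lastModified)
  );
}

const remote = { size: 1667804, lastModified: 1685735624935 };
const local = { size: 1667804, lastModified: 1685735624000 };

console.log(isUpdated(remote, local)); // true: the timestamp trips the check
console.log(isUpdated(remote, local, { sizeOnly: true })); // false: sizes match
```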

noelforte commented 1 year ago

Whoops! That was a mistake on my part. I think I changed something in https://github.com/jeanbmar/s3-sync-client/issues/53#issuecomment-1574246333 that caused the times to shift (test-obj-a vs test-obj.a), which is where the inconsistency came from. After the last test I did in https://github.com/jeanbmar/s3-sync-client/issues/53#issuecomment-1574665098, having made sure the local objects and remote objects were identical, this is the output:

[
  LocalObject {
    id: 'test-obj-a.jpg',
    size: 1667804,
    lastModified: 1685735624000,
    isExcluded: false,
    path: 'output/test-obj-a.jpg'
  },
  LocalObject {
    id: 'test-obj-b.jpg',
    size: 385869,
    lastModified: 1685735766000,
    isExcluded: false,
    path: 'output/test-obj-b.jpg'
  }
]
[
  BucketObject {
    id: 'test/test-obj-a.jpg',
    size: 1667804,
    lastModified: 1685735624935,
    isExcluded: false,
    bucket: 'my-bucket',
    key: 'test/test-obj-a.jpg'
  },
  BucketObject {
    id: 'test/test-obj-b.jpg',
    size: 385869,
    lastModified: 1685735766762,
    isExcluded: false,
    bucket: 'my-bucket',
    key: 'test/test-obj-b.jpg'
  }
]

As you can see, the timestamps for each object are exactly the same apart from the milliseconds, so it doesn't appear to be an issue with the provider.
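The relationship can be checked directly: truncating each remote millisecond timestamp to whole seconds reproduces the corresponding local mtime exactly (values taken from the two outputs above):

```javascript
// [remote lastModified (ms precision), local lastModified (second precision)]
const pairs = [
  [1685735624935, 1685735624000], // test-obj-a.jpg
  [1685735766762, 1685735766000], // test-obj-b.jpg
];

// Dropping the millisecond component of each remote timestamp yields
// the local value, confirming the only difference is the milliseconds.
for (const [remoteMs, localMs] of pairs) {
  console.log(Math.floor(remoteMs / 1000) * 1000 === localMs); // true
}
```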