peak / s5cmd

Parallel S3 and local filesystem execution tool.
MIT License

Issue with read-after-write consistency #531

Open · gusostow opened this issue 2 years ago

gusostow commented 2 years ago

Version: 2.0.0

I find that files uploaded to S3 are not always immediately available for access.

It was my understanding that this should not happen with strong consistency. Is s5cmd exiting before the PUT request returns?

#!/bin/bash
set -e

while true; do
    name=$(openssl rand -hex 12)
    path="/tmp/$name"
    outpath="/tmp/$name/out"
    url="s3://bucket/$name"

    mkdir "$path"
    mkdir "$outpath"
    touch "$path/file"

    s5cmd cp "$path/*" "$url/"
    s5cmd --log trace cp "$url/*" "$outpath/"

    if [[ ! -f $outpath/file ]]; then
        echo "$outpath/file doesn't exist"
        exit 1
    else
        echo "$outpath/file exists"
    fi
    sleep 1
done

This tends to fail after a few iterations for me.

Note that a similar variant of this test never fails when the file is transferred directly instead of by directory/prefix.
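A sketch of that never-failing variant, assuming the same variables as the loop above, just replacing the two wildcard copies with direct object copies:

s5cmd cp "$path/file" "$url/file"
s5cmd cp "$url/file" "$outpath/file"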

gusostow commented 2 years ago

From looking at the logs more closely, the second command, s5cmd --log trace cp "$url/*" $outpath/, which copies back from S3 to local, is exiting 0 even though the wildcard matches nothing and no files are copied. It's my understanding that s5cmd normally exits 0 and logs an error when this is the case.

See example output from the reproducer:

cp /tmp/19bb0c6e83d17480924f9206/file s3://BUCKET/19bb0c6e83d17480924f9206/file
DEBUG: Request s3/ListObjectsV2 Details:
---[ REQUEST POST-SIGN ]-----------------------------
GET /?list-type=2&prefix=aostow%2F19bb0c6e83d17480924f9206%2F HTTP/1.1
Host: asd-tmp.s3.amazonaws.com
User-Agent: aws-sdk-go/1.40.25 (go1.18.3; linux; amd64)
Authorization: AWS4-HMAC-SHA256 Credential=AKIAWP5DEQE432HWSWWU/20221028/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=36910278f845f4e08f6959d0821e3df29b4fdf633facc0a70657ccecba2ed7d7
X-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
X-Amz-Date: 20221028T183948Z
Accept-Encoding: gzip

-----------------------------------------------------
DEBUG: Response s3/ListObjectsV2 Details:
---[ RESPONSE ]--------------------------------------
HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Fri, 28 Oct 2022 18:39:49 GMT
Server: AmazonS3
X-Amz-Bucket-Region: us-east-1
X-Amz-Id-2: Tr7r7r2dyW3MTw9SkOlsfyXxC+R3ORNbZv2RNSqJPZaXqy4mn8FalkBjksHDa0dSQR2BoDQ/ibk=
X-Amz-Request-Id: GH4N1ZXGZ02TE3ST

-----------------------------------------------------
/tmp/19bb0c6e83d17480924f9206/out/file doesn't exist

s5cmd cp exited with 0

s5cmd is listing objects then exiting 0 without copying.
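One possible stopgap (a sketch, not a fix; the retry count is arbitrary) is to poll the listing until the wildcard actually matches something before copying back, using the same variables as the reproducer:

# wait until the listing returns the object, then copy back
for _ in $(seq 1 10); do
    [ -n "$(s5cmd ls "$url/*" 2>/dev/null)" ] && break
    sleep 1
done
s5cmd cp "$url/*" "$outpath/"

This only masks the symptom, but it avoids copying while the listing is still empty.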

mwkaufman commented 1 year ago

Seeing the same lack of strong consistency with the script below, on the latest beta v2.1.0-beta.1-3e08061, on v2.0.0, and on v1.4.0-d7a0dda:

#!/bin/bash

echo "### COPY"
s5cmd --log debug --json --stat cp s3://bucket1/my.parquet s3://bucket2/my.parquet

echo "### LIST, copied item missing"
s5cmd --log debug --json --stat ls s3://bucket2/my.parquet
sleep 5

echo "### LIST2, same query after sleep, item is there"
s5cmd --log debug --json --stat ls s3://bucket2/my.parquet

It doesn't happen 100% of the time, but I'd say more than 50%. I wrote similar scripts using the AWS CLI and github.com/aws/aws-sdk-go v1.40.25, and they never showed this gap in read-after-write consistency. The file used in testing is 122 KB; s5cmd behaves the same whether the destination is a new file or an overwritten one.
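For reference, the AWS CLI analogue was along these lines (same hypothetical bucket and key names as above):

echo "### COPY"
aws s3 cp s3://bucket1/my.parquet s3://bucket2/my.parquet

echo "### LIST, item is present immediately"
aws s3 ls s3://bucket2/my.parquet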

mwkaufman commented 1 year ago

So this seems tied to S3 having a fast clock relative to the client. For example: we did a copy at 16:27:27, failed a read at 16:27:28, and then succeeded a read at 16:27:33 that showed a LastModified time of 16:27:29. s5cmd discarded the object on the initial read due to this logic that ignores objects with future timestamps (https://github.com/peak/s5cmd/pull/188/files). When we went back to s5cmd 1.0.0, it worked fine.

Thoughts on a flag to optionally disable ignoring files with future timestamps? e.g. s5cmd --show-files-with-future-lastmodified-dates ls s3://bucket2/my.parquet

https://github.com/peak/s5cmd/blob/v2.1.0-beta.1/storage/s3.go#L208 -> if mod.After(now) && ignoreFutureFiles {
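To eyeball the skew yourself, one rough check is to compare the local clock against the Date header returned by the S3 endpoint. The bucket name below is a placeholder, and even an unauthenticated 403 response carries the header:

#!/bin/bash
# grab the second-precision Date header from a HEAD request to the endpoint
s3_date=$(curl -sI "https://my-bucket.s3.amazonaws.com/" | tr -d '\r' \
    | awk -F': ' 'tolower($1) == "date" {print $2}')
echo "S3 clock:    $(date -u -d "$s3_date" '+%T')"   # requires GNU date
echo "local clock: $(date -u '+%T')"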

denizsurmeli commented 1 year ago

The reason this problem occurs is that the client and server clocks differ, as @mwkaufman mentioned. https://github.com/peak/s5cmd/blob/74d21bb4f022edb0bfb69f412c4baa69153ba8c2/storage/s3.go#L333-L337

https://github.com/peak/s5cmd/blob/74d21bb4f022edb0bfb69f412c4baa69153ba8c2/storage/s3.go#L345-L349 These checks were added for #168. ListObjectsV2 returns results in pages of up to 1,000 keys. If you move an object that matches the wildcard pattern, and the moved object then shows up in a later page of the listing, it matches the pattern again and gets recopied, creating an infinite loop. This happens because s5cmd does not finish listing all objects before it starts copying; it lists and copies concurrently, unlike gsutil, which lists first and then copies (and aws-cli has different wildcard semantics than s5cmd).
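A hypothetical command shape for that failure mode, where the destination prefix still matches the source wildcard:

# with more than 1000 matching keys the listing paginates, so objects already
# moved under archive/ can reappear in a later page, match the wildcard
# again, and be moved again
s5cmd mv "s3://bucket/logs/*" s3://bucket/logs/archive/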

#567 takes another approach: request the current time from the response header instead of using the local clock. But LastModified has millisecond precision, while the date returned in the header has only second precision. So imagine you are operating as in #168. You start at 18:00:00.555. Omitting the dates, the actual server time is 18:00:00.750, but the time returned in the header is 18:00:01. You move an object and it gets a last-modified time of 18:00:00.850. The object slips past the filter, causing #168 to reoccur. Granted, this squeezes into a tiny fraction of the possibilities, but the problem still exists in the current implementation.
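The precision mismatch is visible directly from the two sources (hypothetical bucket and key; the aws CLI and curl are used only for illustration):

# LastModified in the listing is formatted with sub-second precision
aws s3api list-objects-v2 --bucket my-bucket --prefix my.parquet \
    --query 'Contents[0].LastModified' --output text

# the Date response header carries whole seconds only
curl -sI "https://my-bucket.s3.amazonaws.com/" | grep -i '^date:'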

So, what are some possible solutions?

denizsurmeli commented 1 year ago

After further research and discussion, we have concluded that the current solution is the best-effort option for now. We are leaning towards adding a flag to disable the future-object check when ls is called.