gusostow opened this issue 2 years ago (status: open)
From looking at the logs more closely, the second command, s5cmd --log trace cp "$url/*" $outpath/, which copies from S3 back to local, is exiting 0 even though no files are copied based on the wildcard. It's my understanding that s5cmd normally exits 0 and logs an error when this is the case.
See example output from the reproducer:
cp /tmp/19bb0c6e83d17480924f9206/file s3://BUCKET/19bb0c6e83d17480924f9206/file
DEBUG: Request s3/ListObjectsV2 Details:
---[ REQUEST POST-SIGN ]-----------------------------
GET /?list-type=2&prefix=aostow%2F19bb0c6e83d17480924f9206%2F HTTP/1.1
Host: asd-tmp.s3.amazonaws.com
User-Agent: aws-sdk-go/1.40.25 (go1.18.3; linux; amd64)
Authorization: AWS4-HMAC-SHA256 Credential=AKIAWP5DEQE432HWSWWU/20221028/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=36910278f845f4e08f6959d0821e3df29b4fdf633facc0a70657ccecba2ed7d7
X-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
X-Amz-Date: 20221028T183948Z
Accept-Encoding: gzip
-----------------------------------------------------
DEBUG: Response s3/ListObjectsV2 Details:
---[ RESPONSE ]--------------------------------------
HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Fri, 28 Oct 2022 18:39:49 GMT
Server: AmazonS3
X-Amz-Bucket-Region: us-east-1
X-Amz-Id-2: Tr7r7r2dyW3MTw9SkOlsfyXxC+R3ORNbZv2RNSqJPZaXqy4mn8FalkBjksHDa0dSQR2BoDQ/ibk=
X-Amz-Request-Id: GH4N1ZXGZ02TE3ST
-----------------------------------------------------
/tmp/19bb0c6e83d17480924f9206/out/file doesn't exist
s5cmd cp exited with 0
s5cmd is listing objects then exiting 0 without copying.
Seeing the same lack of strong consistency with the script below, in the latest beta v2.1.0-beta.1-3e08061, as well as v2.0.0 and v1.4.0-d7a0dda:
#!/bin/bash
echo "### COPY"
s5cmd --log debug --json --stat cp s3://bucket1/my.parquet s3://bucket2/my.parquet
echo "### LIST, copied item missing"
s5cmd --log debug --json --stat ls s3://bucket2/my.parquet
sleep 5
echo "### LIST2, same query after sleep, item is there"
s5cmd --log debug --json --stat ls s3://bucket2/my.parquet
It doesn't happen 100% of the time, but I'd say more than 50%. I wrote similar scripts using the AWS CLI and github.com/aws/aws-sdk-go v1.40.25, and they never exhibited this consistency problem. The file used in testing is 122 KB; s5cmd behaves the same whether it is a new file or an overwritten one.
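For reference, here's a rough Go sketch of the kind of probe described above, using aws-sdk-go v1 (bucket and key names are placeholders, not the actual test setup): copy an object, then immediately list it, expecting it to always appear under S3's read-after-write consistency.

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-1")))
	svc := s3.New(sess)

	// Copy bucket1/my.parquet -> bucket2/my.parquet.
	if _, err := svc.CopyObject(&s3.CopyObjectInput{
		Bucket:     aws.String("bucket2"),
		Key:        aws.String("my.parquet"),
		CopySource: aws.String("bucket1/my.parquet"),
	}); err != nil {
		panic(err)
	}

	// Immediately list the copied key; with S3's read-after-write
	// consistency the object should always show up here.
	out, err := svc.ListObjectsV2(&s3.ListObjectsV2Input{
		Bucket: aws.String("bucket2"),
		Prefix: aws.String("my.parquet"),
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("objects listed:", len(out.Contents))
}
```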
So this seems tied to S3 having a fast clock. For example, we did a copy at 16:27:27, failed a read at 16:27:28, and then succeeded a read at 16:27:33 that showed a LastModified time of 16:27:29. s5cmd discarded the object on the initial read due to this logic ignoring future objects: https://github.com/peak/s5cmd/pull/188/files. When we went back to s5cmd 1.0.0, it worked fine.
Thoughts on a flag to optionally disable ignoring files with future timestamps? e.g. s5cmd --show-files-with-future-lastmodified-dates ls s3://bucket2/my.parquet
https://github.com/peak/s5cmd/blob/v2.1.0-beta.1/storage/s3.go#L208 -> if mod.After(now) && ignoreFutureFiles {
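To make the race concrete, here's a small self-contained Go sketch using the timestamps from the example above (the visible helper is illustrative, not s5cmd's actual code):

```go
package main

import (
	"fmt"
	"time"
)

// visible mirrors the cutoff behavior added in peak/s5cmd#188:
// objects whose LastModified is "in the future" relative to the
// local clock are dropped from listings.
func visible(lastModified, localNow time.Time) bool {
	return !lastModified.After(localNow)
}

func main() {
	// Local clock at the time of the read that failed.
	localNow := time.Date(2022, 10, 28, 16, 27, 28, 0, time.UTC)
	// S3's clock ran ahead: the object copied at 16:27:27 (local)
	// was stamped 16:27:29 by the server.
	lastModified := time.Date(2022, 10, 28, 16, 27, 29, 0, time.UTC)
	fmt.Println(visible(lastModified, localNow)) // false: object silently dropped
}
```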
The reason this problem occurs is that the local and server clocks differ, as @mwkaufman mentioned.
https://github.com/peak/s5cmd/blob/74d21bb4f022edb0bfb69f412c4baa69153ba8c2/storage/s3.go#L333-L337
https://github.com/peak/s5cmd/blob/74d21bb4f022edb0bfb69f412c4baa69153ba8c2/storage/s3.go#L345-L349
These checks were added for #168. ListObjectsV2 paginates when there are more than 1000 objects. If you move an object that matches the wildcard pattern and the moved object appears in a subsequent ListObjectsV2 call, it gets recopied because it still matches the pattern, creating an infinite loop. This happens because s5cmd does not list all the objects before it starts copying; it performs both operations simultaneously. gsutil, by contrast, lists first and then copies, and aws-cli handles wildcards differently than s5cmd does.
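Here's a toy simulation of that failure mode (simplified, not s5cmd code): an object moved to a key that still matches the source wildcard keeps reappearing on later listing pages, so the operation never converges without some cutoff.

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	// Keys arrive "page by page"; a moved object whose new key still
	// matches the wildcard shows up on a later page and is moved again.
	queue := []string{"logs/a", "logs/b"}
	moves := 0
	for len(queue) > 0 && moves < 8 { // cap the otherwise endless loop
		key := queue[0]
		queue = queue[1:]
		if strings.HasPrefix(key, "logs/") { // matches "logs/*"
			newKey := key + ".moved" // destination is still under logs/
			queue = append(queue, newKey)
			moves++
			fmt.Println("moved", key, "->", newKey)
		}
	}
	fmt.Println("without a cutoff this would never terminate")
}
```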
Another approach is #567, where we request the time from the response header. But LastModified has millisecond precision, while the date returned in the header has only second precision. So imagine you are operating as in #168. Omitting the dates: you start at 18:00:00.555; the actual server time is 18:00:00.750, but the time returned in the header is 18:00:01. You move an object and it gets a LastModified time of 18:00:00.850. That object slips past the filter, and #168 reoccurs. Granted, this squeezes into perhaps 0.0001% of the possibilities, and you could argue the same problem still exists in the current implementation.
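A worked example of that precision gap in Go, using the hypothetical times above:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// True server time is 18:00:00.750, but the Date header can only
	// carry whole seconds, so it reads 18:00:01.
	headerTime, err := http.ParseTime("Fri, 28 Oct 2022 18:00:01 GMT")
	if err != nil {
		panic(err)
	}
	// The object was moved at 18:00:00.850, i.e. after the operation
	// started, so the filter should skip it ...
	lastModified := time.Date(2022, 10, 28, 18, 0, 0, 850_000_000, time.UTC)
	// ... but compared against the coarser header time it slips through.
	fmt.Println("skipped:", lastModified.After(headerTime)) // skipped: false
}
```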
So, what are some possible solutions?

- List all the objects first, then start copying, as gsutil does. This creates memory pressure if there are too many keys.
- Remove the check from ls and move it only to the commands that need it.

After further research and discussion, we have concluded that the current solution is the best-effort solution at the moment. We are leaning towards adding a flag to disable the future-object check when ls is called.
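As a sketch of that direction (the flag wiring here is hypothetical, not an existing s5cmd option): the guard could stay on for wildcard cp/mv, where the #168 loop matters, and off for ls.

```go
package main

import (
	"fmt"
	"time"
)

// skipFutureObject applies the future-object guard only when asked to;
// a plain ls could pass guard=false and still see clock-skewed objects.
func skipFutureObject(lastModified, now time.Time, guard bool) bool {
	return guard && lastModified.After(now)
}

func main() {
	now := time.Now()
	skewed := now.Add(2 * time.Second) // LastModified ahead of the local clock
	fmt.Println(skipFutureObject(skewed, now, true))  // cp/mv: true, object skipped
	fmt.Println(skipFutureObject(skewed, now, false)) // ls: false, object listed
}
```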
Version: 2.0.0
I find that files uploaded to S3 are not always immediately available for access.
It was my understanding that this should not happen given S3's strong consistency. Is s5cmd exiting before the PUT request returns?
This tends to fail after a few iterations for me.
Note that a similar variant of this test never fails when file is transferred directly instead of by directory/prefix.