guy-har opened this issue 1 month ago
Root Cause Analysis: The number of parts returned by the list operation is clearly far lower than the number copied; we can observe hundreds of copied parts in the GCS console.
Possible Root Cause:
Possible Solution: Compare with S3: there we don't build the list of parts via a list command; instead we transform the list of parts we have been accumulating with each upload. Maybe that is the way to go here as well.
I suspect that using the `StartOffset` field will help me page the list operation: https://pkg.go.dev/cloud.google.com/go/storage#section-readme
listRes.txt is an example where it reported: "part list mismatch - expected 87 parts, got 954".
This query attribute selection seems to help: `err := query.SetAttrSelection([]string{"Name", "Etag"})`. Sometimes it passes; sometimes the list returns most, but not all, of the objects (usually the later ones are missing).
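The `StartOffset` paging idea can be sketched against an in-memory stand-in for the bucket: resume each list call lexicographically just past the last name returned. In the real adapter this would set `storage.Query.StartOffset` (together with `SetAttrSelection`); this sketch only models the paging semantics, and `listPage`/`listAll` are hypothetical helpers, not lakeFS or GCS API.

```go
package main

import (
	"fmt"
	"sort"
)

// listPage models one GCS list call: it returns up to pageSize names that are
// lexicographically >= startOffset, mirroring how storage.Query's StartOffset
// field behaves. `objects` stands in for the bucket contents.
func listPage(objects []string, startOffset string, pageSize int) []string {
	sort.Strings(objects)
	var page []string
	for _, name := range objects {
		if name >= startOffset && len(page) < pageSize {
			page = append(page, name)
		}
	}
	return page
}

// listAll pages through everything by restarting each call just past the last
// name seen (appending "\x00" makes the next offset strictly greater).
func listAll(objects []string, pageSize int) []string {
	var all []string
	offset := ""
	for {
		page := listPage(objects, offset, pageSize)
		if len(page) == 0 {
			return all
		}
		all = append(all, page...)
		offset = page[len(page)-1] + "\x00"
	}
}

func main() {
	objs := []string{"k.part_00003", "k.part_00001", "k.part_00002"}
	fmt.Println(listAll(objs, 2)) // [k.part_00001 k.part_00002 k.part_00003]
}
```

If the truncation we see comes from a single unpaged list call, restarting from the last offset like this would recover the tail of the listing.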
Maybe the way to go is to get rid of the list operation entirely; we already know the part names...
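Since the part names are deterministic, they could be derived instead of listed. The `.part_%05d` suffix below is an assumption taken from the object names in the error logs (e.g. `...cq2hbreiuvoinvg6qui0.part_00001`); the real adapter may format them differently.

```go
package main

import "fmt"

// partObjectNames derives the expected part object keys for a multipart
// upload instead of listing the bucket. The naming scheme is inferred from
// the logs in this issue, not taken from the lakeFS source.
func partObjectNames(uploadKey string, numParts int) []string {
	names := make([]string, 0, numParts)
	for i := 1; i <= numParts; i++ {
		names = append(names, fmt.Sprintf("%s.part_%05d", uploadKey, i))
	}
	return names
}

func main() {
	for _, n := range partObjectNames("data/gf4v5neiuvoinvg6qugg/cq2hbreiuvoinvg6qui0", 3) {
		fmt.Println(n)
	}
}
```

With this, CompleteMultipartUpload no longer depends on list consistency at all: it only needs the upload key and the part count from the client's request.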
Update:
Test results:

- 8GB file, 5MB parts: FAIL

```
ERROR [2024-07-03T12:20:48+03:00]pkg/block/gs/adapter.go:514 pkg/block/gs.(*Adapter).CompleteMultiPartUpload CompleteMultipartUpload failed error="googleapi: Error 404: No such object: nadav_bucket_7822/data/gf4v5neiuvoinvg6qugg/cq2hbreiuvoinvg6qui0.part_00001, notFound" host="localhost:8000" key=data/gf4v5neiuvoinvg6qugg/cq2hbreiuvoinvg6qui0 matched_host=false method=POST operation_id=post_object path=bigfile.txt physical_address=data/gf4v5neiuvoinvg6qugg/cq2hbreiuvoinvg6qui0 qualified_key=data/gf4v5neiuvoinvg6qugg/cq2hbreiuvoinvg6qui0 qualified_ns=nadav_bucket_7822 ref=main repository=gcptest request_id=c9042832-c456-48f4-94fe-e006f744d5f2 service_name=s3_gateway upload_id=data/gf4v5neiuvoinvg6qugg/cq2hbreiuvoinvg6qui0 user=admin
```

The failure is very strange, since the file is clearly visible in the GCS console.

- 8GB, 6MB: FAIL (same error)
- 8GB, 7MB: SUCCEED
- 8GB, 8MB: SUCCEED
- 8GB, 16MB: SUCCEED
- 10GB, 8MB: FAIL

```
ERROR [2024-07-03T14:54:36+03:00]pkg/block/gs/adapter.go:514 pkg/block/gs.(*Adapter).CompleteMultiPartUpload CompleteMultipartUpload failed error="context canceled" host="localhost:8000" key=data/gf4siomiuvoknlkh9bcg/cq2jjg6iuvoknlkh9bd0 matched_host=false method=POST operation_id=post_object path=bigfile10GB.txt physical_address=data/gf4siomiuvoknlkh9bcg/cq2jjg6iuvoknlkh9bd0 qualified_key=data/gf4siomiuvoknlkh9bcg/cq2jjg6iuvoknlkh9bd0 qualified_ns=nadav_bucket_7822 ref=main repository=gcptest request_id=44b14c25-0336-44dd-8432-4071438df456 service_name=s3_gateway upload_id=data/gf4siomiuvoknlkh9bcg/cq2jjg6iuvoknlkh9bd0 user=admin
ERROR [2024-07-03T14:54:36+03:00]pkg/gateway/operations/postobject.go:122 pkg/gateway/operations.(*PostObject).HandleCompleteMultipartUpload could not complete multipart upload error="context canceled" host="localhost:8000" matched_host=false method=POST operation_id=post_object path=bigfile10GB.txt physical_address=data/gf4siomiuvoknlkh9bcg/cq2jjg6iuvoknlkh9bd0 ref=main repository=gcptest request_id=44b14c25-0336-44dd-8432-4071438df456 service_name=s3_gateway upload_id=data/gf4siomiuvoknlkh9bcg/cq2jjg6iuvoknlkh9bd0 user=admin
```
With the fix to the validation code, both the 8GB and 10GB cases now succeed, without touching the client timeout. The AWS CLI times out on the last call after 60 seconds, retries, and completes the upload!
I wonder if ComposeAll can be written concurrently so that it doesn't time out...
When using lakeFS with GCS as the block adapter, copying ~8GB files via the AWS CLI through the S3 Gateway (default multipart upload chunk size of 8MB) fails on CompleteMultipartUpload due to an unexpected number of parts.
Addition by Nadav: OK, I believe that I reproduced the problem (with Elad's help).
Command line:

```
nadavsteindler@Nadavs-MacBook-Pro Downloads % aws --endpoint-url http://localhost:8000 s3 cp ~/Desktop/bigfile.txt s3://gcptest/main/
upload failed: ../Desktop/bigfile.txt to s3://gcptest/main/bigfile.txt An error occurred (InternalError) when calling the CompleteMultipartUpload operation (reached max retries: 2): We encountered an internal error, please try again.
```
lakeFS log:

```
ERROR [2024-06-30T19:01:47+03:00]pkg/gateway/operations/postobject.go:122 pkg/gateway/operations.(*PostObject).HandleCompleteMultipartUpload could not complete multipart upload error="part list mismatch - expected 76 parts, got 954: multipart part list mismatch" host="localhost:8000" matched_host=false method=POST operation_id=post_object path=bigfile.txt physical_address=data/gf6or3eiuvokunnmsks0/cq0nt2miuvokunnmsli0 ref=main repository=gcptest request_id=737df8cf-2535-4a4b-8e8a-5a784ce9433a service_name=s3_gateway upload_id=data/gf6or3eiuvokunnmsks0/cq0nt2miuvokunnmsli0 user=admin
```
Ah, but this is after the CLI does a retry; the error from the first try is:

```
ERROR [2024-07-03T16:13:07+03:00]pkg/block/gs/adapter.go:525 pkg/block/gs.(*Adapter).CompleteMultiPartUpload CompleteMultipartUpload failed error="context canceled" host="localhost:8000" key=data/gf4rde6iuvolaihdbt40/cq2kot6iuvolaihdbt4g matched_host=false method=POST operation_id=post_object path=bigfile.txt physical_address=data/gf4rde6iuvolaihdbt40/cq2kot6iuvolaihdbt4g qualified_key=data/gf4rde6iuvolaihdbt40/cq2kot6iuvolaihdbt4g qualified_ns=nadav_bucket_7822 ref=main repository=gcptest request_id=486dfb58-1cec-46e1-aca6-0e534bef4bfa service_name=s3_gateway upload_id=data/gf4rde6iuvolaihdbt40/cq2kot6iuvolaihdbt4g user=admin
ERROR [2024-07-03T16:13:07+03:00]pkg/gateway/operations/postobject.go:122 pkg/gateway/operations.(*PostObject).HandleCompleteMultipartUpload could not complete multipart upload error="context canceled" host="localhost:8000" matched_host=false method=POST operation_id=post_object path=bigfile.txt physical_address=data/gf4rde6iuvolaihdbt40/cq2kot6iuvolaihdbt4g ref=main repository=gcptest request_id=486dfb58-1cec-46e1-aca6-0e534bef4bfa service_name=s3_gateway upload_id=data/gf4rde6iuvolaihdbt40/cq2kot6iuvolaihdbt4g user=admin
```
This seems to indicate a timeout in the AWS CLI.
| File size | Part size | Result |
|---|---|---|
| 1GB | 1MB | SUCCEED |
| 4GB | 1MB | SUCCEED |
| 6GB | 8MB | SUCCEED |
| 8GB | 5MB | part list mismatch - expected 627 parts, got 1526 |
| 8GB | 8MB | part list mismatch - expected 76 parts, got 954 (also "expected 30 parts, got 954", "expected 117 parts, got 954", etc.) |
| 8GB | 16MB | SUCCEED |
| 10GB | 8MB | part list mismatch - expected 46 parts, got 1193 |
BUT if I increase the AWS CLI timeout, it succeeds:

```
aws --cli-read-timeout 300 --endpoint-url http://localhost:8000 s3 cp ~/Desktop/bigfile.txt s3://gcptest/main/
```
Root Cause: When we upload large files with many parts to merge, the AWS CLI hits a timeout on the CompleteMultipartUpload call. It then retries the call, and the retry fails because the number of parts no longer matches: we have already concatenated some of the parts.
Note on `composeMultipartUploadParts` in `pkg/block/gs/adapter.go`:
Proposed fix: We can support retry after a timeout by loosening the validation criteria: instead of insisting on an exact match between the request's part list and the listed objects, we can use a check that still holds after some parts have already been composed.
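One way the check could be loosened (an illustration of the idea, not the actual lakeFS fix): rather than requiring the listed object count to equal the request's part count, verify only that every listed object belongs to this upload. A retried Complete call that arrives after some parts were composed away then still passes. The function name and the `.part_` naming convention are assumptions inferred from the logs above.

```go
package main

import (
	"fmt"
	"strings"
)

// validatePartsLoose is an illustrative, loosened validation: it checks that
// every listed object is a part of this upload (shares the upload key's part
// prefix), but tolerates parts that are missing because a previous, timed-out
// Complete call already composed them into intermediate objects.
func validatePartsLoose(uploadKey string, listed []string) error {
	prefix := uploadKey + ".part_"
	for _, name := range listed {
		if !strings.HasPrefix(name, prefix) {
			return fmt.Errorf("unexpected object %q for upload %q", name, uploadKey)
		}
	}
	return nil
}

func main() {
	// Retry scenario: only two of the original parts are still present.
	fmt.Println(validatePartsLoose("k", []string{"k.part_00003", "k.part_00007"})) // <nil>
	fmt.Println(validatePartsLoose("k", []string{"other.part_00001"}))
}
```

The trade-off is that this no longer detects a genuinely incomplete upload by count alone, so it would need to be paired with a final size or ETag check on the composed object.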
Fix option 2: Don't support retry at all. On failure, clear all the uploaded part objects and return the timeout error; the client has to start the upload over. If it retries anyway, it should immediately get a file-not-found error. This simplifies the code: there is no need to list objects, you just use the original part names, and if one is not found because we already composed it into another object, return an error.