peak / s5cmd

Parallel S3 and local filesystem execution tool.
MIT License

Retries failed when aws-sdk-go not able to send the request and s5cmd stuck #554

Open · charz opened this issue 1 year ago

charz commented 1 year ago

Just saw that s5cmd is doing some retries, but it gets stuck when aws-sdk-go raises "send request failed".

Logs:

    cp s3://test/object
    DEBUG retryable error: RequestError: send request failed
    caused by: GET "https://test-server/test/object": EOF

From storage/s3.go

    if shouldRetry && req.Error != nil {
        err := fmt.Errorf("retryable error: %v", req.Error)
        msg := log.DebugMessage{Err: err.Error()}
        log.Debug(msg)
    }

From aws-sdk-go

    // Catch all request errors, and let the default retrier determine
    // if the error is retryable.
    r.Error = awserr.New(request.ErrCodeRequestError, "send request failed", err)

And it seems like the SDK won't retry when it hits ErrCodeRequestError (https://github.com/aws/aws-sdk-go/blob/main/aws/request/retryer.go#L181-L184).

Should we just raise an error instead of leaving the whole process stuck?
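
For context, here is a minimal sketch of how such failures could be surfaced rather than retried, assuming aws-sdk-go v1's request.Retryer interface. The failFastRetryer type and its wiring are hypothetical and not part of s5cmd's actual code:

    package main

    import (
        "log"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/awserr"
        "github.com/aws/aws-sdk-go/aws/client"
        "github.com/aws/aws-sdk-go/aws/request"
        "github.com/aws/aws-sdk-go/aws/session"
    )

    // failFastRetryer wraps the SDK's DefaultRetryer and refuses to retry
    // "send request failed" errors, so the operation fails loudly instead
    // of appearing to hang.
    type failFastRetryer struct {
        client.DefaultRetryer
    }

    func (r failFastRetryer) ShouldRetry(req *request.Request) bool {
        if aerr, ok := req.Error.(awserr.Error); ok && aerr.Code() == request.ErrCodeRequestError {
            return false // e.g. the connection was closed with EOF mid-download
        }
        return r.DefaultRetryer.ShouldRetry(req)
    }

    func main() {
        sess, err := session.NewSession(&aws.Config{
            Retryer: failFastRetryer{client.DefaultRetryer{NumMaxRetries: 3}},
        })
        if err != nil {
            log.Fatal(err)
        }
        _ = sess // an S3 client built from this session fails fast on EOFs
    }

Whether failing fast or surfacing the error some other way is the right behavior for s5cmd is exactly the question above; the sketch only shows where the decision point lives in the SDK.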

HugoKuo commented 1 year ago

    cp s3://abcData/x.test /mnt/sync/x.test
    DEBUG retryable error: RequestError: send request failed
    caused by: Get "https://xxx.xxx.xx/abcData/x.test": EOF
    DEBUG: Request s3/GetObject Details:
    ---[ REQUEST POST-SIGN ]-----------------------------
    GET /abcData/x.test HTTP/1.1
    Host: xxx.xxx.xx
    User-Agent: aws-sdk-go/1.40.25 (go1.18.3; linux; amd64) S3Manager
    Authorization: AWS4-HMAC-SHA256 Credential=team-adlr-synth-data/20230320/us-east-1/s3/aws4_request, SignedHeaders=host;range;x-amz-content-sha256;x-amz-date, Signature=451c8e51153482f2375f19514b0dbc0b767121cb1d8e91fe4d0faad9e52b1198
    Range: bytes=0-52428799
    X-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
    X-Amz-Date: 20230320T134820Z

    DEBUG: Response s3/GetObject Details:
    ---[ RESPONSE ]--------------------------------------
    HTTP/1.1 206 Partial Content
    Content-Length: 6220816
    Content-Range: bytes 0-6220815/6220816
    Content-Type: application/octet-stream
    Date: Mon, 20 Mar 2023 13:48:20 GMT
    Etag: "ea62efc01261c9bf87bcfc9b0b7b3ab5"
    Last-Modified: Fri, 17 Mar 2023 03:35:39 GMT
    X-Amz-Id-2: tx94434c2aa17f4d869461d-0064186424
    X-Amz-Request-Id: tx94434c2aa17f4d869461d-0064186424
    X-Openstack-Request-Id: tx94434c2aa17f4d869461d-0064186424
    X-Trans-Id: tx94434c2aa17f4d869461d-0064186424

May I know the reason for the EOF? Is it from saving data to the local disk?

PDD777 commented 10 months ago

It appears that this might be similar to what I'm finding as well.

I'm copying two directories within the bucket, and on some files I would get an "unexpected EOF" error. This was using the sync command with only --endpoint-url set, as we use Linode and not the very, very pricey AWS.

Thinking that it might be the files themselves, I tried a few files individually with no error, which confirms it's not the files or the endpoint.

So I moved to the cp -n -u -s command and got this today.

Same "unexpected EOF" error on some files, but it spat out this as well:

    panic: runtime error: invalid memory address or nil pointer dereference
    [signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0xa35623]

    goroutine 81859 [running]:
    github.com/peak/s5cmd/v2/command.Copy.shouldOverride({0xc0000a1400, 0xc0000a15e0, {0xb8d255, 0x2}, {0xc0000dcfc0, 0x68}, 0x0, 0x1, 0x1, 0x1, ...}, ...)
        /home/runner/work/s5cmd/s5cmd/command/cp.go:850 +0x1a3
    github.com/peak/s5cmd/v2/command.Copy.doDownload({0xc0000a1400, 0xc0000a15e0, {0xb8d255, 0x2}, {0xc0000dcfc0, 0x68}, 0x0, 0x1, 0x1, 0x1, ...}, ...)
        /home/runner/work/s5cmd/s5cmd/command/cp.go:613 +0x118
    github.com/peak/s5cmd/v2/command.Copy.prepareDownloadTask.func1()
        /home/runner/work/s5cmd/s5cmd/command/cp.go:567 +0xf3
    github.com/peak/s5cmd/v2/parallel.(*Manager).Run.func1()
        /home/runner/work/s5cmd/s5cmd/parallel/parallel.go:57 +0x8a
    created by github.com/peak/s5cmd/v2/parallel.(*Manager).Run
        /home/runner/work/s5cmd/s5cmd/parallel/parallel.go:53 +0xca

It looks like a memory leak: for large copies the application fails to release allocated memory and the cp process cannot proceed.
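
Whatever the memory situation, the trace itself reports a nil pointer dereference inside Copy.shouldOverride. As a purely hypothetical illustration (the Object type and shouldOverwrite helper below are invented for this sketch, not s5cmd's actual code), this is the kind of guard that avoids that class of panic when a preceding stat call fails and returns no object:

    package main

    import (
        "errors"
        "fmt"
        "time"
    )

    // Object is a stand-in for the metadata returned by a stat call; it is
    // nil when that call failed (for example after an EOF like the one above).
    type Object struct {
        ModTime time.Time
    }

    // shouldOverwrite decides whether the destination should be replaced.
    // It refuses to touch nil objects instead of dereferencing them.
    func shouldOverwrite(src, dst *Object, statErr error) (bool, error) {
        if statErr != nil {
            return false, fmt.Errorf("stat failed, skipping overwrite decision: %w", statErr)
        }
        if src == nil || dst == nil {
            return false, errors.New("missing object metadata")
        }
        // overwrite only when the source is newer than the destination
        return src.ModTime.After(dst.ModTime), nil
    }

    func main() {
        ok, err := shouldOverwrite(nil, nil, errors.New("connection reset"))
        fmt.Println(ok, err) // prints a clean error instead of crashing with SIGSEGV
    }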

The system we are running this on has: RAM: 8 GB, CPU cores: 2x2, node: Proxmox VM.

s5cmd version v2.2.2-48f7e59

We are trying to cp 6+ TB / 25,000 objects from S3 to a local NFS store as an offsite backup.

I might open a new bug, but I'll leave this comment here too, as I'm not sure how related this might be.