pingcap / tidb

TiDB - the open-source, cloud-native, distributed SQL database designed for modern applications.
https://pingcap.com
Apache License 2.0
37.26k stars 5.84k forks source link

Dumpling will not retry upload to GCS if the server returned 503 #56127

Open kennytm opened 1 month ago

kennytm commented 1 month ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

(I used mitmproxy to inject the 503 error. For testing there should be some zero-external-dependency means to do so :thinking: Also I used a local fake-gcs-server serving HTTP to avoid distractions of installing the self-signed TLS CA.)

  1. Get mitmproxy

  2. Prepare the following script, which will inject 503 for the first two requests to */o (the URL for uploading objects to GCS)

    # inject-503.py
    from mitmproxy import http
    
    class InjectErrorAddon:
        def __init__(self):
            self._counter = 0
    
        def request(self, flow: http.HTTPFlow) -> None:
            path = flow.request.path_components
            if path and path[-1] == 'o':
                self._counter += 1
                if self._counter <= 2:
                    flow.response = http.Response.make(503, '{}')
    
    addons = [InjectErrorAddon()]
  3. Run mitmproxy loaded with this script

    mitmproxy -s inject-503.py
  4. Patch dumpling to use this proxy:

    diff --git a/dumpling/export/config.go b/dumpling/export/config.go
    index 52337ec732..e960d13c7a 100644
    --- a/dumpling/export/config.go
    +++ b/dumpling/export/config.go
    @@ -12,6 +12,8 @@ import (
        "strings"
        "text/template"
        "time"
    +   "net/http"
    +   "net/url"
    
        "github.com/coreos/go-semver/semver"
        "github.com/docker/go-units"
    @@ -711,8 +713,19 @@ func (conf *Config) createExternalStorage(ctx context.Context) (storage.External
            return nil, errors.Trace(err)
        }
    
    +   tx := http.DefaultTransport.(*http.Transport)
    +   tx.Proxy = func(req *http.Request) (*url.URL, error) {
    +       return url.Parse("http://127.0.0.1:8080")
    +   }
    +
        // TODO: support setting httpClient with certification later
    -   return storage.New(ctx, b, &storage.ExternalStorageOptions{})
    +   return storage.New(ctx, b, &storage.ExternalStorageOptions{
    +       HTTPClient:    &http.Client{
    +           Transport: tx,
    +       },
    +   })
    }
    
    const (
  5. Run dumpling.

2. What did you expect to see? (Required)

Given that we only inject 503 twice, Dumpling should be able to successfully upload the file on its 3rd try and the whole process succeed.

Inside the mitmproxy console, we should be able to see two 503 responses like

>>17:59:19 HTTP  POST  127.0.0.1  /upload/storage/v1/b/db-transfer/o?alt=json&name=dumpling%2Faaa.tab2_a.000000000.sql&prettyPrint=false&projection=full&uploadType=multipart    503                        2b  10ms 
  17:59:19 HTTP  POST  127.0.0.1  /upload/storage/v1/b/db-transfer/o?alt=json&name=dumpling%2Faaa.tab2_b.000000000.sql&prettyPrint=false&projection=full&uploadType=multipart    503                        2b   9ms 
  17:59:19 HTTP  POST  127.0.0.1  /upload/storage/v1/b/db-transfer/o?alt=json&name=dumpling%2Faaa.tab2_a.000000000.sql&prettyPrint=false&projection=full&uploadType=multipart    200    application/json  924b  15ms 
  17:59:19 HTTP  POST  127.0.0.1  /upload/storage/v1/b/db-transfer/o?alt=json&name=dumpling%2Faaa.tab2_b.000000000.sql&prettyPrint=false&projection=full&uploadType=multipart    200    application/json  924b  13ms 
  17:59:19 HTTP  POST  127.0.0.1  /upload/storage/v1/b/db-transfer/o?alt=json&name=dumpling%2Fmetadata&prettyPrint=false&projection=full&uploadType=multipart                    200    application/json  844b  13ms 

3. What did you see instead (Required)

Dumpling failed without any retry, with logs like

[2024/09/18 17:57:19.643 +08:00] [WARN] [writer_util.go:514] ["fail to close file"] [path=gcs://db-transfer/dumpling/aaa.tab2_a.000000000.sql] [error="googleapi: got HTTP response code 503 with body: {}"] [errorVerbose="«snip»"]
[2024/09/18 17:57:19.644 +08:00] [WARN] [writer_util.go:514] ["fail to close file"] [path=gcs://db-transfer/dumpling/aaa.tab2_b.000000000.sql] [error="context canceled"] [errorVerbose="«snip»"]
...
[2024/09/18 17:57:19.644 +08:00] [ERROR] [main.go:78] ["dump failed error stack info"] [error="googleapi: got HTTP response code 503 with body: {}"] [errorVerbose="«snip»"]

4. What is your TiDB version? (Required)

Dumpling v8.3.0

kennytm commented 1 month ago

The reason GCS did not retry is because the upload is not considered an "idempotent operation". The default retry policy is RetryIdempotent, which said:

RetryIdempotent causes only idempotent operations to be retried when the service returns a transient error. Using this policy, fully idempotent operations (such as ObjectHandle.Attrs()) will always be retried. Conditionally idempotent operations (for example ObjectHandle.Update()) will be retried only if the necessary conditions have been supplied (in the case of ObjectHandle.Update() this would mean supplying a Conditions.MetagenerationMatch condition is required).

Patching the GCS to use RetryAlways will make Dumpling perform like the expected behavior

diff --git a/br/pkg/storage/gcs.go b/br/pkg/storage/gcs.go
index 0f1d8a2418..b4893657ba 100644
--- a/br/pkg/storage/gcs.go
+++ b/br/pkg/storage/gcs.go
@@ -432,7 +432,7 @@ func (s *GCSStorage) Reset(ctx context.Context) error {
            if err != nil {
                return errors.Trace(err)
            }
-           client.SetRetry(storage.WithErrorFunc(shouldRetry))
+           client.SetRetry(storage.WithErrorFunc(shouldRetry), storage.WithPolicy(storage.RetryAlways))
            s.clients[i] = client
            return nil
        })

but I'm not sure if we should use this easy solution which "can lead to race conditions and other conflicts", or properly making it idempotent by supplying the ifMetagenerationMatch/ifGenerationMatch preconditions.

BornChanger commented 1 month ago

@Benjamin2037 will your team fix the problem?

Benjamin2037 commented 1 month ago

@BornChanger OK

OliverS929 commented 1 month ago

I can take a look at this, but please let me know if it is urgent.

kennytm commented 1 month ago

@OliverS929 We have asked customer tried using the XML API (s3://) to see if it can workaround the issue. It is currently 23:50 in SF, I don't have the urgency assessment right now.

Yui-Song commented 1 month ago

The customer used local disks as a workaround. But as it's such a common scenario on GCP, it would also be helpful if this could be fixed.