wal-g / wal-g

Archival and Restoration for databases in the Cloud

Retry logic for S3-like storage #583

Open chvsunny opened 4 years ago

chvsunny commented 4 years ago

Hi,

We are backing up our PostgreSQL databases to on-prem S3-like storage. Backups generally run fine, but intermittently they fail with network errors like "connection reset by peer"; the next run then completes successfully.

Can retry logic be implemented in wal-g to avoid backup failures from these kinds of intermittent issues?

Thanks, Sunny.

TheKigen commented 4 years ago

Hi,

So I would like to point out that this issue is more relevant now that Backblaze has released its new S3-compatible API. So far no options I've set can fix it; I always randomly get 400 Bad Requests from their API. I have opened a ticket with them about this, but it would also be solved if WAL-G had retry logic that simply retried the upload a configured number of times instead of throwing an error, since as it stands any failure requires the entire backup to start from the very beginning.

ERROR: 2020/05/11 16:53:28.857629 failed to upload 'production/basebackups_005/base_0000000100008878000000B8/tar_partitions/part_042.tar.br' to bucket 'arandombucket': MultipartUpload: upload multipart failed
        upload id: 4_z979fc893dd51080977200f13_f201c26ceb26895e3_d20200511_m165256_c002_v0001124_t0001
caused by: InvalidRequest: <!doctype html><html lang="en"><head><title>HTTP Status 400 – Bad Request</title><style type="text/css">body {font-family:Tahoma,Arial,sans-serif;} h1, h2, h3, b {color:white;background-color:#525D76;} h1 {font-size:22px;} h2 {font-size:16px;} h3 {font-size:14px;} p {font-size:12px;} a {color:black;} .line {height:1px;background-color:#525D76;border:none;}</style></head><body><h1>HTTP Status 400 – Bad Request</h1><hr class="line" /><p><b>Type</b> Exception Report</p><p><b>Message</b> Invalid character found in method name. HTTP method names must be tokens</p><p><b>Description</b> The server cannot or will not process the request due to something that is perceived to be a client error (e.g., malformed request syntax, invalid request message framing, or deceptive request routing).</p><p><b>Exception</b></p><pre>java.lang.IllegalArgumentException: Invalid character found in method name. HTTP method names must be tokens
        org.apache.coyote.http11.Http11InputBuffer.parseRequestLine(Http11InputBuffer.java:415)
        org.apache.coyote.http11.BzContinueHandlingHttp11Processor.service(BzContinueHandlingHttp11Processor.java:286)
        org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:65)
        org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:860)
        org.apache.tomcat.util.net.Nio2Endpoint$SocketProcessor.doRun(Nio2Endpoint.java:1686)
        org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
        org.apache.tomcat.util.net.AbstractEndpoint.processSocket(AbstractEndpoint.java:1104)
        org.apache.tomcat.util.net.Nio2Endpoint$Nio2SocketWrapper$2.completed(Nio2Endpoint.java:599)
        org.apache.tomcat.util.net.Nio2Endpoint$Nio2SocketWrapper$2.completed(Nio2Endpoint.java:577)
        org.apache.tomcat.util.net.SecureNio2Channel$1.completed(SecureNio2Channel.java:969)
        org.apache.tomcat.util.net.SecureNio2Channel$1.completed(SecureNio2Channel.java:898)
        sun.nio.ch.Invoker.invokeUnchecked(Invoker.java:126)
        sun.nio.ch.Invoker$2.run(Invoker.java:218)
        sun.nio.ch.AsynchronousChannelGroupImpl$1.run(AsynchronousChannelGroupImpl.java:112)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
        java.lang.Thread.run(Thread.java:748)
</pre><p><b>Note</b> The full stack trace of the root cause is available in the server logs.</p><hr class="line" /><h3>Apache Tomcat/9.0.30</h3></body></html>
        status code: 400, request id: e91f61bdb5c678ba, host id: aN2FmhzgxM+VkLzH9OMs50DfMMOFmXTP1
ERROR: 2020/05/11 16:53:28.857757 Unable to complete uploads
kethineni1 commented 4 years ago

I face the same issue with OpenStack Swift. For wal-push I run it as 'timeout wal-push %p'; otherwise wal-push hangs forever. For backups, I was able to add Go retry logic in wal-g and it appears to work, but in the end it still fails due to some other timeout.

src/github.com/wal-g/wal-g/internal

// begin new block
err := try.Do(func(attempt int) (bool, error) {
    var err error
    err = uploader.Upload(path, NewNetworkLimitReader(pipeReader))
    tracelog.ErrorLogger.Printf("##### attempt %d: could not upload '%s' (NewNetworkLimitReader error)\n", attempt, path)
    // if err != nil {
    //     time.Sleep(1 * time.Minute) // wait a minute
    // }
    return attempt < 30, err // retry up to 30 times
})
// end of new block

// original line
// err := uploader.Upload(path, NewNetworkLimitReader(pipeReader))

I think the above change makes it retry, but because of the time lost I am likely running into the error below:

src/github.com/wal-g/wal-g/internal

func (tarBall *StorageTarBall) AwaitUploads() {
    tarBall.uploader.waitGroup.Wait()
    if tarBall.uploader.Failed.Load().(bool) {
        tracelog.ErrorLogger.Fatal("Unable to complete uploads")
    }
}

Still need to figure out how to increase that wait, or send it back to keep retrying even then.

I switched to pgbackrest for now to get some features: 1. restarting a failed backup and 2. delta restore, which saves an amazing amount of time. That comes at the cost of losing 1. wal-g's support for Swift and S3, 2. better Brotli compression, 3. page-level backups, and 4. a much simpler model compared to pgbackrest.

serv-merv commented 4 years ago

Hello. Are there any updates regarding this issue? I have hit it in production and cannot take backups of my database normally. backup-push always fails with connection reset by peer from the S3 side.

x4m commented 4 years ago

The S3 SDK already has built-in retry logic. You can increase MaxRetries in the code https://github.com/wal-g/storages/blob/4c053af006cc777750f06c13f2e1746a23bc0613/s3/folder.go#L38 but at some point I think we should cover this with our own internal retries too, just as backup-fetch already does... GCS storage suffers from this too.
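For context on where that knob lives: the retry count in question is the standard aws-sdk-go MaxRetries setting that the linked folder.go passes when building the S3 session. A minimal sketch of raising it, with an illustrative value of 30 rather than anything wal-g actually ships, would look roughly like this:

import (
    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/s3"
)

// newS3Client builds an S3 client whose requests the SDK retries up to
// MaxRetries times on throttling, 5xx and some connection errors.
func newS3Client() (*s3.S3, error) {
    sess, err := session.NewSession(&aws.Config{
        MaxRetries: aws.Int(30), // illustrative; later in the thread the existing value is said to be 15
    })
    if err != nil {
        return nil, err
    }
    return s3.New(sess), nil
}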

dtseiler commented 4 years ago

@x4m so if I hit that MultipartUpload: upload multipart failed error during backups, does that mean it has already been retried and failed 15 times?

x4m commented 4 years ago

@dtseiler yes, I think so

TheKigen commented 3 years ago

So I wrote an alteration to wal-g/internal/uploader.go to add retry logic. The logic is very basic but works with Backblaze's S3-compatible API.

Edit: Do not use this, see below.

func (uploader *Uploader) Upload(path string, content io.Reader) error {
    if uploader.tarSize != nil {
        content = &WithSizeReader{content, uploader.tarSize}
    }
    var err error
    for i := 0; i<30; i++ {
        err = uploader.UploadingFolder.PutObject(path, content)
        if err == nil {
            return nil
        }
        tracelog.ErrorLogger.Printf("Attempt %d for %s, "+tracelog.GetErrorFormatter()+"\n", i, path, err)
        time.Sleep(5 * time.Second)
    }
    uploader.Failed.Store(true)
    return err
}
x4m commented 3 years ago

@TheKigen but content is a reader; you cannot just reread it... if it has already missed some bytes, the backup will be inconsistent.

TheKigen commented 3 years ago

@x4m I think this will work for now until retry logic is officially supported in WAL-G.

The problem with the code below is that it has to load the entire part into memory to get seek behavior. Not ideal, since these files can be over 1 GB in size in my use case, but ultimately it's much better than repeatedly rerunning the backup-push command until it works. My use case is a very large database that generates some 7000 or so parts.

NOTE: If you use this code, it will load the entire part into memory, see above. It also bypasses the maximum upload bandwidth limitation configuration. A variant that spills to a temporary file instead is sketched after the code.

func (uploader *Uploader) Upload(path string, content io.Reader) error {
    var err error
    var buf bytes.Buffer

    if uploader.tarSize != nil {
        // Wrap before buffering so the reported tar size is still tracked.
        content = &WithSizeReader{content, uploader.tarSize}
    }

    // Not ideal to load this all into memory, but getting seek behavior requires it.
    _, err = io.Copy(&buf, content)
    if err != nil {
        uploader.Failed.Store(true)
        tracelog.ErrorLogger.Printf("Failed getting content for %s, "+tracelog.GetErrorFormatter()+"\n", path, err)
        return err
    }

    reader := bytes.NewReader(buf.Bytes())

    for i := 1; i < 13; i++ {
        reader.Seek(0, io.SeekStart)
        // Assign to the outer err so the last upload error is returned after the final attempt.
        err = uploader.UploadingFolder.PutObject(path, reader)
        if err == nil {
            return nil
        }
        tracelog.ErrorLogger.Printf("Attempt %d for %s, "+tracelog.GetErrorFormatter()+"\n", i, path, err)
        time.Sleep(time.Duration(i*5) * time.Second)
    }
    uploader.Failed.Store(true)
    return err
}
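One way around the memory cost, sketched here as an untested variant rather than code from this thread, is to spill the part to a temporary file, which gives a seekable reader without holding the whole part in RAM (it assumes the same Uploader fields and imports as uploader.go, plus io, io/ioutil and os):

func (uploader *Uploader) Upload(path string, content io.Reader) error {
    if uploader.tarSize != nil {
        content = &WithSizeReader{content, uploader.tarSize}
    }

    // Spill the part to a temporary file to get seek behavior without
    // buffering the whole part in memory.
    tmpFile, err := ioutil.TempFile("", "walg-part-")
    if err != nil {
        uploader.Failed.Store(true)
        return err
    }
    defer os.Remove(tmpFile.Name())
    defer tmpFile.Close()

    if _, err = io.Copy(tmpFile, content); err != nil {
        uploader.Failed.Store(true)
        tracelog.ErrorLogger.Printf("Failed getting content for %s, "+tracelog.GetErrorFormatter()+"\n", path, err)
        return err
    }

    for i := 1; i < 13; i++ {
        if _, err = tmpFile.Seek(0, io.SeekStart); err != nil {
            break
        }
        if err = uploader.UploadingFolder.PutObject(path, tmpFile); err == nil {
            return nil
        }
        tracelog.ErrorLogger.Printf("Attempt %d for %s, "+tracelog.GetErrorFormatter()+"\n", i, path, err)
        time.Sleep(time.Duration(i*5) * time.Second)
    }
    uploader.Failed.Store(true)
    return err
}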
x4m commented 3 years ago

Maybe you could just increase MaxRetries in the S3 storage? It does exactly this: it breaks the reader into 20 MB multiparts, uploads each one with retries, and joins the result. Also check this code https://github.com/wal-g/storages/blob/master/gcs/folder.go#L276 It does so because the GCS SDK does not have built-in retries like the AWS SDK does.
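To make the mechanism x4m describes concrete, here is a generic sketch of the chunk-and-retry idea, not the SDK's or wal-g's actual code: the stream is consumed one fixed-size chunk at a time, so only the current chunk sits in memory and only that chunk is re-sent on failure. The uploadChunk callback and the retry counts are hypothetical.

const chunkSize = 20 << 20 // 20 MiB parts, matching the size mentioned above

// uploadWithChunkRetries consumes the reader chunk by chunk; each chunk is
// buffered, so a failed chunk can be retried without rewinding the whole stream.
func uploadWithChunkRetries(content io.Reader, uploadChunk func(part int, data []byte) error) error {
    buf := make([]byte, chunkSize)
    for part := 0; ; part++ {
        n, readErr := io.ReadFull(content, buf)
        if readErr == io.EOF {
            return nil // stream exhausted on a chunk boundary
        }
        if readErr != nil && readErr != io.ErrUnexpectedEOF {
            return readErr
        }

        var err error
        for attempt := 0; attempt < 5; attempt++ {
            if err = uploadChunk(part, buf[:n]); err == nil {
                break
            }
            time.Sleep(time.Duration(attempt+1) * time.Second)
        }
        if err != nil {
            return err
        }

        if readErr == io.ErrUnexpectedEOF {
            return nil // last, short chunk sent
        }
    }
}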

TheKigen commented 3 years ago

The problem there is that the errors returned by Backblaze's S3-compatible API surface as fatal errors in the AWS S3 SDK, without retries.

https://www.backblaze.com/blog/b2-503-500-server-error/

Some of these errors require re-upload from the beginning.

Edit: Also, thank you very much for pointing out the seek behavior. I'm not familiar with Go at all.

x4m commented 3 years ago

I'd be much happier if we had first-class Backblaze support. I know that their native API is somewhat superior to the ancient S3 one (I didn't really benchmark it, though). Do they have a Go SDK?

TheKigen commented 3 years ago

Looks like they only have Java and Python at the moment.

https://github.com/Backblaze

Some people are/were trying to add a Golang implementation of B2 API.

https://github.com/kurin/blazer https://github.com/kothar/go-backblaze

Though it appears the maintainers of the above have been moving to abandon those implementations since S3 API compatibility was added.

Minio used to use Blazer for B2 connections. https://github.com/minio/minio/pull/9547/files

x4m commented 3 years ago

Also, I think it should be possible to add an error to the retriable set in the S3 SDK.
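For anyone who wants to experiment with that, aws-sdk-go (v1) accepts a custom Retryer on the config; a rough sketch that additionally treats plain 400 responses (like the Backblaze one in the log above) as retryable could look like this. Whether retrying such 400s is actually safe for a given backend is an assumption on my part, not something the SDK or wal-g guarantees.

import (
    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/client"
    "github.com/aws/aws-sdk-go/aws/request"
    "github.com/aws/aws-sdk-go/aws/session"
)

// b2Retryer extends the SDK's default retry policy with one extra case.
type b2Retryer struct {
    client.DefaultRetryer
}

func (r b2Retryer) ShouldRetry(req *request.Request) bool {
    // Also retry the intermittent 400s returned by the S3-compatible endpoint,
    // on top of the SDK's usual throttling/5xx rules.
    if req.HTTPResponse != nil && req.HTTPResponse.StatusCode == 400 {
        return true
    }
    return r.DefaultRetryer.ShouldRetry(req)
}

func newRetryingSession() (*session.Session, error) {
    return session.NewSession(&aws.Config{
        Retryer: b2Retryer{client.DefaultRetryer{NumMaxRetries: 15}},
    })
}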