storage: handle connection reset errors

f2prateek commented 4 years ago

I don't think the client library is handling connection reset errors

Client

Storage

Environment

ECS Service on AWS

Code

package main

func main() {
        bucket := "..."
        path := "..."
        r := bytes.NewReader("...")

    w := s.Bucket(bucket).Object(path).NewWriter(ctx)
    w.ContentType = "application/gzip"
        if _, err = io.Copy(w, in.Body); err != nil {
        panic(err)
    }

    if err = w.Close(); err != nil {
        panic(err)
    }
}

Expected behavior

I believe this a temporary error, and it should be retried by client library (following https://cloud.google.com/storage/docs/exponential-backoff). It should rarely be bubbled up to the caller - and only do so after exhausting retries.

Actual behavior

Sometimes, an error is returned by the client:

Post "https://www.googleapis.com/upload/storage/v1/b/{omitted}/o?alt=json&prettyPrint=false&projection=full&uploadType=multipart": read tcp 172.17.0.21:52100->172.217.3.170:443: read: connection reset by peer

I've only seen this during the w.Close call so far. For our production environment, this happens on approximately 0.1% of the time for some projects.

Screenshots

N/A

Additional context

N/A

tbpg commented 4 years ago

cc @tritone @codyoss

Marking this as a question for now because we're solidifying our retry guidance/implementation for a couple other issues, too. Not sure what the right next step is for this quite yet.

tbpg commented 4 years ago

And thank you for reporting this!

horgh commented 4 years ago

Is unexpected EOF another similar error that could be retried? At least for certain types of requests perhaps. I observed a few of these errors when listing objects due to https://status.cloud.google.com/incident/zall/20003 I believe. Thank you!

odeke-em commented 4 years ago

This bug looks similar to https://github.com/googleapis/google-cloud-go/issues/1638 which will be solved https://code-review.googlesource.com/c/gocloud/+/48990 which I have just chimed in on. That linked CL whose motivation was already approved by the GCS team will retry 1net/url.Error.Temporary() -> true1 errors os that should also fix this bug report.

odeke-em commented 4 years ago

Is unexpected EOF another similar error that could be retried? At least for certain types of requests perhaps. I observed a few of these errors when listing objects due to https://status.cloud.google.com/incident/zall/20003 I believe. Thank you!

I am not too sure about that @horgh, great question! I think that though it should be a client retryable error by the caller, because ideally the transport layer would have retried the underlying error as many times then finally bubbled it up to the caller once all the retries were exhausted.

odeke-em commented 4 years ago

Here is a full reproducer without having to go out of the network

package main

import (
    "context"
    "fmt"
    "net"
    "net/url"
    "sync"

    "cloud.google.com/go/storage"
    "google.golang.org/api/option"
)

func main() {
    ln, err := net.Listen("tcp", ":0") // Reserve an address for us.
    if err != nil {
        panic(err)
    }

    addr := ln.Addr().String()
    serverURL := "http://" + addr

    var wg sync.WaitGroup
    defer wg.Wait()

    wg.Add(1)
    go func() {
        defer wg.Done()

        conn, err := ln.Accept()
        if err != nil {
            panic(err)
        }
        conn.(*net.TCPConn).SetLinger(0)
        // Immediately close it after accepting to toggle an ECONNRESET.
        conn.Close()
    }()

    ctx := context.Background()
    client, err := storage.NewClient(ctx, option.WithEndpoint(serverURL))
    if err != nil {
        panic(err)
    }
    if _, err := client.Bucket("foo").Attrs(ctx); err != nil {
        oe := err.(*url.Error).Err.(*net.OpError)
        fmt.Printf("Temporary: %t\nUnderlying error: %#v\n", oe.Temporary(), oe.Err)
        panic(err)
    }
}

which prints

Temporary: false
Underlying error: &os.SyscallError{Syscall:"read", Err:0x36}
panic: Get "http://[::]:54680/b/foo?alt=json&prettyPrint=false&projection=full": read tcp [::1]:54682->[::1]:54680: read: connection reset by peer

goroutine 1 [running]:
main.main()
    /Users/emmanuelodeke/Desktop/openSrc/bugs/google-cloud-go/1832/repro.go:52 +0x45c
exit status 2

tritone commented 4 years ago

Hi all, I merged https://code-review.googlesource.com/c/gocloud/+/55970 which should resolve this issue for most operations (reads, object metadata operations, etc). However, retries for writes use a different code path involving logic in google-api-go-client so I am working on a separate fix for that. Will update here when that is ready.

horgh commented 4 years ago

Awesome, thank you!

Regarding unexpected EOF - I'm not sure I follow the rationale why it's something the caller needs to take care of vs. the others. It'd simplify my life as a caller if it was retried too! For example internally I currently wrap all read type calls and retry them if it's that or if it's a net.OpError. It sounds like that will still be necessary, which seems unfortunate!

tritone commented 4 years ago

@horgh are you experiencing this issue on writes only or on other operations as well?

The error io.ErrUnexpectedEOF should be retried for writes at least at the level of the generated client; see https://github.com/googleapis/google-api-go-client/blob/326e17a21103f4ccf44ac1b40587ce7bcdd58b14/internal/gensupport/resumable.go#L162 . However there is an open bug regarding this at googleapis/google-api-go-client#449 that says that this is not working as intended. I'm working on fixing this as well.

horgh commented 4 years ago

It was read operations where I observed it happening. For example, I saw it when listing objects.

tritone commented 4 years ago

The fix for writes has been released in storage/v1.10.0 . I'm going to close this issue.

@horgh feel free to file another issue for unexpectedEOF and I will look into whether that can be addressed as well (need to do some further research on my end).

dstiliadis commented 1 year ago

I am having the same issue with storage/v1.31.0 running in GKE while writing. Unexpected failures with reset by peer.

googleapis / google-cloud-go

storage: handle connection reset errors #1832