Closed: f2prateek closed this issue 4 years ago.
cc @tritone @codyoss
Marking this as a question for now because we're solidifying our retry guidance/implementation for a couple other issues, too. Not sure what the right next step is for this quite yet.
And thank you for reporting this!
Is `unexpected EOF` another similar error that could be retried? At least for certain types of requests, perhaps. I observed a few of these errors when listing objects, due to https://status.cloud.google.com/incident/zall/20003 I believe. Thank you!
This bug looks similar to https://github.com/googleapis/google-cloud-go/issues/1638, which will be solved by https://code-review.googlesource.com/c/gocloud/+/48990, which I have just chimed in on. That linked CL, whose motivation was already approved by the GCS team, will retry `net/url.Error.Temporary() -> true` errors, so that should also fix this bug report.
> Is `unexpected EOF` another similar error that could be retried? At least for certain types of requests perhaps. I observed a few of these errors when listing objects due to https://status.cloud.google.com/incident/zall/20003 I believe. Thank you!
I am not too sure about that @horgh, great question! I think, though, that it should be treated as a retryable error by the caller, because ideally the transport layer would have already retried the underlying error as many times as allowed and only bubbled it up to the caller once all the retries were exhausted.
Here is a full reproducer that doesn't need to go out over the network:
```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/url"
	"sync"

	"cloud.google.com/go/storage"
	"google.golang.org/api/option"
)

func main() {
	ln, err := net.Listen("tcp", ":0") // Reserve an address for us.
	if err != nil {
		panic(err)
	}
	addr := ln.Addr().String()
	serverURL := "http://" + addr

	var wg sync.WaitGroup
	defer wg.Wait()
	wg.Add(1)
	go func() {
		defer wg.Done()
		conn, err := ln.Accept()
		if err != nil {
			panic(err)
		}
		conn.(*net.TCPConn).SetLinger(0)
		// Immediately close it after accepting to trigger an ECONNRESET.
		conn.Close()
	}()

	ctx := context.Background()
	client, err := storage.NewClient(ctx, option.WithEndpoint(serverURL))
	if err != nil {
		panic(err)
	}
	if _, err := client.Bucket("foo").Attrs(ctx); err != nil {
		oe := err.(*url.Error).Err.(*net.OpError)
		fmt.Printf("Temporary: %t\nUnderlying error: %#v\n", oe.Temporary(), oe.Err)
		panic(err)
	}
}
```
which prints:

```
Temporary: false
Underlying error: &os.SyscallError{Syscall:"read", Err:0x36}
panic: Get "http://[::]:54680/b/foo?alt=json&prettyPrint=false&projection=full": read tcp [::1]:54682->[::1]:54680: read: connection reset by peer

goroutine 1 [running]:
main.main()
	/Users/emmanuelodeke/Desktop/openSrc/bugs/google-cloud-go/1832/repro.go:52 +0x45c
exit status 2
```
Hi all, I merged https://code-review.googlesource.com/c/gocloud/+/55970, which should resolve this issue for most operations (reads, object metadata operations, etc.). However, retries for writes use a different code path involving logic in google-api-go-client, so I am working on a separate fix for that. Will update here when that is ready.
Awesome, thank you!
Regarding `unexpected EOF`: I'm not sure I follow the rationale for why it's something the caller needs to take care of, unlike the others. It would simplify my life as a caller if it were retried too! For example, internally I currently wrap all read-type calls and retry them if I see that error or a `net.OpError`. It sounds like that will still be necessary, which seems unfortunate!
@horgh are you experiencing this issue on writes only or on other operations as well?
The error `io.ErrUnexpectedEOF` should be retried for writes, at least at the level of the generated client; see https://github.com/googleapis/google-api-go-client/blob/326e17a21103f4ccf44ac1b40587ce7bcdd58b14/internal/gensupport/resumable.go#L162 . However, there is an open bug about this at googleapis/google-api-go-client#449 reporting that it is not working as intended. I'm working on fixing this as well.
It was read operations where I observed it happening. For example, I saw it when listing objects.
The fix for writes has been released in storage/v1.10.0. I'm going to close this issue.
@horgh feel free to file another issue for `unexpected EOF` and I will look into whether that can be addressed as well (I need to do some further research on my end).
I am having the same issue with storage/v1.31.0 running in GKE while writing: unexpected failures with `connection reset by peer`.
I don't think the client library is handling connection reset errors.

Client: Storage

Environment: ECS Service on AWS

Code:
Expected behavior
I believe this is a temporary error, and it should be retried by the client library (following https://cloud.google.com/storage/docs/exponential-backoff). It should rarely be bubbled up to the caller, and only after exhausting retries.
Actual behavior
Sometimes, an error is returned by the client:
I've only seen this during the `w.Close` call so far. For our production environment, this happens approximately 0.1% of the time for some projects.

Screenshots
N/A
Additional context
N/A