Closed ljfranklin closed 5 years ago
We have created an issue in Pivotal Tracker to manage this. Unfortunately, the Pivotal Tracker project is private so you may be unable to view the contents of the story.
The labels on this github issue will be updated when the story is started.
Merged a PR to add retries on networking failures; this feature is available in om v0.41.0. Going to optimistically close this out, but will re-open if we still see the error.
This is still an issue. Currently seeing it in IST2.0 and ERT next on GCP.
This has been identified as a network issue, and we found that the retry client probably won't work for the upload case, since the payload would not be complete on the second try.
@ljfranklin are we able to close this issue?
@fredwangwang agreed the initial retry client PR doesn't help the issue. But we still see this issue frequently in our CI and would like to fix it. I moved our story up in our backlog to take another look: https://www.pivotaltracker.com/story/show/160181845
Additional context in this slack convo: https://pivotal.slack.com/archives/C5V956L13/p1539014781000100
Perhaps it's good that retry logic was added, but what if it fails three times in a row?
@eitansuez: You have to retry. At some point we have to call it on the number of things we can attempt to do.
We've seen the following error 22 times in the last 45 days:
We've seen this across multiple tiles (PAS & PASW) and different IaaS (GCP, Azure, AWS) and different OpsMgr versions (2.0-2.3). We initially suspected a recent PR which made file uploads more performant but we saw the error a few times prior to merging the PR.
One potential solution would be to retry on `Temporary()` networking errors. The stdlib networking packages often return `net.Error`, which has a `Temporary()` method you can use to check whether an error instance might be retryable: https://golang.org/pkg/net/#Error. We could add retry logic for these errors in the http client in `om`: https://github.com/pivotal-cf/om/blob/4d5f262bb6a1006f1e2af2754ee4e24707b5e4f3/network/unauthenticated_client.go. We suspect `EOF` is a temporary error but we're not positive.