pivotal-cf / om

General command line utility for working with VMware Tanzu Operations Manager
Apache License 2.0

Intermittent `POST .../api/v0/available_products: EOF` when uploading tile #240

Closed ljfranklin closed 5 years ago

ljfranklin commented 6 years ago

We've seen the following error 22 times in the last 45 days:

processing product
beginning product upload to Ops Manager
 10.64 GiB / 12.60 GiB [==================================>------]  84.44% 1m34s
could not execute "upload-product": failed to upload product: could not make api request to available_products endpoint: Post https://pcf-optional.abalone-scale.gcp.releng.cf-app.com/api/v0/available_products: EOF

We've seen this across multiple tiles (PAS & PASW), different IaaSes (GCP, Azure, AWS), and different OpsMgr versions (2.0-2.3). We initially suspected a recent PR that made file uploads more performant, but we saw the error a few times before that PR was merged.

One potential solution would be to retry on Temporary() networking errors. The stdlib networking packages often return net.Error which has a Temporary() method you can use to check whether an error instance might be retryable: https://golang.org/pkg/net/#Error. We could add retry logic for these errors in the http client in om: https://github.com/pivotal-cf/om/blob/4d5f262bb6a1006f1e2af2754ee4e24707b5e4f3/network/unauthenticated_client.go. We suspect EOF is a temporary error but we're not positive.

cf-gitbot commented 6 years ago

We have created an issue in Pivotal Tracker to manage this. Unfortunately, the Pivotal Tracker project is private so you may be unable to view the contents of the story.

The labels on this github issue will be updated when the story is started.

ljfranklin commented 6 years ago

Merged a PR to add retries on networking failures here; this feature is available in om v0.41.0. Going to optimistically close this out, but will re-open if we still see the error.

heycait commented 6 years ago

This is still an issue. Currently seeing it in IST2.0 and ERT next on GCP.

fredwangwang commented 5 years ago

This has been identified as a network issue, and we found that the retry client probably won't work for the upload case, since the payload would not be complete on the second try.

@ljfranklin are we able to close this issue?

ljfranklin commented 5 years ago

@fredwangwang Agreed, the initial retry client PR doesn't help with this issue. But we still see it frequently in our CI and would like to fix it. I moved our story up in our backlog to take another look: https://www.pivotaltracker.com/story/show/160181845

ljfranklin commented 5 years ago

Additional context in this slack convo: https://pivotal.slack.com/archives/C5V956L13/p1539014781000100

eitansuez commented 5 years ago

Perhaps it's good that retry logic was added, but what if it fails three times in a row?

jtarchie commented 5 years ago

@eitansuez: You have to retry. At some point we have to call it on the number of things we can attempt to do.