google / upvote_py2

A multi-platform binary whitelisting solution
Apache License 2.0

transient errors #38

Open thehesiod opened 5 years ago

thehesiod commented 5 years ago

Have started seeing more of these: InternalTransientError: Temporary error in fetching URL: https://www.googleapis.com/bigquery/v2/projects/[PROJ]/datasets/gae_streaming/tables/Host/insertAll, please re-try at _get_fetch_result (/base/alloc/tmpfs/dynamic_runtimes/python27g/7e468a4e2dbc991a/python27/python27_lib/versions/1/google/appengine/api/urlfetch.py:446)

as well: TimeoutError: (<requests.packages.urllib3.contrib.appengine.AppEngineManager object at 0x2a583ad9b4d0>, DeadlineExceededError('Deadline exceeded while waiting for HTTP response from URL: https://www.googleapis.com/bigquery/v2/projects/[PROJ]/datasets/gae_streaming/tables/Host/insertAll',))

and ConnectionError: Connection closed unexpectedly by server at URL: https://www.googleapis.com/bigquery/v2/projects/[PROJ]/datasets/gae_streaming/tables/Host/insertAll

Sounds like there are some missing retries. Does this mean that our BigQuery tables will be missing entries? If these are retried, then something should be changed so they don't show up in Stackdriver error reporting.

thehesiod commented 5 years ago

another:

BadGateway: 502 POST https://www.googleapis.com/bigquery/v2/projects/[PROJ]/datasets/gae_streaming/tables/Host/insertAll: Error 502 (Server Error)!!1 "That's an error. The server encountered a temporary error and could not complete your request. Please try again in 30 seconds. That's all we know." (Google's HTML error page; markup and CSS trimmed)
at api_request (/base/data/home/apps/m~[PROJ]/20190130t125536.415774778466604461/external/gcloud_core_archive/google/cloud/_http.py:293)
at retry_target (/base/data/home/apps/m~[PROJ]/20190130t125536.415774778466604461/external/gcloud_api_core_archive/google/api_core/retry.py:177)
at retry_wrapped_func (/base/data/home/apps/m~[PROJ]/20190130t125536.415774778466604461/external/gcloud_api_core_archive/google/api_core/retry.py:260)
at _call_api (/base/data/home/apps/m~[PROJ]/20190130t125536.415774778466604461/external/gcloud_bigquery_archive/google/cloud/bigquery/client.py:311)
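
For context on the retry_target / retry_wrapped_func frames above: the client call is already wrapped by google.api_core.retry, but a 502 can still surface if BadGateway isn't in the retry predicate. A minimal sketch of a retry that explicitly treats these 5xx responses as transient (the dataset/table names, row payload and backoff values below are hypothetical and illustrative, not what Upvote actually uses):

```python
# Illustrative only: retry 500/502/503 around a streaming insert with
# exponential backoff, instead of letting them surface immediately.
from google.api_core import exceptions, retry
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table(client.dataset('gae_streaming').table('Host'))

transient_retry = retry.Retry(
    predicate=retry.if_exception_type(
        exceptions.InternalServerError,   # 500
        exceptions.BadGateway,            # 502, as in the traceback above
        exceptions.ServiceUnavailable,    # 503
    ),
    initial=1.0, maximum=10.0, multiplier=2.0, deadline=60.0,
)

# insert_rows() issues the tabledata.insertAll request; the wrapper re-issues
# it on the exceptions listed above until the 60 s deadline is exhausted.
errors = transient_retry(client.insert_rows)(table, [{'hostname': 'example'}])
if errors:
    raise RuntimeError('insertAll reported per-row errors: %r' % errors)
```
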
msuozzo commented 5 years ago

Given that these requests are issued from deferred tasks, I believe these failures will be retried by the queue itself.
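
A minimal sketch of that pattern, for anyone following along (the function name and row payload are hypothetical; the queue name is the one this project uses for these inserts, see the next comment): an unhandled exception in the deferred function fails the task, and the push queue re-runs it according to that queue's retry configuration.

```python
# Sketch of the deferred-task pattern (names and payload are hypothetical).
from google.appengine.ext import deferred


def _stream_row_to_bigquery(row):
  """Stand-in for the task body that performs the insertAll call.

  If anything in here raises (BadGateway, DeadlineExceededError, a closed
  connection, ...), the task fails and the push queue re-runs it later with
  backoff, so the row isn't silently dropped.
  """


# Enqueue the insert on the streaming queue instead of calling BigQuery
# inline in the request handler.
deferred.defer(_stream_row_to_bigquery, {'hostname': 'example'},
               _queue='bigquery-streaming')
```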

chief8192 commented 5 years ago

Yeah, all BigQuery insertions take place on the bigquery-streaming queue, which is configured to retry a lot: https://github.com/google/upvote/blob/master/upvote/gae/queue.yaml#L102-L103

thehesiod commented 5 years ago

Cool. How could the code be updated so that these retried failures don't get logged as Stackdriver errors that need to be investigated?

thehesiod commented 5 years ago

Or maybe this is a Stackdriver/queue configuration thing?

msuozzo commented 5 years ago

So you're getting separate "potential issues" filed because the failures are occurring at different places in the App Engine stack. These should eventually peter out (there are only so many places an HTTP request can fail... maybe), but the better alternative would be to surround the potential request site(s) (link, which is actually about 60 lines...) with a broad try..except that raises a new error or (maybe) re-raises the original.

With this, you should only get a single alert in Stackdriver for a unique error (although you'll still see the errors occurring in any graphs or request metrics).
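
A rough sketch of that suggestion (the wrapper and exception names are hypothetical; the except clause is deliberately broad because urlfetch, urllib3 and google-api-core each raise their own transient error types):

```python
# Funnel every transport-level failure into one exception type so Stackdriver
# Error Reporting groups them as a single issue instead of one per raise site.


class TransientInsertError(Exception):
  """Single bucket for transient BigQuery insertAll failures."""


def insert_rows_or_group_error(client, table, rows):
  try:
    return client.insert_rows(table, rows)
  except Exception as e:
    # Broad on purpose: the original exception is preserved in the message,
    # but the type (and thus the Stackdriver grouping) is always the same.
    raise TransientInsertError('insertAll failed: %r' % e)
```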

thehesiod commented 5 years ago

Yes, so how would we change it so that a failure is only logged as an error once it has run out of retries? That's a lot of noise to keep track of. Seems like a general issue with tasks.
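
For example, could the task read App Engine's X-AppEngine-TaskExecutionCount header and only report to Stackdriver on its final attempt? Something like this sketch (the MAX_ATTEMPTS value and helper name are hypothetical and would have to match the queue's task_retry_limit):

```python
# Hypothetical sketch: downgrade failures to warnings until the task is on
# its last configured attempt, then log a real error for Stackdriver.
import logging
import os

MAX_ATTEMPTS = 10  # hypothetical; would need to match queue.yaml's task_retry_limit


def log_insert_failure(exc):
  # Push queues pass the attempt number in the X-AppEngine-TaskExecutionCount
  # request header, visible in the WSGI environ under this key.
  attempt = int(os.environ.get('HTTP_X_APPENGINE_TASKEXECUTIONCOUNT', '0')) + 1
  if attempt >= MAX_ATTEMPTS:
    logging.error('insertAll still failing after %d attempts: %r', attempt, exc)
  else:
    logging.warning('transient insertAll failure (attempt %d), will be retried: %r',
                    attempt, exc)
```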