rstudio / promises

A promise library for R
https://rstudio.github.io/promises

Heroku error H27: Client Request Interrupted #85

Closed · mrkaye97 closed this 1 year ago

mrkaye97 commented 1 year ago

Hey everyone, sorry for the clickbait issue title :)

We've got a plumber API running in production. Our code for one of the endpoints looks like this (cleaned up some of the logging messages and stuff for example purposes):

function(req, res, input) {

  start <- Sys.time()
  send_log(req = req, message = paste("Doing something. There were", length(input), "things requested."))

  future_promise({do_something(input)}) %>%
    then(
      onFulfilled = function(result) {
        result_length <- length(result)

        send_log(
          req = req,
          message = paste(
            "Did something. Returning",
            result_length,
            "things"
          ),
          time = timediff_ms(start)
        )

        return(result)
      },
      onRejected = function(result) {
        send_log(
          "Failed to do the thing. Got into onRejected block.",
          level = "error",
          req = req
        )

        # note: returning the error object here converts the rejection
        # into a fulfilled promise; re-raising with stop(result) would
        # propagate the rejection instead
        return(result)
      }
    )
}

We're running on Heroku and are intermittently getting H27 errors. They happen fairly often (too often, relative to some of our Haskell services in production, for it to be a networking issue), but it's not clear to me exactly what's going on. Here's an example of one of the errors:

2022-08-23T21:19:48.668348+00:00 app[web.1]: at=info request_id=17835ef0-f4a6-49a2-b72a-72474d69a634 message=Doing something. There were 50 things requested.
2022-08-23T21:19:52.283791+00:00 app[web.1]: at=info request_id=17835ef0-f4a6-49a2-b72a-72474d69a634 message=Did something. Returning 50 things. time=3614.435ms
2022-08-23T21:19:52.312431+00:00 heroku[router]: sock=client at=warning code=H27 desc="Client Request Interrupted" method=POST path="/do/something" host=url.herokuapp.com request_id=17835ef0-f4a6-49a2-b72a-72474d69a634 fwd="54.89.146.252" dyno=web.1 connect=0ms service=3663ms status=499 bytes= protocol=https

My suspicion is that something wonky is going on when passing data back and forth between the main process and the worker, but I'm really not sure.

So the real question: What's the best way to add logging to test this theory? What I'd love to be able to do is capture the result of the future_promise, check its length() (e.g.), and log that. But I'm not sure how to do that.

Separately, I'm curious about best practices for this kind of thing in general -- I haven't seen much discussion of how to add logging, how to debug, etc.

Thanks a bunch for any help!

jcheng5 commented 1 year ago

I imagine this would be much more prevalent if your plumber APIs have greater latency than your Haskell services. Wait--I just read the link you sent, and it specifically says:

> You might think this an indication that your code could be optimised and made to run faster, reducing your chances that users abandon the site. However, this is very unlikely the issue. H27s can be a bit frustrating to solve since the cause is from the client and not the servers, but it's not considered a critical error.

It also specifically says this is from the client dropping the connection, not the server. So I'm not sure what you could log to help inform that. Maybe the client IP address and user-agent, and look for patterns there?
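A minimal sketch of what that logging could look like as a plumber filter (the filter name and log format here are made up; `REMOTE_ADDR` and `HTTP_USER_AGENT` are standard Rook-style request fields that plumber exposes on `req`):

```r
library(plumber)

#* Log client details on every request, so H27s can later be
#* correlated with a particular caller (sketch, not production code).
#* @filter log_client
function(req) {
  message(sprintf(
    "ip=%s ua=%s method=%s path=%s",
    req$REMOTE_ADDR, req$HTTP_USER_AGENT,
    req$REQUEST_METHOD, req$PATH_INFO
  ))
  forward()  # pass control on to the next filter / endpoint
}
```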

I'm also not sure what you mean by "capture the result of the future_promise, check its length()"--isn't that exactly what is happening when you log "Did something. Returning 50 things."?

mrkaye97 commented 1 year ago

hey @jcheng5, thanks for following up!

re: latency -- it depends a lot on the call. Some of the calls to the plumber API return in ~10ms (I could check our Sumo logs, but if I had to guess, I'd ballpark that at >90% of them). The async ones return in more like ~100ms. The Haskell service's p50 is maybe 3ms, but probably 1/3 of its requests take more like ~75-100ms. So it's tough to compare apples to apples, since this is happening across different services, different endpoints, etc., but I'm not convinced it's a latency issue.

re: users + "it comes from the client" -- the "client" in this case is our Rails app, not an actual user, so I doubt this is being caused by a user abandoning the site, having spotty wifi, etc. I could be wrong, but that wouldn't make much sense to me.

re: capture the result -- (disclaimer: I'm very much not a future or promises expert 😄) I was under the impression that I could not do something like this:

function(req, res) {
  # x here is a promise object, not the resolved value --
  # so length(x) is not the length of the eventual result
  x <- future_promise({do_something()}) %>%
    then(function(result) do_another_thing(result))

  message(length(x))

  return(x)
}

Again, my promises knowledge is pretty limited, but I don't totally understand whether logging from inside the then() call is somehow different from logging from outside of it. My intuition was that since the code runs async, we might get logging back saying everything looks good, but then when the actual result is returned (which I was thinking happens at some undetermined later time), maybe something has gone wrong: a timeout, a failure passing data back from the worker to the main process, or something else. Again, my knowledge here really isn't great; I'm just trying to figure out what else might be going wrong, since we have such a large proportion of requests -- at least relatively speaking -- seemingly just being dropped.
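To make the ordering concrete, here's a tiny self-contained sketch (a plain `promise()` plus `later` to drain the event loop, so it runs outside plumber; this is not our endpoint code):

```r
library(promises)
library(later)

p <- promise(function(resolve, reject) resolve(42)) %>%
  then(function(value) message("handler runs later; value = ", value))

# this line executes immediately, before the handler above fires:
message("right after the pipe, p is a: ", class(p)[1])

# drain the event loop so the pending handler gets a chance to run
while (!loop_empty()) run_now()
```

The point: `p` is a promise object the whole time; the only place the resolved value is visible is inside the handler passed to `then()`.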

Also, this seems to be getting to the point of being a plumber issue more than a promises issue. If it keeps going down this path, I'll open an issue over there.

Thanks again!

mrkaye97 commented 1 year ago

Update: it appears we fixed the issue by changing from plan(multisession, workers = 2) to plan(multicore, workers = I(1)). The machine has 2 available cores.

I'm not sure exactly why this would change anything, but I imagine it's related to the old plan using all of the cores on the machine? My intuitions here are very shaky, though.
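For posterity, the change was roughly this (a sketch; the comments reflect my reading of the future docs on `workers = I(n)`, so take them with a grain of salt):

```r
library(future)

# before: two background R sessions communicating over sockets;
# results get serialized back to the main process
# plan(multisession, workers = 2)

# after: forked child processes (multicore is not available on Windows).
# With multicore, workers = 1 would normally fall back to a sequential
# plan; wrapping the count in I() overrides that check and keeps one
# true background worker.
plan(multicore, workers = I(1))
```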

Closing