movio / bramble

A federated GraphQL API gateway
https://movio.github.io/bramble/
MIT License

Socket hang up with long running request to service #228

Open codedge opened 8 months ago

codedge commented 8 months ago

Hey!

I'm seeing odd behaviour with long-running requests to a service connected to the gateway. I run a PHP-based service behind the gateway that sometimes needs up to 45s to return a response. In about 90% of cases the gateway does not return the response and I get a Socket hang up error instead.

I already enabled the limits plugin with this configuration:

{
  "name": "limits",
  "config": {
    "max-response-time": "120s",
    "max-request-bytes": 1000000
  }
}

This removes the initial reached timeout error message, but the connection between the gateway and the service still seems to get lost.

I am 100% sure that the response is correct and is returned properly by the backend service. Calling the backend service's GraphQL endpoint directly works without any issue.

Does that sound familiar, or do you have any hint where to look?

Thanks!

pkqk commented 8 months ago

Hi @codedge, can you post a copy of the response you're getting? The string Socket hang up doesn't seem to show up when I search the Go stdlib, and you mention later that it's a reached timeout.

Does it log the request when it fails?

codedge commented 8 months ago

Sorry for the confusion.

1. Resolving the reached timeout

At first I got a reached timeout error, which was directly visible in Bramble's logs. I figured out that using the limits plugin with the configuration mentioned above gets around this error.

This is solved ✔️

2. The Socket hang up problem

This error is returned by curl, Postman, or any other GraphQL client. There is no other response.

Error in curl: [screenshot 2023-11-13_224641]

Error in Postman: [screenshot 2023-11-13_225222]

I suspect this is a keep-alive/idle timeout problem.

I also found this link, which talks about the WriteTimeout field of http.Server in Go's net/http package.

It logs the request towards the backend service, but it does not log the response coming back.

codedge commented 8 months ago

... and I can confirm that after changing the WriteTimeout to, for example, 60 seconds

func runHandler(ctx context.Context, wg *sync.WaitGroup, name, addr string, handler http.Handler) {
    srv := &http.Server{
        Addr:         addr,
        Handler:      handler,
        ReadTimeout:  5 * time.Second,
        WriteTimeout: 60 * time.Second,
        IdleTimeout:  120 * time.Second,
    }
    // ...
}

everything works flawlessly.

Do you think you can make this configurable via the limits plugin?

Update

I can see that there are three server instances running: public, private, and metrics. I guess in my case only the public one is relevant.

Ideally, users would be able to configure this for each of the three.

I would create a PR (if you don't find time).

pkqk commented 8 months ago

Thanks for doing the debugging @codedge, that makes sense. If the write timeout is set to 10s by default, the server will close the socket before your service has responded.

It would be useful to have bramble craft a timeout response in that situation but we can make the socket settings tuneable as well.

The public and private muxes are there so that plugins can apply different middleware to a published endpoint and an internal endpoint. For example, we have auth on the public mux, which is exposed via ingress to our webapp, while the private mux serves backend services inside our VPC.

codedge commented 7 months ago

Is there a release planned to include the new configuration?