moleculerjs / moleculer-web

:earth_africa: Official API Gateway service for Moleculer framework
http://moleculer.services/docs/moleculer-web.html
MIT License

Adjusting keepAliveTimeout for the http server #226

Closed. morkeleb closed this issue 3 years ago

morkeleb commented 3 years ago

We're seeing random 502 timeouts from the ALB on AWS when pointing it at the web gateway.

After a series of adjustments and changes, we've now reached the point where I think the problem is between the HTTP server and the ALB. Given that assumption, it seems quite a few people are experiencing the same issue with AWS load balancers and Node web servers (whether Express or plain http based).

The workaround seems to be to adjust keepAliveTimeout so that it interacts correctly with the load balancer's idle timeout.

The following posts explain the issue deeper:

https://shuheikagawa.com/blog/2019/04/25/keep-alive-timeout/
https://ngchiwa.medium.com/aws-alb-express-502-ramdom-daaabaafb7cf
https://adamcrowder.net/posts/node-express-api-and-aws-alb-502/

I'm trying to adjust those settings in our gateway to ensure that we solve the 502 errors.

However, I cannot find where to inject these settings. I need either an example or a configuration entry that lets me pass more settings to the server.

AndreMaz commented 3 years ago

You can access the server in the created() function of the API gateway (the original reply included a screenshot of the relevant code).
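
A minimal sketch of what that screenshot likely showed, assuming a standard moleculer-web service definition where the gateway mixin exposes its underlying Node.js http.Server as this.server (the value below is illustrative):

  const ApiGateway = require("moleculer-web");

  module.exports = {
    name: "api",
    mixins: [ApiGateway],

    created() {
      // The gateway mixin creates the Node.js http.Server and exposes it as
      // this.server, so its timeout options can be tuned in the service's
      // own created() hook before traffic arrives.
      this.server.keepAliveTimeout = 65000; // example value in milliseconds
    }
  };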

morkeleb commented 3 years ago

Thank you, I'm going to run a version with the timeouts in place now.

morkeleb commented 3 years ago

This is the added config:


  created() {
    // Ensure inactive connections are terminated by the ALB by setting this
    // a few seconds higher than the ALB idle timeout.
    this.server.keepAliveTimeout = 65000;
    // Ensure headersTimeout is set higher than keepAliveTimeout because of
    // this Node.js regression: https://github.com/nodejs/node/issues/27363
    this.server.headersTimeout = 66000;
  },
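
For context (from the linked posts and AWS documentation, not from this thread): the ALB's default idle timeout is 60 seconds, so 65000 ms keeps Node's keep-alive window longer than the load balancer's and ensures the ALB, not the server, closes idle connections. headersTimeout is set one second higher still because, in the affected Node.js versions, it otherwise caps keep-alive connections regardless of keepAliveTimeout (the linked nodejs/node issue).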
AndreMaz commented 3 years ago

@morkeleb please let me know if it worked. If so, I'll add this to the FAQ.

morkeleb commented 3 years ago

Of course. Initial tests are looking good. There was a timeout about an hour ago, but it doesn't follow the same pattern; I'm investigating whether it's this issue or something else. That said, there is a significant drop in random timeouts.

morkeleb commented 3 years ago

The other timeouts were related to a liveness check failure in Kubernetes, so they are explained.

It seems this works. It's been running for 24 hours now, and the only timeout incident is accounted for by a restart.

morkeleb commented 3 years ago

This is working for us now. Thanks for the help @AndreMaz

AndreMaz commented 3 years ago

Awesome :+1: