uNetworking / uWebSockets.js


Debugging an issue on AWS ALB 502s #179 #939

Closed: jsonmorris closed this 1 year ago

jsonmorris commented 1 year ago

Hey folks, I'm using hyper-express behind an AWS ALB and running into the very. well. documented. 502 problem with upstream TCP disconnects. We are seeing about 0.05% of requests being dropped with a 502 on the ALB.

Looking to the community for help debugging uSockets or uWebSockets to see where there's a problem. As it stands, there's not a lot of tracing, logging, etc. that is helpful to me.

Based on the following evidence, it seems that there's an issue with uSockets closing open connections with TCP RST/FIN before the AWS ALB timeout has elapsed. It's interesting to note the target_processing_time here: some requests look to be canceled near-instantly (0.001s), while others are in flight for a considerable amount of time (8s - 12s).

logs table |`request_processing_time`|`target_processing_time`|`response_processing_time`|`elb_status_code`|`received_bytes`|`sent_bytes`| |--|--|--|--|--|--| |0.003|0.0|-1.0|502|525|155| |0.001|8.932|-1.0|502|2206|679| |0.003|0.0|-1.0|502|525|155| |0.026|0.001|-1.0|502|240|277| |0.004|0.0|-1.0|502|359|277| |0.005|0.001|-1.0|502|250|674| |0.002|0.0|-1.0|502|527|155| |0.003|0.0|-1.0|502|527|155| |0.008|0.001|-1.0|502|173|277| |0.004|0.001|-1.0|502|204|679| |0.002|10.39|-1.0|502|23172|679| |0.003|0.0|-1.0|502|196|679| |0.007|0.0|-1.0|502|527|155| |0.003|0.0|-1.0|502|525|155| |0.007|0.001|-1.0|502|527|155| |0.005|0.001|-1.0|502|206|679| |0.005|0.001|-1.0|502|525|155| |0.003|0.0|-1.0|502|185|272| |0.001|10.031|-1.0|502|23173|679| |0.005|8.711|-1.0|502|2285|679| |0.001|3.608|-1.0|502|2553|277| |0.001|11.615|-1.0|502|23172|679| |0.005|0.001|-1.0|502|261|679| |0.001|6.86|-1.0|502|297|272| |0.004|0.0|-1.0|502|525|155| |0.004|0.0|-1.0|502|527|155| |0.01|0.001|-1.0|502|234|277| |0.001|0.0|-1.0|502|2575|277| |0.001|11.491|-1.0|502|297|272| |0.001|0.375|-1.0|502|599|679| |0.003|0.0|-1.0|502|359|277| |0.004|0.001|-1.0|502|231|679| |0.004|0.001|-1.0|502|527|155| |0.001|0.0|-1.0|502|2286|679| |0.003|0.0|-1.0|502|361|277| |0.009|0.001|-1.0|502|527|155| |0.004|0.0|-1.0|502|525|155| |0.003|0.0|-1.0|502|230|679| |0.006|0.001|-1.0|502|316|679| |0.011|0.001|-1.0|502|326|277| |0.006|0.001|-1.0|502|525|155| |0.005|0.001|-1.0|502|221|272| |0.065|1.691|-1.0|502|2193|679| |0.003|0.0|-1.0|502|224|272| |0.003|0.0|-1.0|502|220|674| |0.08|9.732|-1.0|502|2280|679| |0.081|8.272|-1.0|502|3180|679| |0.001|3.747|-1.0|502|4540|679| |0.006|11.332|-1.0|502|1940|679| |0.005|0.001|-1.0|502|243|679| |0.006|0.001|-1.0|502|230|679| |0.004|0.0|-1.0|502|316|679| |0.005|0.001|-1.0|502|525|155| |0.008|0.0|-1.0|502|527|155| |0.005|0.001|-1.0|502|256|674| |0.004|0.001|-1.0|502|527|155| |0.006|0.001|-1.0|502|230|679| |0.006|0.001|-1.0|502|316|679| |0.003|0.001|-1.0|502|214|674| |0.004|0.001|-1.0|502|55|277| |0.003|0.0|-1.0|502|231|679| |0.037|0.0|-1.0|502|317|679| |0.01|0.001|-1.0|502|43|277| |0.003|0.0|-1.0|502|159|277| |0.006|0.001|-1.0|502|234|679| |0.001|0.0|-1.0|502|527|155| |0.003|0.0|-1.0|502|204|674| |0.001|0.001|-1.0|502|204|679| |0.006|0.0|-1.0|502|274|277| |0.004|0.001|-1.0|502|231|679| |0.005|0.001|-1.0|502|317|679| |0.003|0.0|-1.0|502|361|277| |0.004|0.0|-1.0|502|271|679| |0.003|0.0|-1.0|502|317|679| |0.003|0.0|-1.0|502|231|679| |0.071|0.007|-1.0|502|1961|679| |0.008|0.0|-1.0|502|190|272| |0.005|0.001|-1.0|502|234|277| |0.006|0.001|-1.0|502|215|679| |0.016|0.001|-1.0|502|326|277| |0.009|0.0|-1.0|502|236|679| |0.006|0.0|-1.0|502|525|155| |0.006|0.001|-1.0|502|525|155| |0.001|8.012|-1.0|502|297|272| |0.005|0.001|-1.0|502|222|277| |0.003|0.0|-1.0|502|361|277| |0.005|0.001|-1.0|502|317|679| |0.014|0.001|-1.0|502|231|679| |0.029|0.001|-1.0|502|240|277| |0.005|0.001|-1.0|502|184|277| |0.003|0.0|-1.0|502|233|277| |0.006|0.0|-1.0|502|234|679| |0.002|0.0|-1.0|502|525|155| |0.004|0.0|-1.0|502|42|272| |0.004|0.0|-1.0|502|136|277| |0.004|0.0|-1.0|502|203|277| |0.007|0.001|-1.0|502|219|674| |0.007|0.0|-1.0|502|527|155| |0.001|0.001|-1.0|502|216|679| |0.004|0.0|-1.0|502|231|679| |0.003|0.0|-1.0|502|317|679| |0.003|0.0|-1.0|502|527|155| |0.006|0.001|-1.0|502|229|679| |0.003|0.0|-1.0|502|525|155| |0.007|0.0|-1.0|502|68|277| |0.001|0.0|-1.0|502|105|277| |0.003|0.0|-1.0|502|205|679| |0.006|0.001|-1.0|502|205|679| |0.004|0.001|-1.0|502|204|679| |0.001|0.0|-1.0|502|527|155| |0.091|11.21|-1.0|502|24149|679| |0.001|8.434|-1.0|502|23173|679| 
|0.062|8.697|-1.0|502|2037|679| |0.001|8.444|-1.0|502|2821|277| |0.005|0.001|-1.0|502|236|679| |0.003|0.0|-1.0|502|527|155| |0.012|0.0|-1.0|502|2069|679| |0.009|0.002|-1.0|502|274|277| |0.071|0.001|-1.0|502|2000|679| |0.006|0.001|-1.0|502|316|679| |0.002|0.0|-1.0|502|230|679| |0.006|0.001|-1.0|502|242|272| |0.005|0.001|-1.0|502|206|272| |0.006|0.0|-1.0|502|325|277| |0.001|4.815|-1.0|502|2123|679| |0.003|0.0|-1.0|502|42|277| |0.003|0.0|-1.0|502|158|277| |0.005|0.001|-1.0|502|204|674| |0.053|0.001|-1.0|502|240|277| |0.006|0.001|-1.0|502|525|155| |0.003|0.0|-1.0|502|316|679| |0.004|0.0|-1.0|502|230|679| |0.008|0.001|-1.0|502|254|272| |0.009|0.001|-1.0|502|525|155| |0.006|0.001|-1.0|502|305|674| |0.005|11.45|-1.0|502|1940|679| |0.001|9.396|-1.0|502|2032|679| |0.003|0.0|-1.0|502|158|277| |0.005|0.001|-1.0|502|527|155| |0.003|0.0|-1.0|502|230|679| |0.006|0.001|-1.0|502|316|679| |0.001|1.124|-1.0|502|2280|679| |0.009|0.001|-1.0|502|540|679| |0.008|0.001|-1.0|502|504|679| |0.003|0.0|-1.0|502|100|277| |0.003|0.0|-1.0|502|200|679| |0.004|0.0|-1.0|502|231|679| |0.004|0.001|-1.0|502|317|679| |0.009|0.0|-1.0|502|68|277| |0.003|0.0|-1.0|502|263|679| |0.004|0.0|-1.0|502|230|679| |0.008|0.0|-1.0|502|234|679| |0.001|2.543|-1.0|502|2110|679| |0.001|9.208|-1.0|502|2291|679| |0.003|0.0|-1.0|502|359|277| |0.005|0.001|-1.0|502|243|674| |0.004|0.001|-1.0|502|527|155| |0.003|0.0|-1.0|502|527|155| |0.004|0.0|-1.0|502|527|155| |0.003|0.0|-1.0|502|263|679| |0.001|4.743|-1.0|502|2069|679| |0.001|9.521|-1.0|502|1946|679| |0.007|0.0|-1.0|502|254|272| |0.063|10.371|-1.0|502|2204|679| |0.001|6.727|-1.0|502|3438|679| |0.001|7.943|-1.0|502|23172|679| |0.009|0.001|-1.0|502|158|277| |0.005|0.001|-1.0|502|527|155| |0.148|0.056|-1.0|502|2241|679| |0.005|0.001|-1.0|502|43|272| |0.005|0.001|-1.0|502|254|272| |0.016|0.001|-1.0|502|249|679| |0.01|0.0|-1.0|502|527|155| |0.008|0.0|-1.0|502|527|155| |0.006|0.001|-1.0|502|43|272| |0.003|0.0|-1.0|502|360|674| |0.003|0.0|-1.0|502|380|674| |0.004|0.0|-1.0|502|370|674| |0.003|0.0|-1.0|502|385|674| |0.003|0.0|-1.0|502|230|679| |0.003|0.0|-1.0|502|316|679| |0.002|0.0|-1.0|502|211|679| |0.003|0.0|-1.0|502|525|155| |0.006|0.0|-1.0|502|527|155| |0.005|0.001|-1.0|502|159|277| |0.007|0.0|-1.0|502|159|277| |0.006|0.001|-1.0|502|254|272| |0.003|0.0|-1.0|502|243|679| |0.008|0.0|-1.0|502|230|679| |0.007|0.0|-1.0|502|316|679| |0.003|0.0|-1.0|502|525|155| |0.004|0.0|-1.0|502|231|679| |0.001|0.001|-1.0|502|143|277| |0.003|0.0|-1.0|502|159|277| |0.001|10.109|-1.0|502|2248|679| |0.001|0.0|-1.0|502|3135|679| |0.006|0.001|-1.0|502|159|277| |0.003|0.0|-1.0|502|43|277| |0.005|0.0|-1.0|502|203|679| |0.009|0.003|-1.0|502|259|277| |0.005|0.001|-1.0|502|527|155| |0.004|0.001|-1.0|502|219|679| |0.008|0.0|-1.0|502|361|277| |0.008|0.001|-1.0|502|241|679| |0.01|0.0|-1.0|502|241|679| |0.003|0.0|-1.0|502|253|679| |0.003|0.0|-1.0|502|262|679| |0.004|0.0|-1.0|502|241|679| |0.003|0.0|-1.0|502|283|679| |0.007|0.0|-1.0|502|319|679| |0.003|0.0|-1.0|502|263|679| |0.002|0.0|-1.0|502|277|679| |0.003|0.0|-1.0|502|230|679| |0.005|0.001|-1.0|502|527|155| |0.006|0.001|-1.0|502|296|277| |0.007|0.0|-1.0|502|382|277| |0.003|0.0|-1.0|502|193|277| |0.004|0.0|-1.0|502|101|277| |0.005|0.0|-1.0|502|271|679| |0.002|0.0|-1.0|502|2313|679| |0.007|0.0|-1.0|502|243|679| |0.003|0.0|-1.0|502|230|679| |0.003|0.0|-1.0|502|42|277| |0.006|0.001|-1.0|502|158|277| |0.003|0.0|-1.0|502|527|155| |0.003|0.0|-1.0|502|317|679| |0.005|0.001|-1.0|502|231|679| |0.001|0.0|-1.0|502|2053|679| |0.003|0.0|-1.0|502|525|155| |0.001|11.19|-1.0|502|23173|679| 
|0.001|9.81|-1.0|502|2213|679| |0.001|7.847|-1.0|502|3473|679| |0.071|0.388|-1.0|502|4338|679| |0.001|0.001|-1.0|502|216|679| |0.001|1.029|-1.0|502|2083|679| |0.009|0.001|-1.0|502|220|674| |0.041|0.001|-1.0|502|239|277| |0.001|10.941|-1.0|502|1620|277| |0.07|7.112|-1.0|502|1999|679| |0.011|0.0|-1.0|502|527|155| |0.009|0.001|-1.0|502|527|155| |0.003|0.0|-1.0|502|699|679| |0.003|10.505|-1.0|502|2251|679| |0.004|0.0|-1.0|502|263|679| |0.001|7.427|-1.0|502|1620|277| |0.001|6.66|-1.0|502|2180|679| |0.007|0.0|-1.0|502|326|277| |0.004|0.0|-1.0|502|222|272| |0.004|0.0|-1.0|502|193|272| |0.005|0.0|-1.0|502|231|679| |0.006|0.001|-1.0|502|317|679| |0.004|0.0|-1.0|502|213|679| |0.008|0.0|-1.0|502|236|679| |0.005|0.001|-1.0|502|246|679| |0.011|0.0|-1.0|502|527|155| |0.003|0.0|-1.0|502|159|277| |0.003|0.0|-1.0|502|159|277| |0.006|0.0|-1.0|502|527|155| |0.01|0.0|-1.0|502|43|277| |0.005|0.001|-1.0|502|159|277| |0.003|0.0|-1.0|502|76|277| |0.009|0.001|-1.0|502|527|155| |0.003|0.0|-1.0|502|182|277| |0.009|0.001|-1.0|502|231|679| |0.001|6.072|-1.0|502|2130|679| |0.004|0.0|-1.0|502|326|277| |0.004|0.0|-1.0|502|267|272| |0.005|0.001|-1.0|502|260|277| |0.001|11.049|-1.0|502|4338|679| |0.001|10.246|-1.0|502|2377|277| |0.001|8.06|-1.0|502|2213|679| |0.001|8.636|-1.0|502|24132|679| |0.001|10.642|-1.0|502|3055|679| |0.005|0.001|-1.0|502|200|679| |0.003|0.0|-1.0|502|211|679| |0.003|0.0|-1.0|502|102|277| |0.006|0.001|-1.0|502|272|679| |0.001|0.0|-1.0|502|234|277| |0.003|0.0|-1.0|502|527|155| |0.001|3.427|-1.0|502|2186|679| |0.001|8.819|-1.0|502|2036|679| |0.002|10.29|-1.0|502|2186|679|

This seems to align with documented behavior of ALBs according to AWS.

The target closed the connection with a TCP RST or a TCP FIN while the load balancer had an outstanding request to the target

The load balancer receives a request and forwards it to the target. The target receives the request and starts to process it, but closes the connection to the load balancer too early. This usually occurs when the duration of the keep-alive timeout for the target is shorter than the idle timeout value of the load balancer. Make sure that the duration of the keep-alive timeout is greater than the idle timeout value.

Check the values for the request_processing_time, target_processing_time and response_processing_time fields.

In this access log entry, the request_processing_time is 0.001, the target_processing_time is 4.205, and the response_processing_time is -1.

I'm also curious whether you've run into any issues with the static timeout values in uWebSockets. Are there any prior reports of running into this 502 issue on existing deployments because of this?

Given the scenario, there are a few things I'd like to follow up on:

FYI, there's an existing issue on hyper-express if any conversation needs to be moved back there.

tayler-king commented 1 year ago

Are you able to reproduce this with just a uWS setup, without the hyper-express wrapper?

uNetworkingAB commented 1 year ago

Based on the following evidence, it seems that there's an issue with uSockets closing open connections with TCP RST/FIN before the AWS ALB timeout has elapsed.

uWS does not know about your proxy. It only looks at standard HTTP and shuts down any connection that lingers for more than 10 seconds without sending a new request. If a proxy expects to keep a connection open for long, it needs to send HEAD requests or similar, or just take the RST packet properly.
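For anyone who wants to see this behaviour in isolation, here is a minimal sketch (assuming a uWS.js app on localhost:3000 with a handler registered for /) that opens a keep-alive connection, sends a single request, then stays idle and logs when the server closes the socket - with the current hardcoded timeout that should be roughly 10 seconds:

```js
// Sketch: hold an idle keep-alive connection and time the server-side close.
const net = require('net');

const socket = net.connect(3000, 'localhost', () => {
  const start = Date.now();
  socket.write('GET / HTTP/1.1\r\nHost: localhost\r\nConnection: keep-alive\r\n\r\n');
  socket.on('data', () => { /* response received; now stay idle on purpose */ });
  socket.on('close', () => {
    console.log(`server closed the idle connection after ${(Date.now() - start) / 1000}s`);
  });
});
```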

Are there any prior reports of running into this 502 issue on existing deployments because of this?

Yes, I've heard similar reports but they didn't go anywhere constructive. I don't see how this is an issue with uWS, since we follow the spec and protect against long-lingering connection hogging. If we don't have a timeout, we will have connection memory leaks.

uNetworkingAB commented 1 year ago

Did you see this one? https://github.com/uNetworking/uWebSockets/releases/tag/v20.44.0

Question: Have you ever got a request to work with this load balancer at all? Or have you only seen these 502s?

jsonmorris commented 1 year ago

@tayler-king I can give that a shot with this lib, or do you want a reproducible example in uWebSockets directly?

@uNetworkingAB thanks for your insight! Of course there's a need for a timeout - my ask is whether you could revisit exposing it as a constructor parameter or similar. AWS is very clear that upstreams must not have a shorter timeout period than the idleTimeout of an ALB.

Have you ever got a request to work with this load balancer at all?

Yes, we are only seeing issues with 502s on ~0.05% of requests. At scale, that's a lot of requests.

If a proxy expects to keep a connection open for long, it needs to send HEAD requests or similar, or just take the RST packet properly.

Unfortunately this behavior is on AWS's side, and their ALB implementation is a black box.

Yes, I've heard similar reports but they didn't go anywhere constructive.

Any other ideas for how I can help debug this and give you more fruitful feedback?

Some more information on our infrastructure setup.

zaksmok commented 1 year ago

Is it possible that you receive some crazy requests, like ./favicon.ico, which cause errors on the server? Maybe AWS sends some health checks that are causing server hangs and AWS returns 502s.

I don't have any issues with uWebSockets.js, using it on many production deployments; one platform handles 600 requests per second and we have 0 failures.

jsonmorris commented 1 year ago

The 502s are pretty evenly distributed across different request paths, including some that don't exist (like favicon.ico and other crawled endpoints) and others, like health checks, that do.

which cause errors on the server ... that are causing server hangs

Any advice on how I could measure server hangs? I suspect that's in uSockets or uWS but not uWS.js.
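One option on the JS side is Node's built-in event-loop delay histogram from perf_hooks - a rough sketch below; note this only surfaces stalls in the JS event loop and cannot see what the native uSockets/uWS layer does with the TCP connection:

```js
const { monitorEventLoopDelay } = require('perf_hooks');

// Sample event-loop delay continuously; sustained high p99 values mean the JS
// side is blocking the loop, which would also stall uWS.js request handlers.
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

setInterval(() => {
  // Histogram values are in nanoseconds; convert to milliseconds for logging.
  console.log('event loop delay p99 (ms):', histogram.percentile(99) / 1e6);
  histogram.reset();
}, 10000);
```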

zaksmok commented 1 year ago

Try adding a catch-all to your code, app.get('/*') etc., to see if it helps.

uNetworkingAB commented 1 year ago

In general, what was added in v20.44 (see the release post) should help debug this, as you would get errors other than just 502. For instance, malformed requests or unhandled requests would return other status codes so you can differentiate.

Right now, uWS.js does not have the latest from uWS, so those errors are not used. Maybe I should push a new release and you can try that one, then we will at least get clear errors for some failures

uNetworkingAB commented 1 year ago

20.44 supports these errors:

HTTP_ERROR_505_HTTP_VERSION_NOT_SUPPORTED = 1,
HTTP_ERROR_431_REQUEST_HEADER_FIELDS_TOO_LARGE = 2,
HTTP_ERROR_400_BAD_REQUEST = 3

So instead of just dropping the connection with a RST (which would be seen as 502), those errors would be returned.

jsonmorris commented 1 year ago

Maybe I should push a new release and you can try that one, then we will at least get clear errors for some failures

@uNetworkingAB that would be fantastic!

jsonmorris commented 1 year ago

@zaksmok thanks for the suggestion. That did help us determine that some of the requests that were 502'ing had no routes matching their request paths. However, after adding that handler, there are still some requests throwing 502, so I'm hoping to dig into this a bit more.

zaksmok commented 1 year ago

I would recommend adding a global try/catch around your whole code, plus implementing onAborted if you use async calls inside uWebSockets.js.

I also suggest using something like Sentry; you can easily integrate their nice global catch helper :)

jsonmorris commented 1 year ago

if you use async calls inside uWebSockets.js.

We do use async handlers. Can you speak more to this point? Is this known to be risky, or does it otherwise require special behavior in an abort event handler?

uNetworkingAB commented 1 year ago

That did help us determine that some of the requests that were 502'ing had no routes matching their request paths. However, after adding that handler, there are still some requests throwing 502, so I'm hoping to dig into this a bit more.

Ah. So you would have preferred a 404 rather than just a RST. 404 should also be added to the list of errors we can return in case there are no handlers.

The rest of your 502s are probably things like invalid header names, too long requests and so on - everything that is malformed or wrong in any way, or unhandled, simply gets a RST in the current uWS.js.

What percentage of 502s do you have remaining now? Must be super tiny?

uNetworkingAB commented 1 year ago

Also, instead of app.get("/*"), use app.any.
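A minimal sketch of that suggestion (the port and 404 body are just placeholders); the catch-all is registered after the real routes so it only matches what nothing else handled:

```js
const uWS = require('uWebSockets.js');

const app = uWS.App();
// ...register the real routes here first...
app.any('/*', (res, req) => {
  // Anything no other route matched gets an explicit 404 instead of being left unhandled.
  res.writeStatus('404 Not Found').end('Not Found');
});
app.listen(3000, (listenSocket) => {
  if (!listenSocket) console.error('failed to listen on port 3000');
});
```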

uNetworkingAB commented 1 year ago

Async responses are not risky. But you must end them, or close them, eventually. They do not time out if you don't act on them.
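A sketch of what that looks like in practice - fetchFromDatabase is just a stand-in for any async work:

```js
const uWS = require('uWebSockets.js');

// Stand-in for real async work (database call, upstream fetch, etc.).
const fetchFromDatabase = () =>
  new Promise((resolve) => setTimeout(() => resolve('hello'), 100));

uWS.App().get('/data', (res, req) => {
  // Track aborts so we never write to a response whose client has gone away.
  res.aborted = false;
  res.onAborted(() => { res.aborted = true; });

  fetchFromDatabase()
    .then((payload) => {
      if (!res.aborted) res.cork(() => res.end(payload));
    })
    .catch(() => {
      // Without this branch the response would stay open indefinitely.
      if (!res.aborted) res.cork(() => res.writeStatus('500 Internal Server Error').end());
    });
}).listen(3000, () => {});
```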

jsonmorris commented 1 year ago

404 should also be added to the list of errors we can return in case there are no handlers.

That would have saved me some time for sure. :) I think a lot of people would appreciate this as an option out of the box, or just the default. But it's also not that difficult to add.

The rest of your 502s are probably things like invalid header names, too long requests and so on - everything that is malformed or wrong in any way, or unhandled, simply gets a RST in the current uWS.js.

Having logs for all of these unknown, non-handler scenarios (invalid header names, too long requests) would go a long way towards helping us debug where there might be a problem. As it stands today, the request is dropped from the loop and there's no way for us to tell why.

Another thought here is to return HTTP 500s for processing errors happening at the application layer, instead of dropping back to a protocol-level error - which, as you rightly pointed out earlier, is not handled gracefully by load balancers.

What percentage of 502s do you have remaining now? Must be super tiny?

Still a non-zero amount; early indications are that about half of the requests previously throwing 502 are now correctly returning a 404.

uNetworkingAB commented 1 year ago

Alright, so what is the practical problem now? Other than having some 502s turned into 505s, 431s and 400s for clarity, what is the actual problem now?

uNetworkingAB commented 1 year ago

Anyways, the new error codes are building on CI now; I will release when finished, then you will get much more detailed failures.

jsonmorris commented 1 year ago

@uNetworkingAB Super appreciate your timely responses here! The practical problem is that our setup with uWS, which otherwise works exceptionally well for the majority of traffic, drops customer requests with no ability for us to root-cause why. Is it an application-level error with header length? Did the requests time out?

Short of capturing all TCP traffic to investigate later, there's not much available today in uWS.js to help us debug. Even a configurable stack trace would be helpful; adding individual HTTP-level error codes is understandably a lot of work!

tayler-king commented 1 year ago

Short of capturing all TCP traffic to investigate later, there's not much available today in uWS.js to help us debug. Even a configurable stack trace would be helpful; adding individual HTTP-level error codes is a lot of work!

Run caddy or another reverse proxy in front of your uWS instance and log failed requests? Should help narrow it down.

uNetworkingAB commented 1 year ago

https://github.com/uNetworking/uWebSockets.js/releases/tag/v20.31.0

e3dio commented 1 year ago

You are missing the case of hitting the uWS idle timeout at the same time the proxy makes a request, or a request taking 10+ seconds with no other requests.

if the application closes the TCP connection to the load balancer ungracefully the load balancer might send a request to the application before it receives the packet indicating that the connection is closed. If this is the case, then the load balancer sends an HTTP 502 Bad Gateway error to the client.

https://docs.aws.amazon.com/elasticloadbalancing/latest/application/application-load-balancers.html#connection-idle-timeout

2 options to fix: increase the uWS idle timeout to 60 seconds, or decrease the AWS load balancer timeout to 10 seconds. The link above shows how to decrease the AWS load balancer timeout; this should fix the remaining 502 errors.
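For option 2, the ALB idle timeout is a load balancer attribute; something like this applies it from the AWS CLI (the ARN is a placeholder):

```sh
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:region:account:loadbalancer/app/my-alb/1234567890abcdef \
  --attributes Key=idle_timeout.timeout_seconds,Value=10
```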

Another thought here is to return HTTP 500s for processing errors happening at the application layer

@jsonmorris Your app needs to correctly catch and handle all possible errors, otherwise Node.js will crash and AWS will give a 502 error. I use top-level error-handling middleware to catch any uncaught error and return the correct error code and response. This means you need to know how to handle async promise and function errors correctly in JavaScript so none of them crash your process. You are using hyper-express; I can't speak to that, I don't know how it works - it might have some error handling built in. Using plain uWS.js is very easy though, no need for extra layers.
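A rough sketch of that kind of top-level wrapper in plain uWS.js (the safeHandler name and the 500 fallback are illustrative, not part of the uWS.js API):

```js
const uWS = require('uWebSockets.js');

// Wrap every handler so a thrown error or rejected promise becomes a 500
// instead of an unhandled rejection / crashed process (which the ALB reports as a 502).
const safeHandler = (handler) => (res, req) => {
  res.aborted = false;
  res.onAborted(() => { res.aborted = true; });
  Promise.resolve()
    .then(() => handler(res, req))
    .catch((err) => {
      console.error('handler failed:', err);
      if (!res.aborted) res.cork(() => res.writeStatus('500 Internal Server Error').end());
    });
};

uWS.App()
  .get('/work', safeHandler(async (res) => {
    const body = await Promise.resolve('ok'); // stand-in for real async work
    if (!res.aborted) res.cork(() => res.end(body));
  }))
  .listen(3000, () => {});
```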

jsonmorris commented 1 year ago

@e3dio thanks for your response.

2 options to fix: increase the uWS idle timeout to 60 seconds, or decrease the AWS load balancer timeout to 10 seconds. The link above shows how to decrease the AWS load balancer timeout; this should fix the remaining 502 errors.

The idle timeout in uWS is not configurable, so option #1 isn't actually an option for people who don't want to fork and maintain their own copy of the build infrastructure for Node shared libs.

Which leaves us with:

option #2, decrease AWS load balancer timeout to 10 seconds

This actually led to an increase in 502s and 504s, probably due to real application timeout behavior, so this is untenable for us. Ultimately I don't think this is the correct path forward. We would like to make idle timeouts longer for the ALB -> upstream connection to encourage connection re-use, not shorter.

Your app needs to correctly catch and handle all possible errors

We are using error handling in the application layer. hyper-express itself has global error handlers and we have those set up correctly.

The problem here is with error handling in the uWS HTTP context itself, which makes its own decision to close open TCP connections with RSTs. This does not, as you suspect, happen at the HTTP layer; it is "uncatchable" behavior from the point of view of Node. Until recently (see the above discussion), there was no insight into this and no HTTP-level errors thrown for this behavior.

uNetworkingAB commented 1 year ago

Did the recent version help in reducing 502s? What remains now?

jsonmorris commented 1 year ago

@uNetworkingAB doing some investigations today - will follow up here. Thanks again for all of your help!

e3dio commented 1 year ago

We would like to make idleTimeouts longer for the ALB -> Upstream connection to encourage connection re-use, not shorter

This is a case where the uWS HTTP timeout could use a custom value. 10 seconds assumes many direct client connections, where lingering connections are a waste of resources; for a direct connection to a load balancer, you might as well add an app option httpTimeout so it can be increased.

uNetworkingAB commented 1 year ago

Yeah we need 2 configs: how long a connection will wait for another request, and also a config for how long a client can stall uploads. Currently I think both are 10 seconds.

Ideally the 10s should be configurable, and the upload config should be "minimal throughput in bytes per second" rather than a fixed timeout.
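Purely as an illustration of the proposal - none of these options exist in uWS.js today - the two knobs might look something like:

```js
const uWS = require('uWebSockets.js');

// Hypothetical option names, only to illustrate the two configs discussed above.
const app = uWS.App({
  idleTimeoutSeconds: 60,         // how long a keep-alive connection may wait for the next request
  minUploadThroughputBytes: 1024, // minimum upload throughput before a stalled client is dropped
});
```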

jsonmorris commented 1 year ago

@uNetworkingAB good point, what are the chances we could get HTTP_IDLE_TIMEOUT_S modified to be an application option on the default HttpContext rather than a hardcoded parameter?

I'm still seeing some 502s today, so it seems whatever is causing the connection closes is likely not one of the error cases for which you've already added HTTP codes.

uNetworkingAB commented 1 year ago

You can begin by cloning the repo recursively, then changing HTTP_IDLE_TIMEOUT_S to 60, and then just hitting make like we do in the GitHub Actions. If that solves the problem, I can work on making it configurable. But first we need to know if this fixes the problem.
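Roughly, that workflow looks like the following (a sketch; the constant lives somewhere in the bundled uWebSockets sources, so grep for it rather than relying on a hardcoded path):

```sh
git clone --recursive https://github.com/uNetworking/uWebSockets.js.git
cd uWebSockets.js
# Find where the constant is defined in the bundled C++ sources...
grep -rn "HTTP_IDLE_TIMEOUT_S" uWebSockets/src/
# ...change the value from 10 to 60, then build the native addon the same way CI does:
make
```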

uNetworkingAB commented 1 year ago

Did this solve the problem?

rolljee commented 10 months ago

Hello, I am currently experiencing the same issue on AWS as well. We have forked and modified the repository to increase HTTP_IDLE_TIMEOUT & HTTP_TIMEOUT, setting them both to 60. We are still seeing 502s.

AWS ALB is configured to have a 50s IDLE_TIMEOUT (but we tested 60 as well)

@jsonmorris did you manage to get this working by any chance ?

Also, we did a bit of digging and found out that version 20.14.0 works properly, and we started seeing these 502s after upgrading to 20.15.0.

Since 20.15.0 updates uWS from 20.25.0 to 20.30.0, there is some implementation change between these versions that the ALB does not play well with. But I haven't managed to pinpoint the exact version; I don't know how to build binaries for each individual version and test them inside our application.

jsonmorris commented 9 months ago

Hi @rolljee. I haven't looked into this in a few months - we gave up and ate the performance cost of moving back to regular express. But, I'm very interested in what's going on here so +1 to any further engagement by @uNetworkingAB authors.

version 20.14.0 works properly ... we started seeing these 502s after upgrading to 20.15.0

Are you back to using 20.14.0 for the time being?

I'm sure a repro of this would be very useful for the authors if you could post a simplified example.

The source commit / parent for uWS (commit diff, version diff) for the v20.15.0 release of uWS.js does show some changes to HTTP parsing and socket close behavior, but far be it from me to understand what the implications are.