microsoft / reverse-proxy

A toolkit for developing high-performance HTTP reverse proxy applications.
https://microsoft.github.io/reverse-proxy
MIT License

504 timeouts when experiencing relatively high traffic #2591

Closed peaeater closed 1 week ago

peaeater commented 2 months ago

Describe the bug

We have slowly added sites to a YARP proxy on IIS for the past few weeks. The downstream sites are also running on IIS on the same machine. After adding a sixth site, YARP crapped the bed and returned constant 504 Gateway Timeout errors for all sites being proxied. The traffic it was handling was relatively intense for us (maybe up to a hundred thousand requests per hour?) but the reverse proxy site didn't appear to be sweating in terms of CPU or memory usage.

Many .NET Runtime exceptions are logged to the Windows Application event log in association with the bed-crapping. An example is below.

Does this mean the Windows machine is underpowered? Would setting connection timeouts help? Would setting minimum thread pool counts help? Just not sure where to start, and the YARP documentation doesn't appear to address anything like this.

Category: Yarp.ReverseProxy.Forwarder.HttpForwarder
EventId: 48
SpanId: 945791c6df1aaa48
TraceId: c6802e1420d99f5ea9c86fc1d16faf4f
ParentId: 0000000000000000
RequestId: 40006eb8-0003-6600-b63f-84710c7967bb
RequestPath: /apple-touch-icon.png

RequestTimedOut: The request timed out before receiving a response.

Exception: 
System.Threading.Tasks.TaskCanceledException: The operation was canceled.
 ---> System.TimeoutException: A connection could not be established within the configured ConnectTimeout.
   --- End of inner exception stack trace ---
   at System.Net.Http.HttpConnectionPool.CreateConnectTimeoutException(OperationCanceledException oce)
   at System.Net.Http.HttpConnectionPool.AddHttp11ConnectionAsync(QueueItem queueItem)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(IAsyncStateMachineBox box, Boolean allowInlining)
   at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.SetException(Exception exception, Task`1& taskField)
   at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(IAsyncStateMachineBox box, Boolean allowInlining)
   at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.SetException(Exception exception, Task`1& taskField)
   at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(IAsyncStateMachineBox box, Boolean allowInlining)
   at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.SetException(Exception exception, Task`1& taskField)
   at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(IAsyncStateMachineBox box, Boolean allowInlining)
   at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.SetException(Exception exception, Task`1& taskField)
   at System.Net.Sockets.Socket.<ConnectAsync>g__WaitForConnectWithCancellation|285_0(AwaitableSocketAsyncEventArgs saea, ValueTask connectTask, CancellationToken cancellationToken)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.Tasks.Sources.ManualResetValueTaskSourceCore`1.SignalCompletion()
   at System.Net.Sockets.SocketAsyncEventArgs.<DnsConnectAsync>g__Core|112_0(MultiConnectSocketAsyncEventArgs internalArgs, Task`1 addressesTask, Int32 port, SocketType socketType, ProtocolType protocolType, CancellationToken cancellationToken)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Net.Sockets.SocketAsyncEventArgs.<>c.<.cctor>b__173_0(UInt32 errorCode, UInt32 numBytes, NativeOverlapped* nativeOverlapped)
   at System.Threading.ThreadPoolTypedWorkItemQueue`2.System.Threading.IThreadPoolWorkItem.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
--- End of stack trace from previous location ---
   at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.DiagnosticsHandler.SendAsyncCore(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at Yarp.ReverseProxy.Forwarder.HttpForwarder.SendAsync(HttpContext context, String destinationPrefix, HttpMessageInvoker httpClient, ForwarderRequestConfig requestConfig, HttpTransformer transformer, CancellationToken cancellationToken)

Further technical details

YARP 2.2.0-preview.1.24266.1
Windows Server 2022 Standard 10.0.20348 Build 20348
24.0 GB RAM
Intel Xeon Silver 4208 CPU @ 2.10 GHz, 2100 Mhz, 2 Cores, 2 Logical Processors x 4

MihaZupan commented 2 months ago

> maybe up to a hundred thousand requests per hour?

That's ~30 requests/s, which isn't a lot for a proxy to handle. The specs you listed should be more than sufficient to handle it.
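For reference, the conversion behind that figure can be sketched as follows. This is only a quick check: the single input is the upper estimate from the original report, and the function name `average_rps` is made up for illustration.

```python
def average_rps(requests_per_hour: float) -> float:
    """Convert an hourly request count into an average per-second rate."""
    seconds_per_hour = 60 * 60
    return requests_per_hour / seconds_per_hour


# Upper estimate quoted in the report ("maybe up to a hundred thousand requests per hour").
rate = average_rps(100_000)
print(f"{rate:.1f} requests/s on average")  # prints "27.8 requests/s on average"
```

This is an average over the hour, so short bursts could be well above it.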

> A connection could not be established within the configured ConnectTimeout.

This indicates that YARP wasn't able to establish new connections with your backend destinations. This may indicate a networking/connectivity issue between your servers, or your destination servers not accepting connections for some reason. It's hard to provide more info from YARP as all we see is that connections aren't going through.

Does the issue only appear at high load? Can you establish new connections to backend servers via other means while YARP can't?
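One way to check the second question independently of YARP is a small probe that tries to open a single new TCP connection to a backend and reports how long it took. The sketch below is a rough diagnostic aid, not something provided by YARP. The host, port, and 5-second timeout in the usage part are placeholders, and `probe_tcp_connect` is a hypothetical helper name.

```python
import socket
import time


def probe_tcp_connect(host: str, port: int, timeout_s: float = 5.0) -> tuple[bool, float, str]:
    """Try to open one new TCP connection; return (ok, elapsed_seconds, detail)."""
    start = time.monotonic()
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True, time.monotonic() - start, "connected"
    except OSError as exc:  # covers timeouts, refused connections, DNS failures
        return False, time.monotonic() - start, f"{type(exc).__name__}: {exc}"


if __name__ == "__main__":
    # Placeholder address: replace with a real destination from the proxy config.
    ok, elapsed, detail = probe_tcp_connect("localhost", 5001, timeout_s=5.0)
    print(f"ok={ok} elapsed={elapsed:.3f}s detail={detail}")
```

Run it on the proxy machine while the 504s are occurring. If the probe connects quickly while YARP still times out, that suggests the problem is not simply that the backend is refusing all new connections.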

peaeater commented 2 months ago

The issue only appears at high load, and has happened twice at about the same threshold (where "threshold" is 6 sites being proxied with ~30 requests/s).

The backend destinations are on the same server, also served through IIS, so there's no connectivity issue between servers here. As soon as the sites are un-proxied they respond normally, plus all of them act the same way when this problem occurs, so I don't suspect they are the problem. (All of them are ASP.NET 6 or 8 websites.)

Also, when this problem occurs and I remove one or more sites from the proxy list, the problem doesn't clear up right away: the remaining sites keep failing through YARP instead of responding normally, and the same problem persists for quite some time. It's like there's some kind of resource exhaustion that takes a long time to clear up, even though YARP CPU and memory usage are relatively low. Restarting the YARP website, recycling its application pool, and/or killing its tasks in Task Manager doesn't help.

I should also mention that each of the proxied sites is being rate limited, with a non-partitioned fixed window rule of 75 requests per 10 seconds and a queue limit of 25. I can't imagine why that would matter, but who knows.
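To put those limiter numbers next to the traffic estimate, here is a rough back-of-the-envelope sketch. It assumes each of the six sites has its own independent limiter, which is how "each of the proxied sites" reads but is not confirmed. The function name `sustained_rps` is made up for illustration.

```python
def sustained_rps(permit_limit: int, window_seconds: float) -> float:
    """Average rate a fixed window allows when every window is fully used."""
    return permit_limit / window_seconds


PERMIT_LIMIT = 75    # requests allowed per window, per site (from the thread)
WINDOW_SECONDS = 10  # window length in seconds (from the thread)
QUEUE_LIMIT = 25     # requests allowed to wait for permits; does not raise the sustained rate
SITES = 6            # assumption: one independent limiter per proxied site

per_site = sustained_rps(PERMIT_LIMIT, WINDOW_SECONDS)
aggregate = per_site * SITES
print(f"{per_site} requests/s per site, {aggregate} requests/s across {SITES} sites")
```

Under that assumption the limiters would let through at most 7.5 requests/s per site, or 45 requests/s in total. That is above the ~30 requests/s average mentioned earlier, although traffic is unlikely to be spread evenly across sites.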

MihaZupan commented 1 month ago

> I should also mention that each of the proxied sites is being rate limited, with a non-partitioned fixed window rule of 75 requests per 10 seconds and a queue limit of 25. I can't imagine why that would matter, but who knows.

By non-partitioned, you mean that the limit applies to all clients together, or is it based on e.g. IP? Does the issue persist if you remove the rate limiting?

> A connection could not be established within the configured ConnectTimeout.

This is a pretty generic error indicating that we weren't able to establish new connections. Do you have any corresponding logs from the backend servers indicating why they're not accepting connections?

peaeater commented 1 month ago

> By non-partitioned, you mean that the limit applies to all clients together, or is it based on e.g. IP?

Non-partitioned meaning the limit applies to all clients. YARP returns 429 status codes when a client has hit the rate limit.

> Does the issue persist if you remove the rate limiting?

We didn't try that and aren't going to. Frankly, the primary purpose of the reverse proxy is to implement rate limiting.

> Do you have any corresponding logs from the backend servers indicating why they're not accepting connections?

Well, no. That's the conundrum - the back end web applications ARE able to accept connections while YARP is logging 504 errors, if, for instance, one makes requests to them via a local hostname that bypasses the reverse proxy.

MihaZupan commented 1 month ago

Would you be able to capture a network trace while this is happening (Wireshark)?

MihaZupan commented 1 week ago

Closing this one as not actionable from our side at the moment. Please feel free to reopen if you're able to collect more info / create a minimal repro.