tmds / Tmds.LinuxAsync


ASP.NET Schedulers #84

Open tmds opened 4 years ago

tmds commented 4 years ago

@davidfowl @halter73 we've added control over the schedulers that get used.

We can control these:

The bold values are the default.

When we inline all these schedulers and use ProcessorCount threads, we measured 1M rps for middleware JSON on Citrine.

I wonder which of these would be safe to inline in a production app, where there may be potentially blocking code from the user?

I guess this would be c (because the application code is decoupled to the ThreadPool later) and also o (because the application code runs on the output pipe writer scheduler). Is that correct? What about i?

cc @antonfirsov @adamsitnik

halter73 commented 4 years ago

For c, we've found inline scheduling is best on linux with Kestrel's Socket transport, so that's already the default there. See https://github.com/aspnet/KestrelHttpServer/issues/2573.

For i (and the output pipe write scheduler) we've always used the threadpool as a safe default in case the app developer blocks. This used to be configurable via KestrelServerOptions.ApplicationSchedulingMode, but that option got removed when pipe creation was made a transport responsibility. Generally this did result in faster benchmark runs, but we have not added similar transport options yet.
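
For context, here is a minimal sketch of where that choice lives, using the standard System.IO.Pipelines API rather than Kestrel's actual transport code: the transport picks the reader scheduler when it creates the connection's input pipe.

```csharp
using System.IO.Pipelines;

static class TransportPipeSketch
{
    // Minimal sketch (not Kestrel's actual transport code): the transport picks
    // the scheduler for the pipe that carries received data to the application.
    // PipeScheduler.ThreadPool is the "safe" default: application continuations
    // (the 'i' scheduler in this thread) run on the ThreadPool, so a blocking
    // handler cannot stall the I/O thread. PipeScheduler.Inline runs them on
    // whatever thread completes the flush, i.e. the transport thread.
    public static PipeOptions CreateInputPipeOptions(bool inlineApplicationCode)
    {
        PipeScheduler readerScheduler = inlineApplicationCode
            ? PipeScheduler.Inline       // application code runs on the I/O (epoll) thread
            : PipeScheduler.ThreadPool;  // application code is dispatched to the ThreadPool

        return new PipeOptions(
            readerScheduler: readerScheduler,
            writerScheduler: PipeScheduler.Inline, // the transport writes from its own thread
            useSynchronizationContext: false);
    }
}
```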

For o, it would certainly be safe to be inline (we basically do this on a per-HTTP/2-stream in Http2OutputProducer in Kestrel), but the benchmarking I've done before with Kestrel's Socket transport has shown the ioqueue to be slightly faster.

tmds commented 4 years ago

For c, we've found inline scheduling is best on linux with Kestrel's Socket transport, so that's already the default there

This c is a different one. It controls whether the Socket continuations are run on threadpool or stay on epoll thread. I assume it is safe to inline, because i defers to ThreadPool later.
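
To make concrete what this c toggles, here is a rough sketch (not the actual Tmds.LinuxAsync implementation): when a receive completes on the epoll thread, the pending continuation either runs right there or is queued to the ThreadPool.

```csharp
using System;
using System.Threading;

static class ContinuationDispatchSketch
{
    // Sketch only: the epoll loop has completed a receive and must run the
    // pending continuation. The 'c' setting decides where that happens.
    public static void CompleteReceive(Action continuation, bool inlineOnEpollThread)
    {
        if (inlineOnEpollThread)
        {
            // c=inline: stay on the epoll thread. Cheap, but anything the
            // continuation does synchronously delays further epoll processing.
            continuation();
        }
        else
        {
            // c=threadpool: hand off to the ThreadPool and return to epoll_wait.
            ThreadPool.UnsafeQueueUserWorkItem(state => ((Action)state!)(), continuation);
        }
    }
}
```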

For i (and the output pipe write scheduler) we've always used the threadpool as a safe default in case the app developer blocks.

ok, I will consider it unsafe to use inline here.

There is another use of the ThreadPool not mentioned yet. The handler gets put on the ThreadPool by the KestrelConnection class (see the stack trace below, from the JSON middleware benchmark). Though I assume that doesn't make it safe to set i=inline? Do you recall the rationale for running the handler on the ThreadPool? Is this also happening for the JSON platform benchmark?

For o, it would certainly be safe to be inline (we basically do this on a per-HTTP/2-stream in Http2OutputProducer in Kestrel), but the benchmarking I've done before with Kestrel's Socket transport has shown the ioqueue to be slightly faster.

Yes, benchmarks we ran showed the same thing, though it's hard to understand why.

   at web.JsonMiddleware.Invoke(HttpContext httpContext) in /home/tmds/repos/Tmds.LinuxAsync/test/web/JsonMiddleware.cs:line 20
   at web.PlaintextMiddleware.Invoke(HttpContext httpContext) in /home/tmds/repos/Tmds.LinuxAsync/test/web/PlaintextMiddleware.cs:line 25
   at Microsoft.AspNetCore.HostFiltering.HostFilteringMiddleware.Invoke(HttpContext context)
   at Microsoft.AspNetCore.Hosting.HostingApplication.ProcessRequestAsync(Context context)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.ProcessRequests[TContext](IHttpApplication`1 application)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.ProcessRequests[TContext](IHttpApplication`1 application)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.ProcessRequestsAsync[TContext](IHttpApplication`1 application)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.ProcessRequestsAsync[TContext](IHttpApplication`1 application)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.HttpConnection.ProcessRequestsAsync[TContext](IHttpApplication`1 httpApplication)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.HttpConnection.ProcessRequestsAsync[TContext](IHttpApplication`1 httpApplication)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.HttpConnectionMiddleware`1.OnConnectionAsync(ConnectionContext connectionContext)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.ConnectionDispatcher.<>c__DisplayClass8_0.<StartAcceptingConnectionsCore>b__1(ConnectionContext c)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Infrastructure.KestrelConnection`1.ExecuteAsync()
tmds commented 4 years ago

Though I assume that doesn't make it safe to set i=inline?

I figured out this isn't safe. The stacktrace changes for successive requests on the same connection. I still have these questions:

Do you recall the rationale for running the handler on ThreadPool? Is this also happening for JSON platform benchmark?

tmds commented 4 years ago

Do you recall the rationale for running the handler on ThreadPool?

I think I got this too. If there is already a request in the input pipe, request handling starts immediately, and that stops Kestrel from accepting more connections until the handling goes async. It reduces Kestrel's accepted-connections-per-second performance.
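
A simplified sketch of that effect; this is illustrative only and not Kestrel's actual KestrelConnection code, and the IListener/IConnection abstractions are hypothetical, just enough to make it self-contained.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical minimal abstractions, only to make this sketch self-contained.
interface IListener { Task<IConnection> AcceptAsync(); }
interface IConnection { }

static class AcceptLoopSketch
{
    // Illustrative accept loop (not Kestrel's actual code). If the connection
    // delegate runs synchronously until its first await (for example because a
    // request is already sitting in the input pipe), calling it inline delays
    // the next AcceptAsync. Dispatching to the ThreadPool keeps the accept
    // loop free, at the cost of one extra hop per connection.
    public static async Task RunAsync(IListener listener, Func<IConnection, Task> handler, bool dispatch)
    {
        while (true)
        {
            IConnection connection = await listener.AcceptAsync();

            if (dispatch)
            {
                _ = Task.Run(() => handler(connection)); // handler starts on a ThreadPool thread
            }
            else
            {
                _ = handler(connection); // runs on this thread until the handler goes async
            }
        }
    }
}
```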

Is this also happening for JSON platform benchmark?

I guess so. It doesn't matter much for the TE benchmarks, since it only affects the first request on a connection.

tmds commented 4 years ago

Adding some graphs from benchmarks.

Using threadpool

i=threadpool
c=threadpool

(graph: tp2)

(x-axis is the number of epoll threads, from 0 to ProcessorCount; I cut off the values because the Excel webview renders them wrong in Chrome)

Since epoll threads don't do much work, we don't need many of them.

Here IoQueue has better performance than Inline. This raises the question: could the ThreadPool improve so that Inline performs as well as IoQueue?

Using epoll threads

i=inline
c=inline

(graph: in2)

Since the epoll threads do a lot of the work, we need a lot of them.

Here IoQueue performs worse than Inline.

Other

@antonfirsov and I discussed running benchmarks also for this combination:

c=inline
i=threadpool

It would also be meaningful in light of the previous discussion, because this combination could be used safely by Kestrel even when users may have blocking handlers.

davidfowl commented 4 years ago

That would be an ideal option IMO for Kestrel, and it should be opt-in at the socket level.

halter73 commented 4 years ago

It looks like you answered all of your own questions @tmds 😄.

I agree an option to enable the inline scheduling of app code (i.e. ApplicationSchedulingMode) is something we should add back as a SocketTransportOption. I'll look into this.
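
For illustration, a sketch of what such a transport-level opt-in could look like when configuring the socket transport. The property name is an assumption here, based on the UnsafePreferInlineScheduling option that later .NET versions expose on SocketTransportOptions; it is not claimed to be the API that was actually added at the time.

```csharp
using Microsoft.AspNetCore.Hosting;

public static class InlineSchedulingSketch
{
    // Sketch: opt in to inline scheduling of application code on the socket
    // transport. Only safe when request handlers never block, since they will
    // run on the transport's I/O threads. The property name is an assumption
    // (later .NET versions expose UnsafePreferInlineScheduling on
    // SocketTransportOptions; older versions have no such option).
    public static IWebHostBuilder UseInlineAppScheduling(IWebHostBuilder builder) =>
        builder.UseSockets(options =>
        {
            options.UnsafePreferInlineScheduling = true;
        });
}
```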

halter73 commented 4 years ago

This c is a different one. It controls whether the Socket continuations are run on threadpool or stay on epoll thread.

That makes sense. IIRC, the reason we run Socket.ReceiveAsync continuations inline on Linux, but not Windows, is because on Linux the continuation is already dispatched to a threadpool thread, whereas on Windows we're still on an I/O thread.

I assume it is safe to inline, because i defers to ThreadPool later.

Agreed. As long as we're not calling into non-framework code it should be safe to inline.

tmds commented 4 years ago

As long as we're not calling into non-framework code it should be safe to inline.

@halter73 Is combination of c=inline, i=threadpool safe to use with potentially blocking application code? I think so, since i=threadpool will put that code on the threadpool. Is that correct?

halter73 commented 4 years ago

Yes. Running continuations inline is safe if all they do is write to a PipeWriter and the reader scheduler is set to the threadpool.
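
A minimal, self-contained illustration of that pattern with System.IO.Pipelines, as a sketch rather than transport code: the producer writes and flushes inline, while the read-side continuations, and therefore any user code, are dispatched to the ThreadPool (the i=threadpool side of the combination).

```csharp
using System;
using System.Buffers;
using System.IO.Pipelines;
using System.Text;
using System.Threading.Tasks;

class ThreadPoolReaderSketch
{
    static async Task Main()
    {
        // Reader continuations (the 'i' side: the code consuming the data) are
        // scheduled on the ThreadPool, so a blocking consumer cannot stall the
        // thread that produces data; flush continuations on the writer stay inline.
        var pipe = new Pipe(new PipeOptions(
            readerScheduler: PipeScheduler.ThreadPool,
            writerScheduler: PipeScheduler.Inline,
            useSynchronizationContext: false));

        Task consumer = Task.Run(async () =>
        {
            ReadResult result = await pipe.Reader.ReadAsync(); // continuation dispatched via PipeScheduler.ThreadPool
            Console.WriteLine(Encoding.UTF8.GetString(result.Buffer.ToArray()));
            pipe.Reader.AdvanceTo(result.Buffer.End);
            await pipe.Reader.CompleteAsync();
        });

        // Producer side: write and flush; with an inline writer scheduler this
        // mimics a transport thread pushing received bytes into the input pipe.
        await pipe.Writer.WriteAsync(Encoding.UTF8.GetBytes("hello"));
        await pipe.Writer.CompleteAsync();

        await consumer;
    }
}
```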

antonfirsov commented 4 years ago

@tmds @halter73: I did a comparison between the configurations discussed here.

(benchmark comparison graph)

Explanation of labels:

| Label | Configuration |
| --- | --- |
| Optimal | c=inline i=inline a=true o=iothread |
| c=inline | c=inline i=threadpool a=true o=iothread |
| c=threadpool | c=threadpool i=threadpool a=true o=iothread |

Note: I ran this on asp-citrine-lin, while last weekend's benchmarks were run on asp-citrine-db, which has a 5.5 kernel.

tmds commented 4 years ago

Combining with results from https://github.com/tmds/Tmds.LinuxAsync/issues/78#issuecomment-610658950, and taking best results for (c,i) combinations:

|  | Config 1 | Config 2 | Config 3 | Config 4 |
| --- | --- | --- | --- | --- |
| c | threadpool | inline | inline | threadpool |
| i | threadpool | inline | threadpool | inline |
| t | 1 | ProcessorCount | ProcessorCount/2 | 1 |
| o | ioqueue | iothread/inline | ioqueue/inline | ioqueue/inline |
| rps, AIO | 900k | 1026k/1000k | 900k/860k | 870k/840k |
| rps, no AIO | 887k | 978k/861k | | |
| app safe | yes | no | yes | yes |

Config 1 is most similar to the current behavior. To do better (900k++), we need to move to config 2, which is not suitable for running user-code that may block the epoll threads.

Some thoughts about o:

edit: added rps, no AIO row

tmds commented 4 years ago

To do better (900k++), we need to move to config 2

The gain we get in config 2 is thanks to using AIO. With o=inline and AIO disabled, we get 860k.

adamsitnik commented 4 years ago

I wish we could get rid of ioqueue by making ThreadPool be smarter about this workload, but I don't know if this is feasible.

@kouvel might be able to answer this question

kouvel commented 4 years ago

I wish we could get rid of ioqueue by making ThreadPool be smarter about this workload, but I don't know if this is feasible.

Yes, I think that would be feasible and would be a better option than having another pool of threads, but it may have some tradeoffs; we would have to try it out.

kouvel commented 4 years ago

I misunderstood and thought there was another pool of threads. I did some experiments relevant to using the thread pool directly instead of the IOQueue; so far I have not seen the thread pool performing better. It seems there is a benefit from batching and maybe other things. I have posted some more info in our Teams channel. Based on the info I have so far, it does not appear feasible to remove the IOQueue by using the thread pool instead.
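
For readers wondering what the IOQueue adds over plain ThreadPool dispatch, here is a heavily simplified sketch of the batching idea. This is not Kestrel's actual IOQueue implementation, and the lost-wakeup race at the end is deliberately glossed over.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO.Pipelines;
using System.Threading;

// Heavily simplified sketch of an IOQueue-style batching scheduler
// (not Kestrel's actual implementation). Callers enqueue callbacks; at most
// one ThreadPool work item is in flight, and it drains the whole queue,
// instead of paying one ThreadPool dispatch per callback.
sealed class BatchingScheduler : PipeScheduler, IThreadPoolWorkItem
{
    private readonly ConcurrentQueue<(Action<object?> action, object? state)> _work = new();
    private int _scheduled; // 0 = idle, 1 = a drain is queued or running

    public override void Schedule(Action<object?> action, object? state)
    {
        _work.Enqueue((action, state));

        // Only queue a drain if one is not already pending: this is the batching.
        if (Interlocked.CompareExchange(ref _scheduled, 1, 0) == 0)
        {
            ThreadPool.UnsafeQueueUserWorkItem(this, preferLocal: false);
        }
    }

    void IThreadPoolWorkItem.Execute()
    {
        while (_work.TryDequeue(out var item))
        {
            item.action(item.state);
        }
        Volatile.Write(ref _scheduled, 0);
        // Note: a real implementation must handle work enqueued between the
        // last TryDequeue and resetting _scheduled; omitted for brevity.
    }
}
```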