ktorio / ktor

Framework for quickly creating connected applications in Kotlin with minimal effort
https://ktor.io
Apache License 2.0

Performance Issue / Ktor & Netty #1163

Closed spolnik closed 1 year ago

spolnik commented 5 years ago

Ktor Version

1.2.0

Ktor Engine Used(client or server and name)

Netty Server

JVM Version, Operating System and Relevant Context

11.0.2, Debian (docker)

4 Cores -Xmx2G (parallelism = 4)

Default settings

Feedback

For most cases performance is pretty good: we get < 50 ms including the DB save (all done asynchronously). Unfortunately, from time to time we get requests that are stuck for > 1 s, in the worst cases even > 20 seconds. The server received the request, but the processing route had not been invoked yet. Below is the relevant part of the New Relic instrumentation for such a slow call; the problem is that Ktor is not yet well instrumented, and the only consecutive calls I see taking that long are these:

0 | 0.00% | Truncated: NettyUpstreamDispatcher |   | 0.000  s
-- | -- | -- | -- | --
0 | 0.00% | HttpServerExpectContinueHandler.channelRead() | Async | 0.000  s
0 | 0.00% | HttpServerExpectContinueHandler.channelRead() | Async | 0.000  s
0 | 0.00% | RequestBodyHandler.channelRead() | Async | 0.000  s
16.0 | 0.08% | NettyApplicationCallHandler.channelRead() | Async | 20.025  s
16.0 | 0.08% | NewRelicFeature.wrapIntoNewRelicTransaction() |   | 20.025  s
16.0 | 0.08% | com.revolut.eventstore.api.write.EventsControllerKt/saveEvent |   | 20.025  s
1.0 | 0.00% | Application code (in com.*.api.write.EventsControllerKt/saveEvent)

What I am trying to understand is - what could happen between:

0 | 0.00% | RequestBodyHandler.channelRead() | Async | 0.000  s
-- | -- | -- | -- | --
16.0 | 0.08% | NettyApplicationCallHandler.channelRead() | Async | 20.025  s

Why did it take so long? It's pretty hard to understand where the blocking or under-resourced part might be, so I would really appreciate help with it.

spolnik commented 5 years ago

It seems the issue is mostly related to the sizes of the event loop groups. Do you have any recommendation for those 3 settings? Since I run a lot of asynchronous code, the recommended settings of:

embeddedServer(AnyEngine, configure = {
    connectionGroupSize = parallelism / 2 + 1
    workerGroupSize = parallelism / 2 + 1
    callGroupSize = parallelism 
})

don't work: CPU utilization is pretty low with them, and I still hit the issue described above; only when I manually increase at least callGroupSize to 16 does it start to improve.

Summarising: I now have 4 cores (parallelism = 4), 2 GB RAM, and Netty, and get much better results with

 connectionGroupSize = parallelism / 2 + 1 = 3 // default
 workerGroupSize = parallelism / 2 + 1 = 3  // default
 callGroupSize = 16  // manually set

I would need to analyse your code to understand the best ratio between those 3 settings. Again, your defaults seem not ideal, at least for Netty.
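
A concrete version of the configuration described above might look like the following. This is only a sketch against the Ktor 1.x Netty engine; the port and route are placeholders, and the group-size values are the ones reported in this comment rather than official recommendations.

import io.ktor.application.call
import io.ktor.response.respondText
import io.ktor.routing.get
import io.ktor.routing.routing
import io.ktor.server.engine.embeddedServer
import io.ktor.server.netty.Netty

fun main() {
    val parallelism = Runtime.getRuntime().availableProcessors() // 4 on the box described above

    embeddedServer(Netty, port = 8080, configure = {
        connectionGroupSize = parallelism / 2 + 1 // default: accepts incoming connections
        workerGroupSize = parallelism / 2 + 1     // default: performs socket reads/writes
        callGroupSize = 16                        // raised manually from the default of `parallelism`
    }) {
        routing {
            get("/") { call.respondText("ok") }
        }
    }.start(wait = true)
}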

spolnik commented 5 years ago

Just another comment: it would be useful if you could publish performance guidelines as tests, e.g. in https://www.techempower.com/benchmarks/#section=data-r17&hw=ph&test=fortune&l=xan3i3-1, showing how Ktor can be optimised and how it compares with other frameworks. The existing implementations there don't seem to get the best out of Ktor, and they are quite out of date.

spolnik commented 5 years ago

Additionally, I've noticed that only NioEventLoopGroup is used. Is there any plan to support EpollEventLoopGroup?
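
For context, in plain Netty (outside of Ktor) the native transport is selected by swapping the event loop group and channel classes, roughly as below. This is a sketch of raw Netty usage, not a Ktor API; Epoll is Linux-only and requires the netty-transport-native-epoll dependency.

import io.netty.bootstrap.ServerBootstrap
import io.netty.channel.epoll.Epoll
import io.netty.channel.epoll.EpollEventLoopGroup
import io.netty.channel.epoll.EpollServerSocketChannel
import io.netty.channel.nio.NioEventLoopGroup
import io.netty.channel.socket.nio.NioServerSocketChannel

// Prefer the native epoll transport when it is available, falling back to NIO otherwise.
fun ServerBootstrap.withBestTransport(): ServerBootstrap =
    if (Epoll.isAvailable()) {
        group(EpollEventLoopGroup(1), EpollEventLoopGroup())
            .channel(EpollServerSocketChannel::class.java)
    } else {
        group(NioEventLoopGroup(1), NioEventLoopGroup())
            .channel(NioServerSocketChannel::class.java)
    }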

spolnik commented 5 years ago

After the tweaks I still see a few requests losing time:

Java/io.ktor.server.netty.cio.RequestBodyHandler/channelRead 0.00 ms (timestamp)
// --- GAP --- (blocking/waiting/something?)
Java/io.ktor.server.netty.NettyApplicationCallHandler/channelRead 3.55 s (timestamp)

spolnik commented 5 years ago

Finally, after using shareWorkGroup = true I got:

RequestBodyHandler.channelRead() | Async | 0.000  s (timestamp)
ParametersCloseAsync context = nettyWorkerPool-3-16
// -- | -GAP- | -- 
NettyApplicationCallHandler.channelRead() | Async | 3.795  s (timestamp)
ParametersCloseAsync context = nettyWorkerPool-3-17

Which means the time is lost somewhere in the selection process between the two calls above; I'm looking for ideas on how to deal with it 🤔

A potential improvement may be to switch the selector to Epoll or KQueue, but that has to be added to Ktor.
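
For reference, shareWorkGroup is set on the Netty engine configuration; it makes application calls run on the same event loop group that handles socket I/O instead of on a separate call group. A minimal sketch of enabling it, with a placeholder port and an empty module:

import io.ktor.server.engine.embeddedServer
import io.ktor.server.netty.Netty

fun main() {
    embeddedServer(Netty, port = 8080, configure = {
        // Reuse the worker event loop group for running application calls
        // instead of allocating a separate call group.
        shareWorkGroup = true
    }) {
        // install features / routing here
    }.start(wait = true)
}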

Hc747 commented 5 years ago

Finally, after using shareWorkGroup = true I got:

RequestBodyHandler.channelRead() | Async | 0.000  s (timestamp)
ParametersCloseAsync context = nettyWorkerPool-3-16
// -- | -GAP- | -- 
NettyApplicationCallHandler.channelRead() | Async | 3.795  s (timestamp)
ParametersCloseAsync context = nettyWorkerPool-3-17

Which means the time is lost somewhere in the selection process between the two calls above; I'm looking for ideas on how to deal with it 🤔

A potential improvement may be to switch the selector to Epoll or KQueue, but that has to be added to Ktor.

See #1124 - should be implemented soon!

spolnik commented 5 years ago

@Hc747 - awesome - thanks for sharing :)

Hc747 commented 5 years ago

@spolnik Implemented https://github.com/ktorio/ktor/commit/4c2a48e55a25edb386113eebfb9347c279730ccb

beedub commented 4 years ago

@spolnik did you ever figure out what the perf issue was?

anderssv commented 4 years ago

Hey. We're seeing what I think is exactly the same behaviour as @spolnik, and it leads to some bad situations. Did you ever solve your problem, and how? :)

oleg-larshin commented 4 years ago

Please check the following ticket on YouTrack for follow-ups to this issue. GitHub issues will be closed in the coming weeks.

stevejuma commented 3 years ago

We have experienced the same performance issue with Netty as outlined above. It manifested in Kubernetes as the service being restarted periodically every couple of days. It was difficult to reproduce the problem locally. Switching to Tomcat instead of Netty mitigated the problems we were experiencing.
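
For anyone who wants to try the same mitigation, switching engines in Ktor mostly means changing the engine factory and the corresponding dependency. A rough sketch, assuming the ktor-server-tomcat artifact is on the classpath and using a placeholder port and module:

import io.ktor.server.engine.embeddedServer
import io.ktor.server.tomcat.Tomcat // from the ktor-server-tomcat artifact

fun main() {
    // The application module stays the same; only the engine factory changes.
    embeddedServer(Tomcat, port = 8080) {
        // install features / routing here
    }.start(wait = true)
}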

AsiaMa commented 3 years ago

It seems the issue is mostly related to the sizes of the event loop groups. Do you have any recommendation for those 3 settings? Since I run a lot of asynchronous code, the recommended settings of:

embeddedServer(AnyEngine, configure = {
    connectionGroupSize = parallelism / 2 + 1
    workerGroupSize = parallelism / 2 + 1
    callGroupSize = parallelism 
})

don't work: CPU utilization is pretty low with them, and I still hit the issue described above; only when I manually increase at least callGroupSize to 16 does it start to improve.

Summarising: I now have 4 cores (parallelism = 4), 2 GB RAM, and Netty, and get much better results with

 connectionGroupSize = parallelism / 2 + 1 = 3 // default
 workerGroupSize = parallelism / 2 + 1 = 3  // default
 callGroupSize = 16  // manually set

I would need to analyse your code to understand the best ratio between those 3 settings. Again, your defaults seem not ideal, at least for Netty.

Your code may contain blocking calls.
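
If so, a common remedy is to keep blocking work (JDBC calls, file I/O, etc.) off the engine's call threads by dispatching it to a dedicated dispatcher. A generic coroutine sketch, not specific to this codebase; the route path and the saveEventBlocking helper are hypothetical:

import io.ktor.application.Application
import io.ktor.application.call
import io.ktor.response.respondText
import io.ktor.routing.post
import io.ktor.routing.routing
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext

fun Application.module() {
    routing {
        post("/events") {
            // Run the blocking save on the IO dispatcher so the Netty call
            // threads stay free to pick up new requests.
            val result = withContext(Dispatchers.IO) {
                saveEventBlocking() // hypothetical blocking call, e.g. a JDBC insert
            }
            call.respondText(result)
        }
    }
}

// Placeholder for whatever blocking operation the application performs.
fun saveEventBlocking(): String = "saved"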