anuraaga opened this issue 7 years ago
perfasm profiles (presumably I need to understand its flags to get better information, but recording these here for posterity)
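For reference, dumps like the ones below come from JMH's perfasm profiler. A typical invocation might look like the following (the jar name is a placeholder; `hotThreshold` is the option the output itself suggests for lowering the 10% disassembly filter):

```shell
# Run the JMH benchmarks with the perfasm profiler attached.
# "benchmarks.jar" stands in for the actual shaded benchmark jar.
java -jar benchmarks.jar DownstreamSimpleBenchmark \
  -prof perfasm:hotThreshold=0.05   # lower the 10% disassembly threshold to 5%
```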
armeria
....[Hottest Regions]...............................................................................
2.75% 1.29% libc-2.24.so malloc_consolidate (303 bytes)
2.26% 3.08% libc-2.24.so _int_free (251 bytes)
1.81% 1.30% libjvm.so PhaseIdealLoop::is_dominator (50 bytes)
1.60% 2.97% libc-2.24.so malloc (110 bytes)
1.18% 1.75% libc-2.24.so malloc (106 bytes)
1.15% 1.45% libfcml.so.0.0.0 fcml_fn_dt_dts_tree_element_free (102 bytes)
1.09% 1.41% libc-2.24.so _int_malloc (557 bytes)
0.99% 0.15% [kernel.kallsyms] entry_SYSCALL_64 (200 bytes)
0.82% 0.54% [kernel.kallsyms] _raw_spin_lock_irqsave (45 bytes)
0.79% 2.35% C2, level 4 java.util.regex.Pattern$Start::match, version 2533 (469 bytes)
0.71% 0.49% [kernel.kallsyms] copy_user_enhanced_fast_string (12 bytes)
0.67% 0.12% [kernel.kallsyms] mutex_unlock (24 bytes)
0.65% 0.27% [kernel.kallsyms] mutex_lock (48 bytes)
0.59% 0.40% [kernel.kallsyms] __fget (92 bytes)
0.56% 0.12% libc-2.24.so __memset_avx2_erms (286 bytes)
0.54% 0.49% [kernel.kallsyms] native_write_msr (11 bytes)
0.52% 0.10% interpreter method entry point (kind = zerolocals) (14 bytes)
0.48% 0.43% C1, level 3 java.lang.Long::toHexString, version 1714 (63 bytes)
0.47% 1.89% perf-14928.map [unknown] (272 bytes)
0.45% 0.11% libc-2.24.so free (82 bytes)
79.89% 79.27% <...other 8288 warm regions...>
....................................................................................................
100.00% 100.00% <totals>
....[Hottest Methods (after inlining)]..............................................................
2.78% 4.72% libc-2.24.so malloc
2.75% 1.29% libc-2.24.so malloc_consolidate
2.71% 5.63% perf-14928.map [unknown]
2.65% 3.53% libc-2.24.so _int_free
2.60% 3.05% libc-2.24.so _int_malloc
1.94% 1.45% libjvm.so PhaseIdealLoop::is_dominator
1.56% 2.03% C1, level 3 java.lang.Long::toHexString, version 1714
1.15% 1.45% libfcml.so.0.0.0 fcml_fn_dt_dts_tree_element_free
0.99% 0.15% [kernel.kallsyms] entry_SYSCALL_64
0.97% 0.30% interpreter method entry point (kind = zerolocals)
0.96% 1.01% C1, level 1 com.linecorp.armeria.common.stream.DefaultStreamMessage$SubscriptionImpl::subscriber, version 1848
0.82% 0.54% [kernel.kallsyms] _raw_spin_lock_irqsave
0.79% 2.35% C2, level 4 java.util.regex.Pattern$Start::match, version 2533
0.71% 0.49% [kernel.kallsyms] copy_user_enhanced_fast_string
0.67% 0.12% [kernel.kallsyms] mutex_unlock
0.65% 0.27% [kernel.kallsyms] mutex_lock
0.60% 0.63% [vdso] [unknown]
0.59% 0.40% [kernel.kallsyms] __fget
0.56% 0.12% libc-2.24.so __memset_avx2_erms
0.54% 0.49% [kernel.kallsyms] native_write_msr
72.99% 69.09% <...other 2652 warm methods...>
....................................................................................................
upstream
....[Hottest Regions]...............................................................................
1.37% 1.10% [kernel.kallsyms] _raw_spin_lock_irqsave (45 bytes)
1.14% 0.29% [kernel.kallsyms] entry_SYSCALL_64 (200 bytes)
1.12% 0.61% libc-2.24.so malloc_consolidate (303 bytes)
1.11% 0.88% [kernel.kallsyms] native_write_msr (11 bytes)
1.07% 0.88% C2, level 4 io.netty.buffer.PoolThreadCache$MemoryRegionCache::allocate, version 2676 (883 bytes)
0.96% 1.31% libc-2.24.so _int_free (249 bytes)
0.87% 0.68% [kernel.kallsyms] copy_user_enhanced_fast_string (12 bytes)
0.78% 0.43% [kernel.kallsyms] __fget (92 bytes)
0.68% 0.86% C2, level 4 io.netty.buffer.AbstractByteBufAllocator::directBuffer, version 1765 (706 bytes)
0.63% 0.55% [kernel.kallsyms] update_cfs_shares (238 bytes)
0.63% 1.28% libc-2.24.so malloc (108 bytes)
0.58% 0.55% C1, level 3 java.lang.StringBuilder::append, version 40 (63 bytes)
0.56% 1.13% [kernel.kallsyms] __lock_text_start (27 bytes)
0.55% 0.05% libjvm.so SpinPause (9 bytes)
0.54% 0.79% libc-2.24.so malloc (106 bytes)
0.51% 0.21% [kernel.kallsyms] mutex_unlock (24 bytes)
0.46% 0.46% [kernel.kallsyms] mutex_lock (48 bytes)
0.46% 0.50% [kernel.kallsyms] _raw_spin_lock (48 bytes)
0.45% 0.79% [vdso] [unknown] (28 bytes)
0.45% 0.82% C2, level 4 io.netty.channel.DefaultChannelPipeline$HeadContext::write, version 1964 (833 bytes)
85.08% 85.84% <...other 5611 warm regions...>
....................................................................................................
100.00% 100.00% <totals>
....[Hottest Methods (after inlining)]..............................................................
1.48% 1.35% C1, level 3 java.lang.StringBuilder::append, version 40
1.37% 1.10% [kernel.kallsyms] _raw_spin_lock_irqsave
1.26% 0.88% C2, level 4 io.netty.handler.codec.http2.DefaultHttp2RemoteFlowController$FlowState::writeAllocatedBytes, version 3077
1.21% 1.37% C2, level 4 io.netty.channel.nio.NioEventLoop::select, version 1825
1.17% 2.07% libc-2.24.so malloc
1.14% 0.29% [kernel.kallsyms] entry_SYSCALL_64
1.14% 0.95% C2, level 4 io.netty.buffer.PoolThreadCache$MemoryRegionCache::allocate, version 2676
1.13% 0.61% libc-2.24.so malloc_consolidate
1.11% 0.88% [kernel.kallsyms] native_write_msr
1.10% 1.49% libc-2.24.so _int_free
1.04% 1.35% libc-2.24.so _int_malloc
1.02% 1.51% C2, level 4 io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder$FrameReadListener::onHeadersRead, version 2918
0.95% 1.11% C2, level 4 io.grpc.stub.ClientCalls::blockingUnaryCall, version 3783
0.92% 1.71% C2, level 4 io.netty.handler.codec.http2.HpackDecoder::decode, version 2345
0.91% 1.16% C2, level 4 io.netty.buffer.AbstractReferenceCountedByteBuf::release, version 1480
0.87% 0.68% [kernel.kallsyms] copy_user_enhanced_fast_string
0.78% 0.43% [kernel.kallsyms] __fget
0.78% 1.01% [vdso] [unknown]
0.72% 0.98% C2, level 4 io.netty.buffer.AbstractByteBufAllocator::directBuffer, version 1765
0.70% 1.25% C2, level 4 io.netty.handler.codec.http2.HpackEncoder::encodeHeader, version 1716
79.18% 77.05% <...other 1692 warm methods...>
....................................................................................................
Armeria empty():
Result "com.linecorp.armeria.grpc.downstream.DownstreamSimpleBenchmark.empty":
13193.483 ±(99.9%) 239.014 ops/s [Average]
(min, avg, max) = (10836.195, 13193.483, 14135.097), stdev = 704.737
CI (99.9%): [12954.470, 13432.497] (assumes normal distribution)
Secondary result "com.linecorp.armeria.grpc.downstream.DownstreamSimpleBenchmark.empty:·asm":
PrintAssembly processed: 1427570 total address lines.
Perf output processed (skipped 40.098 seconds):
Column 1: cycles (258016 events)
Column 2: instructions (263820 events)
WARNING: No hottest code region above the threshold (10.00%) for disassembly.
Use "hotThreshold" profiler option to lower the filter threshold.
....[Hottest Regions]...............................................................................
1.96% 1.53% vmlinux _raw_spin_lock_irqsave (235 bytes)
1.54% 3.91% C2, level 4 java.util.regex.Pattern$Start::match, version 2455 (475 bytes)
1.22% 4.04% C2, level 4 java.util.regex.Pattern$Slice::match, version 1956 (403 bytes)
1.01% 0.27% C2, level 4 com.linecorp.armeria.client.HttpClientDelegate::execute, version 6360 (9 bytes)
1.00% 2.30% perf-25092.map [unknown] (216 bytes)
0.83% 0.89% C2, level 4 io.netty.buffer.PoolArena::allocate, version 6297 (1214 bytes)
0.80% 0.18% vmlinux entry_SYSCALL_64 (196 bytes)
0.69% 0.60% C1, level 1 com.linecorp.armeria.common.DefaultHttpData::offset, version 1975 (63 bytes)
0.68% 0.57% C2, level 4 java.lang.Class::getName, version 1214 (64 bytes)
0.67% 0.26% vmlinux __fget (98 bytes)
0.64% 0.99% vmlinux nf_iterate (146 bytes)
0.59% 1.28% C1, level 1 com.linecorp.armeria.common.DefaultHttpData::offset, version 1975 (14 bytes)
0.59% 1.58% perf-25092.map [unknown] (273 bytes)
0.56% 0.60% C2, level 4 com.linecorp.armeria.internal.ArmeriaHttpUtil::splitPathAndQuery, version 5240 (2765 bytes)
0.54% 0.36% vmlinux native_write_msr (11 bytes)
0.53% 0.50% C1, level 1 com.linecorp.armeria.common.DefaultHttpData::offset, version 1975 (64 bytes)
0.52% 0.28% vmlinux update_cfs_shares (237 bytes)
0.52% 0.66% ip_tables.ko.gz ipt_do_table (607 bytes)
0.49% 0.55% C2, level 4 io.netty.buffer.PooledByteBuf::deallocate, version 2613 (657 bytes)
0.48% 0.87% C2, level 4 io.netty.buffer.PooledByteBufAllocator::newDirectBuffer, version 3643 (752 bytes)
84.14% 77.77% <...other 4756 warm regions...>
....................................................................................................
100.00% 100.00% <totals>
....[Hottest Methods (after inlining)]..............................................................
5.28% 3.54% C2, level 4 com.linecorp.armeria.client.HttpClientDelegate::execute, version 6360
5.20% 10.13% perf-25092.map [unknown]
2.38% 3.22% C1, level 1 com.linecorp.armeria.common.DefaultHttpData::offset, version 1975
1.96% 1.53% vmlinux _raw_spin_lock_irqsave
1.54% 3.91% C2, level 4 java.util.regex.Pattern$Start::match, version 2455
1.49% 1.47% C2, level 4 com.linecorp.armeria.grpc.downstream.DownstreamSimpleBenchmark::empty, version 5931
1.47% 1.39% C2, level 4 java.lang.Class::getName, version 1214
1.22% 4.04% C2, level 4 java.util.regex.Pattern$Slice::match, version 1956
0.87% 0.95% [vdso] [unknown]
0.84% 0.90% C2, level 4 io.netty.buffer.PoolArena::allocate, version 6297
0.81% 0.72% C2, level 4 io.netty.channel.AbstractChannel$AbstractUnsafe::flush0, version 2881
0.80% 0.18% vmlinux entry_SYSCALL_64
0.75% 0.60% C2, level 4 io.netty.channel.DefaultChannelPromise::addListener, version 2472
0.70% 0.64% C2, level 4 io.netty.buffer.PooledByteBuf::deallocate, version 2613
0.67% 0.26% vmlinux __fget
0.65% 1.08% C2, level 4 io.netty.handler.codec.http2.HpackEncoder::encodeHeader, version 2940
0.64% 0.68% C2, level 4 io.netty.handler.codec.http2.DefaultHttp2RemoteFlowController$FlowState::writeAllocatedBytes, version 4510
0.64% 0.99% vmlinux nf_iterate
0.63% 0.93% ip_tables.ko.gz ipt_do_table
0.62% 0.71% C2, level 4 com.linecorp.armeria.client.HttpRequestSubscriber::write0, version 4511
70.84% 62.09% <...other 1285 warm methods...>
....................................................................................................
100.00% 99.97% <totals>
....[Distribution by Source]........................................................................
48.99% 50.73% C2, level 4
29.94% 24.27% vmlinux
5.20% 10.13% perf-25092.map
3.44% 3.86% C1, level 1
3.35% 2.69% libjvm.so
2.04% 2.27% nf_conntrack.ko.gz
1.29% 0.65% libc-2.26.so
1.03% 0.38% libpthread-2.26.so
0.87% 0.95% [vdso]
0.70% 0.41% Unknown, level 0
0.63% 0.93% ip_tables.ko.gz
0.59% 0.63% runtime stub
0.48% 0.47% C1, level 3
0.34% 0.15% libnetty_transport_native_epoll_x86_645657805159735482949.so
0.31% 0.63% nf_nat_ipv4.ko.gz
0.26% 0.26% nf_conntrack_ipv4.ko.gz
0.13% 0.26% nf_nat.ko.gz
0.12% 0.15% xt_multiport.ko.gz
0.08% 0.06% nf_defrag_ipv4.ko.gz
0.07% 0.04% iptable_filter.ko.gz
0.05% 0.09% iptable_nat.ko.gz
0.03% 0.01% nvidia.ko
0.02% 0.00% br_netfilter.ko.gz
0.02% 0.00% interpreter
0.01% 0.00% C1, level 2
0.00% xhci-hcd.ko.gz
0.00% e1000e.ko.gz
0.00% 0.00% usbcore.ko.gz
0.00% intel-cstate.ko.gz
0.00% bridge.ko.gz
0.00% ext4.ko.gz
0.00% usb-common.ko.gz
0.00% btusb.ko.gz
....................................................................................................
100.00% 100.00% <totals>
Armeria simple():
Result "com.linecorp.armeria.grpc.downstream.DownstreamSimpleBenchmark.simple":
1136.192 ±(99.9%) 15.340 ops/s [Average]
(min, avg, max) = (922.771, 1136.192, 1182.365), stdev = 45.231
CI (99.9%): [1120.852, 1151.532] (assumes normal distribution)
Secondary result "com.linecorp.armeria.grpc.downstream.DownstreamSimpleBenchmark.simple:·asm":
PrintAssembly processed: 1720336 total address lines.
Perf output processed (skipped 40.027 seconds):
Column 1: cycles (268791 events)
Column 2: instructions (265746 events)
WARNING: No hottest code region above the threshold (10.00%) for disassembly.
Use "hotThreshold" profiler option to lower the filter threshold.
....[Hottest Regions]...............................................................................
9.81% 10.12% runtime stub StubRoutines::jbyte_disjoint_arraycopy (205 bytes)
6.83% 16.91% C2, level 4 sun.nio.cs.UTF_8$Decoder::decode, version 669 (338 bytes)
5.75% 7.23% C2, level 4 com.google.protobuf.CodedInputStream$UnsafeDirectNioDecoder::readStringRequireUtf8, version 1923 (665 bytes)
5.51% 12.60% C2, level 4 com.google.protobuf.Utf8$UnsafeProcessor::partialIsValidUtf8, version 1911 (390 bytes)
4.31% 4.01% C2, level 4 com.google.protobuf.CodedInputStream$UnsafeDirectNioDecoder::readStringRequireUtf8, version 1923 (415 bytes)
2.66% 2.98% C2, level 4 com.linecorp.armeria.benchmarks.GithubApi$SearchResponse$Item::<init>, version 2120 (689 bytes)
2.58% 3.39% C2, level 4 com.linecorp.armeria.benchmarks.GithubApi$SearchResponse$Item::writeTo, version 3003 (3996 bytes)
2.09% 2.54% C2, level 4 com.linecorp.armeria.benchmarks.GithubApi$SearchResponse$Item::<init>, version 2120 (2624 bytes)
2.04% 1.97% runtime stub StubRoutines::jshort_disjoint_arraycopy (209 bytes)
1.85% 0.72% runtime stub StubRoutines::jlong_disjoint_arraycopy (145 bytes)
1.64% 0.03% vmlinux copy_user_enhanced_fast_string (16 bytes)
1.58% 1.28% libjvm.so TypeArrayKlass::allocate_common (66 bytes)
1.18% 2.14% runtime stub StubRoutines::unsafe_arraycopy (54 bytes)
1.04% 1.17% C2, level 4 com.google.protobuf.GeneratedMessageV3::writeString, version 1927 (192 bytes)
0.99% 0.28% C2, level 4 com.linecorp.armeria.client.grpc.ArmeriaClientCall::start, version 6439 (9 bytes)
0.98% 0.72% C2, level 4 com.google.protobuf.ByteString::isEmpty, version 1859 (57 bytes)
0.94% 1.64% C2, level 4 com.google.protobuf.GeneratedMessageV3::writeString, version 1927 (92 bytes)
0.65% 0.00% C2, level 4 com.linecorp.armeria.internal.grpc.ArmeriaMessageFramer::writePayload, version 6159 (111 bytes)
0.64% 0.65% C2, level 4 com.google.protobuf.GeneratedMessageV3::writeString, version 1927 (89 bytes)
0.61% 0.37% runtime stub StubRoutines::jint_disjoint_arraycopy (130 bytes)
46.32% 29.25% <...other 8152 warm regions...>
....................................................................................................
100.00% 100.00% <totals>
....[Hottest Methods (after inlining)]..............................................................
10.07% 11.24% C2, level 4 com.google.protobuf.CodedInputStream$UnsafeDirectNioDecoder::readStringRequireUtf8, version 1923
9.81% 10.12% runtime stub StubRoutines::jbyte_disjoint_arraycopy
6.83% 16.91% C2, level 4 sun.nio.cs.UTF_8$Decoder::decode, version 669
5.59% 6.62% C2, level 4 com.linecorp.armeria.benchmarks.GithubApi$SearchResponse$Item::<init>, version 2120
5.51% 12.60% C2, level 4 com.google.protobuf.Utf8$UnsafeProcessor::partialIsValidUtf8, version 1911
4.35% 2.81% C2, level 4 com.linecorp.armeria.client.grpc.ArmeriaClientCall::start, version 6439
3.73% 3.03% perf-27419.map [unknown]
3.56% 4.25% C2, level 4 com.linecorp.armeria.benchmarks.GithubApi$SearchResponse$Item::writeTo, version 3003
2.61% 3.45% C2, level 4 com.google.protobuf.GeneratedMessageV3::writeString, version 1927
2.04% 1.97% runtime stub StubRoutines::jshort_disjoint_arraycopy
1.85% 0.72% runtime stub StubRoutines::jlong_disjoint_arraycopy
1.64% 1.29% C2, level 4 com.linecorp.armeria.benchmarks.GithubApi$User::writeTo, version 3004
1.64% 0.03% vmlinux copy_user_enhanced_fast_string
1.61% 1.29% libjvm.so TypeArrayKlass::allocate_common
1.18% 2.14% runtime stub StubRoutines::unsafe_arraycopy
1.01% 0.94% C2, level 4 com.linecorp.armeria.benchmarks.GithubApi$User::<init>, version 2143
0.98% 0.72% C2, level 4 com.google.protobuf.ByteString::isEmpty, version 1859
0.83% 0.52% C1, level 3 io.netty.util.ResourceLeakDetector::track0, version 2166
0.70% 0.01% C2, level 4 com.linecorp.armeria.internal.grpc.ArmeriaMessageFramer::writePayload, version 6159
0.61% 0.37% runtime stub StubRoutines::jint_disjoint_arraycopy
33.84% 18.93% <...other 1681 warm methods...>
....................................................................................................
100.00% 99.96% <totals>
....[Distribution by Source]........................................................................
58.07% 70.31% C2, level 4
15.58% 15.42% runtime stub
12.43% 4.73% vmlinux
4.68% 3.82% libjvm.so
3.73% 3.03% perf-27419.map
1.79% 1.04% C1, level 3
0.95% 0.56% libc-2.26.so
0.54% 0.25% nf_conntrack.ko.gz
0.48% 0.15% C1, level 1
0.45% 0.16% Unknown, level 0
0.35% 0.12% libpthread-2.26.so
0.24% 0.11% [vdso]
0.16% 0.10% ip_tables.ko.gz
0.11% 0.01% interpreter
0.11% 0.03% libnetty_transport_native_epoll_x86_647019244700163940556.so
0.09% 0.07% nf_nat_ipv4.ko.gz
0.08% 0.02% nf_conntrack_ipv4.ko.gz
0.03% 0.03% nf_nat.ko.gz
0.03% 0.00% nvidia.ko
0.03% 0.00% nf_defrag_ipv4.ko.gz
0.02% 0.02% xt_multiport.ko.gz
0.02% 0.00% iptable_filter.ko.gz
0.02% 0.01% iptable_nat.ko.gz
0.01% 0.00% br_netfilter.ko.gz
0.00% xhci-hcd.ko.gz
0.00% usbcore.ko.gz
0.00% kvm.ko.gz
0.00% 0.00%
0.00% e1000e.ko.gz
0.00% libnet.so
0.00% vmnet.ko
0.00% libahci.ko.gz
0.00% bridge.ko.gz
0.00% ext4.ko.gz
0.00% bluetooth.ko.gz
0.00% 0.00% C1, level 2
0.00% aacraid.ko.gz
0.00% usb-common.ko.gz
....................................................................................................
100.00% 100.00% <totals>
Benchmark Mode Cnt Score Error Units
DownstreamSimpleBenchmark.empty thrpt 100 13193.483 ± 239.014 ops/s
DownstreamSimpleBenchmark.empty:·asm thrpt NaN ---
DownstreamSimpleBenchmark.simple thrpt 100 1136.192 ± 15.340 ops/s
DownstreamSimpleBenchmark.simple:·asm thrpt NaN ---
I'm curious what the numbers would be if we used the same client for both upstream and downstream.
Upstream empty():
Result "com.linecorp.armeria.grpc.upstream.UpstreamSimpleBenchmark.empty":
22322.742 ±(99.9%) 256.653 ops/s [Average]
(min, avg, max) = (19395.717, 22322.742, 23688.237), stdev = 756.747
CI (99.9%): [22066.089, 22579.396] (assumes normal distribution)
Secondary result "com.linecorp.armeria.grpc.upstream.UpstreamSimpleBenchmark.empty:·asm":
PrintAssembly processed: 1020749 total address lines.
Perf output processed (skipped 36.472 seconds):
Column 1: cycles (302596 events)
Column 2: instructions (312109 events)
WARNING: No hottest code region above the threshold (10.00%) for disassembly.
Use "hotThreshold" profiler option to lower the filter threshold.
....[Hottest Regions]...............................................................................
2.24% 2.46% vmlinux _raw_spin_lock_irqsave (235 bytes)
1.20% 0.48% vmlinux entry_SYSCALL_64 (196 bytes)
1.02% 1.04% C2, level 4 io.netty.buffer.PoolThreadCache$MemoryRegionCache::allocate, version 2422 (1083 bytes)
0.98% 0.66% vmlinux __fget (98 bytes)
0.90% 0.86% vmlinux update_cfs_shares (237 bytes)
0.84% 0.16% C2, level 4 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject::await, version 2565 (451 bytes)
0.81% 0.60% C1, level 3 java.util.concurrent.CountDownLatch$Sync::tryAcquireShared, version 4536 (63 bytes)
0.78% 0.99% C2, level 4 io.netty.buffer.AbstractByteBufAllocator::directBuffer, version 2993 (1030 bytes)
0.75% 0.20% C1, level 3 java.io.ObjectStreamClass::lookup, version 4506 (9 bytes)
0.73% 0.38% vmlinux native_write_msr (11 bytes)
0.73% 1.45% vmlinux _raw_spin_trylock (47 bytes)
0.67% 1.31% C2, level 4 io.netty.handler.codec.http2.HpackDecoder::indexHeader, version 1780 (1800 bytes)
0.67% 0.28% C2, level 4 io.netty.util.concurrent.SingleThreadEventExecutor::execute, version 2535 (416 bytes)
0.64% 1.42% [vdso] [unknown] (29 bytes)
0.63% 0.75% C2, level 4 io.netty.channel.nio.NioEventLoop::run, version 4399 (643 bytes)
0.59% 0.86% C2, level 4 io.grpc.stub.ClientCalls::blockingUnaryCall, version 3976 (4321 bytes)
0.57% 0.24% vmlinux cpuacct_charge (109 bytes)
0.56% 0.40% vmlinux mutex_unlock (29 bytes)
0.52% 0.64% C2, level 4 io.grpc.internal.TransportFrameUtil::toHttp2Headers, version 3221 (1483 bytes)
0.49% 0.78% C2, level 4 io.netty.handler.codec.http2.HpackDecoder::indexHeader, version 1780 (1587 bytes)
83.69% 84.04% <...other 3526 warm regions...>
....................................................................................................
100.00% 100.00% <totals>
....[Hottest Methods (after inlining)]..............................................................
3.05% 2.42% C2, level 4 sun.nio.ch.SelectorImpl::lockAndDoSelect, version 2160
2.24% 2.46% vmlinux _raw_spin_lock_irqsave
2.20% 1.17% C1, level 3 java.io.ObjectStreamClass::lookup, version 4506
1.84% 2.62% C2, level 4 io.grpc.internal.ClientCallImpl::start, version 4127
1.70% 3.21% C2, level 4 io.netty.handler.codec.http2.HpackDecoder::indexHeader, version 1780
1.63% 1.44% C1, level 3 java.util.concurrent.CountDownLatch$Sync::tryAcquireShared, version 4536
1.24% 1.51% C2, level 4 io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder$FrameReadListener::onHeadersRead, version 3367
1.20% 1.84% C2, level 4 io.grpc.stub.ClientCalls::blockingUnaryCall, version 3976
1.20% 0.48% vmlinux entry_SYSCALL_64
1.10% 1.72% [vdso] [unknown]
1.07% 1.11% C2, level 4 io.netty.buffer.PoolThreadCache$MemoryRegionCache::allocate, version 2422
1.00% 0.38% C2, level 4 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject::await, version 2565
0.98% 0.66% vmlinux __fget
0.92% 0.76% C2, level 4 io.grpc.stub.ClientCalls$ThreadlessExecutor::waitAndDrain, version 3963
0.91% 0.37% C2, level 4 io.netty.util.concurrent.SingleThreadEventExecutor::execute, version 2535
0.91% 1.07% C2, level 4 io.netty.buffer.AbstractReferenceCountedByteBuf::release, version 1631
0.90% 0.86% vmlinux update_cfs_shares
0.90% 0.71% C2, level 4 io.netty.util.concurrent.SingleThreadEventExecutor::runAllTasks, version 2478
0.89% 1.05% C2, level 4 io.netty.buffer.AbstractByteBufAllocator::directBuffer, version 2993
0.82% 1.06% C2, level 4 io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe::flush0, version 2343
73.29% 73.08% <...other 1076 warm methods...>
....................................................................................................
100.00% 99.97% <totals>
....[Distribution by Source]........................................................................
43.92% 49.12% C2, level 4
37.54% 35.51% vmlinux
3.98% 2.75% C1, level 3
3.53% 3.42% libjvm.so
2.68% 1.83% libpthread-2.26.so
1.23% 0.62% Unknown, level 0
1.17% 0.91% nf_conntrack.ko.gz
1.10% 1.72% [vdso]
1.01% 0.62% libc-2.26.so
0.75% 0.66% C1, level 1
0.74% 0.85% runtime stub
0.68% 0.72% perf-31646.map
0.64% 0.43%
0.34% 0.30% ip_tables.ko.gz
0.25% 0.25% nf_nat_ipv4.ko.gz
0.15% 0.05% nf_conntrack_ipv4.ko.gz
0.07% 0.09% xt_multiport.ko.gz
0.05% 0.10% nf_nat.ko.gz
0.05% 0.01% nf_defrag_ipv4.ko.gz
0.03% 0.02% iptable_filter.ko.gz
0.03% 0.00% nvidia.ko
0.02% 0.03% iptable_nat.ko.gz
0.02% 0.00% interpreter
0.01% 0.00% br_netfilter.ko.gz
0.00% xhci-hcd.ko.gz
0.00% e1000e.ko.gz
0.00% libahci.ko.gz
0.00% usbcore.ko.gz
0.00% 0.00% C1, level 2
0.00% ld-2.26.so
0.00% kvm.ko.gz
0.00% vmnet.ko
0.00% bridge.ko.gz
0.00% ext4.ko.gz
0.00% btusb.ko.gz
....................................................................................................
100.00% 100.00% <totals>
Upstream simple():
Result "com.linecorp.armeria.grpc.upstream.UpstreamSimpleBenchmark.simple":
1741.894 ±(99.9%) 11.321 ops/s [Average]
(min, avg, max) = (1606.254, 1741.894, 1788.062), stdev = 33.380
CI (99.9%): [1730.573, 1753.215] (assumes normal distribution)
Secondary result "com.linecorp.armeria.grpc.upstream.UpstreamSimpleBenchmark.simple:·asm":
PrintAssembly processed: 1328204 total address lines.
Perf output processed (skipped 36.927 seconds):
Column 1: cycles (268779 events)
Column 2: instructions (259372 events)
Hottest code regions (>10.00% "cycles" events):
....[Hottest Region 1]..............................................................................
C2, level 4, sun.nio.cs.UTF_8$Decoder::decode, version 671 (338 bytes)
...
....................................................................................................
11.43% 19.59% <total for region 1>
....[Hottest Regions]...............................................................................
11.43% 19.59% C2, level 4 sun.nio.cs.UTF_8$Decoder::decode, version 671 (338 bytes)
8.37% 16.02% C2, level 4 com.google.protobuf.Utf8$UnsafeProcessor::partialIsValidUtf8, version 1070 (390 bytes)
5.71% 5.10% runtime stub StubRoutines::jbyte_disjoint_arraycopy (205 bytes)
5.48% 5.54% C2, level 4 com.google.protobuf.CodedInputStream$ArrayDecoder::readStringRequireUtf8, version 1067 (379 bytes)
5.00% 0.21% vmlinux copy_user_enhanced_fast_string (16 bytes)
4.49% 4.01% C2, level 4 com.google.protobuf.CodedInputStream$ArrayDecoder::readStringRequireUtf8, version 1067 (407 bytes)
3.80% 3.45% C2, level 4 com.linecorp.armeria.benchmarks.GithubApi$SearchResponse$Item::<init>, version 1270 (2476 bytes)
3.30% 1.99% C2, level 4 com.linecorp.armeria.benchmarks.GithubApi$SearchResponse$Item::<init>, version 1270 (559 bytes)
1.90% 2.15% C2, level 4 com.google.protobuf.GeneratedMessageV3::writeString, version 1085 (338 bytes)
1.28% 1.83% C2, level 4 com.google.protobuf.GeneratedMessageV3::writeString, version 1085 (517 bytes)
1.15% 0.89% C2, level 4 com.linecorp.armeria.benchmarks.GithubApi$SearchResponse$Item::writeTo, version 1758 (4788 bytes)
1.05% 0.28% libjvm.so SpinPause (9 bytes)
1.00% 0.53% C2, level 4 com.google.protobuf.CodedInputStream$ArrayDecoder::readStringRequireUtf8, version 1067 (120 bytes)
0.72% 0.50% C2, level 4 com.google.protobuf.ByteString::isEmpty, version 1039 (57 bytes)
0.66% 0.68% perf-1033.map [unknown] (773 bytes)
0.51% 0.40% C2, level 4 com.linecorp.armeria.benchmarks.GithubApi$SearchResponse$Item::<init>, version 1270 (466 bytes)
0.50% 0.43% vmlinux _raw_spin_lock_irqsave (46 bytes)
0.46% 0.42% vmlinux native_write_msr (11 bytes)
0.41% 0.32% libjvm.so ParallelTaskTerminator::offer_termination (111 bytes)
0.34% 0.13% vmlinux __fget (98 bytes)
42.42% 35.52% <...other 5264 warm regions...>
....................................................................................................
100.00% 100.00% <totals>
....[Hottest Methods (after inlining)]..............................................................
11.43% 19.59% C2, level 4 sun.nio.cs.UTF_8$Decoder::decode, version 671
10.98% 10.08% C2, level 4 com.google.protobuf.CodedInputStream$ArrayDecoder::readStringRequireUtf8, version 1067
8.57% 6.79% C2, level 4 com.linecorp.armeria.benchmarks.GithubApi$SearchResponse$Item::<init>, version 1270
8.37% 16.02% C2, level 4 com.google.protobuf.Utf8$UnsafeProcessor::partialIsValidUtf8, version 1070
5.71% 5.10% runtime stub StubRoutines::jbyte_disjoint_arraycopy
5.00% 0.21% vmlinux copy_user_enhanced_fast_string
3.92% 3.21% perf-1033.map [unknown]
3.18% 3.98% C2, level 4 com.google.protobuf.GeneratedMessageV3::writeString, version 1085
1.51% 1.24% C2, level 4 com.linecorp.armeria.benchmarks.GithubApi$SearchResponse$Item::writeTo, version 1758
1.05% 0.28% libjvm.so SpinPause
0.85% 0.61% C2, level 4 com.linecorp.armeria.benchmarks.GithubApi$User::writeTo, version 1759
0.83% 0.57% C2, level 4 sun.nio.ch.SelectorImpl::lockAndDoSelect, version 3061
0.72% 0.50% C2, level 4 com.google.protobuf.ByteString::isEmpty, version 1039
0.54% 0.75% C2, level 4 io.netty.channel.DefaultChannelPipeline$HeadContext::flush, version 3065
0.50% 0.43% vmlinux _raw_spin_lock_irqsave
0.47% 0.44% C1, level 3 java.io.UnixFileSystem::prefixLength, version 34
0.46% 0.31% C2, level 4 io.netty.buffer.AbstractReferenceCountedByteBuf::release0, version 4297
0.46% 0.42% vmlinux native_write_msr
0.41% 0.32% libjvm.so ParallelTaskTerminator::offer_termination
0.35% 0.48% C2, level 4 io.netty.handler.codec.http2.HpackDecoder::indexHeader, version 2728
34.66% 28.59% <...other 1574 warm methods...>
....................................................................................................
100.00% 99.92% <totals>
....[Distribution by Source]........................................................................
61.44% 73.04% C2, level 4
18.64% 11.48% vmlinux
6.25% 5.32% runtime stub
4.68% 3.12% libjvm.so
3.92% 3.21% perf-1033.map
1.04% 0.93% C1, level 3
0.78% 0.57% C1, level 1
0.59% 0.29% libpthread-2.26.so
0.59% 0.54% nf_conntrack.ko.gz
0.44% 0.21% libc-2.26.so
0.38% 0.13% Unknown, level 0
0.34% 0.40% [vdso]
0.22% 0.11% libnio.so
0.18% 0.22% ip_tables.ko.gz
0.09% 0.14% nf_nat_ipv4.ko.gz
0.08% 0.05% nf_conntrack_ipv4.ko.gz
0.07% 0.04% C1, level 2
0.04% 0.08% nf_nat.ko.gz
0.04% 0.05% xt_multiport.ko.gz
0.04% 0.00% interpreter
0.03% 0.01% nvidia.ko
0.03% 0.03% iptable_nat.ko.gz
0.02% 0.01% nf_defrag_ipv4.ko.gz
0.01% 0.01% iptable_filter.ko.gz
0.01% 0.01% hsdis-amd64.so
0.01% 0.00% br_netfilter.ko.gz
0.01% libjava.so
0.00% e1000e.ko.gz
0.00% kvm.ko.gz
0.00%
0.00% vmnet.ko
0.00% xhci-hcd.ko.gz
0.00% libmanagement.so
....................................................................................................
100.00% 100.00% <totals>
Benchmark Mode Cnt Score Error Units
UpstreamSimpleBenchmark.empty thrpt 100 22322.742 ± 256.653 ops/s
UpstreamSimpleBenchmark.empty:·asm thrpt NaN ---
UpstreamSimpleBenchmark.simple thrpt 100 1741.894 ± 11.321 ops/s
UpstreamSimpleBenchmark.simple:·asm thrpt NaN ---
If https://github.com/google/protobuf/pull/3824 is accepted, performance of simple() will become much closer to upstream, but it has no effect on empty().
Ran emptyNonBlocking on a MacBook Pro from #815.
The results are complex. UpstreamSimpleBenchmark.emptyNonBlocking completes 0 requests during the benchmark run. I debugged the execution and noticed that the same event loop thread was used by both the client and the server - the continuous stream of client requests starves the server, which cannot return anything until the requests stop coming at the end of the benchmark. As this is only possible in a benchmark where client and server share one JVM, I'll try to see if I can work around it.
In the meantime, the okhttp benchmarks do not have this problem and are probably more useful. Both servers end up basically being DoS-ed by the client, but poor armeria-grpc seems to handle this situation much worse.
Benchmark (clientType) Mode Cnt Score Error Units
DownstreamSimpleBenchmark.emptyNonBlocking NORMAL thrpt 20 32838.177 ± 1469.516 ops/s
DownstreamSimpleBenchmark.emptyNonBlocking:currentRequests NORMAL thrpt 20 ≈ 0 ops/s
DownstreamSimpleBenchmark.emptyNonBlocking:numFailures NORMAL thrpt 20 ≈ 0 ops/s
DownstreamSimpleBenchmark.emptyNonBlocking:numRequests NORMAL thrpt 20 11556.174 ± 1000.864 ops/s
DownstreamSimpleBenchmark.emptyNonBlocking OKHTTP thrpt 20 114116.049 ± 6711.687 ops/s
DownstreamSimpleBenchmark.emptyNonBlocking:currentRequests OKHTTP thrpt 20 ≈ 0 ops/s
DownstreamSimpleBenchmark.emptyNonBlocking:numFailures OKHTTP thrpt 20 ≈ 0 ops/s
DownstreamSimpleBenchmark.emptyNonBlocking:numRequests OKHTTP thrpt 20 910.325 ± 752.228 ops/s
UpstreamSimpleBenchmark.emptyNonBlocking NORMAL thrpt 20 104237.020 ± 25718.438 ops/s
UpstreamSimpleBenchmark.emptyNonBlocking:currentRequests NORMAL thrpt 20 ≈ 0 ops/s
UpstreamSimpleBenchmark.emptyNonBlocking:numFailures NORMAL thrpt 20 ≈ 0 ops/s
UpstreamSimpleBenchmark.emptyNonBlocking:numRequests NORMAL thrpt 20 ≈ 0 ops/s
UpstreamSimpleBenchmark.emptyNonBlocking OKHTTP thrpt 20 111407.822 ± 6739.010 ops/s
UpstreamSimpleBenchmark.emptyNonBlocking:currentRequests OKHTTP thrpt 20 ≈ 0 ops/s
UpstreamSimpleBenchmark.emptyNonBlocking:numFailures OKHTTP thrpt 20 ≈ 0 ops/s
UpstreamSimpleBenchmark.emptyNonBlocking:numRequests OKHTTP thrpt 20 11053.295 ± 1844.917 ops/s
I found in #816 that the number of active threads is completely different for upstream and downstream, so comparing results for NORMAL will be quite difficult until that is resolved.
JFR profiles of DownstreamSimpleBenchmark.empty / emptyNonBlocking. Tried taking them with the jmh-profilers package, but ...
Possible hotspots (may be more, just a skim):

- PriorityQueue.remove - with many concurrent requests, removing from a priority queue is very slow. Possibly better to just not remove and let the DONE check skip the code when the callback is called, if there isn't a better solution.
- Stream message processing (ConcurrentLinkedQueue) - possibly due to poor JIT or poor caching. As most streams will just have a few messages (headers, body, trailers, flush), I wonder if it's possible to inline the first few nodes of the queue into StreamMessage itself.
- Media type parsing / matching (MediaType.parse) - as gRPC media types are well defined, I'll replace the parsing with string matches.
- ArmeriaHttpUtil.splitPathAndQuery (Matcher.find) - PROHIBITED_PATH_PATTERN is pretty intense; I wonder if there's a faster implementation than regex.
- HttpHeaders.get (AsciiString.contentEquals) - Netty and Armeria both intern headers, but they are different interns. Netty's may not be an explicit intern but rather a result of HPACK returning the same values for a stream. It would be nice if we could take advantage of this, though.
When going through Netty PRs, I randomly found this one, which could potentially help with gRPC streams by reducing flushing. It seems to be in limbo though.
@anuraaga We already have something similar at the beginning of the pipeline: com.linecorp.armeria.internal.FlushConsolidationHandler.
Still lacking the time to tackle this, sadly.
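For reference, the flush-consolidation idea discussed here can be sketched roughly as follows. This is a hypothetical standalone class, not Netty's io.netty.handler.flush.FlushConsolidationHandler or Armeria's internal copy: the point is just that flushes are counted and propagated at most once per batch of writes, plus once at batch end, rather than once per write.

```java
// Conceptual sketch of flush consolidation (hypothetical class):
// the application calls writeAndFlush() per message, but the expensive
// downstream flush (ultimately a syscall) only happens once per batch.
final class FlushBatcher {
    private final int explicitFlushAfter; // flush at most once per N writes
    private int unflushedWrites;
    private int flushCount; // stands in for the real downstream flush

    FlushBatcher(int explicitFlushAfter) {
        this.explicitFlushAfter = explicitFlushAfter;
    }

    /** Called for every write+flush pair the application issues. */
    void writeAndFlush() {
        unflushedWrites++;
        if (unflushedWrites >= explicitFlushAfter) {
            doFlush();
        }
    }

    /** Called when the event loop goes idle; flushes any remainder. */
    void batchEnd() {
        if (unflushedWrites > 0) {
            doFlush();
        }
    }

    private void doFlush() {
        unflushedWrites = 0;
        flushCount++;
    }

    int flushCount() {
        return flushCount;
    }
}
```

For example, 10 writes with a batch size of 4 produce 3 downstream flushes (after writes 4 and 8, plus one at batch end) instead of 10.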
Ah didn't know about that class - in that case my hypothesis about over-flushing is probably not correct :)
Ideas from @demongaorui - https://github.com/line/armeria/pull/1309#discussion_r206025752
Lately I've been looking into this again, but haven't made much progress - there isn't anything obvious in profiling. I'm hoping in the end much of the difference is still some misconfiguration of the benchmark, but I have no leads. I'd really like to be able to say armeria gRPC performance is similar to upstream - @trustin would you happen to have some time to help look into the performance?
I recently got a report about gRPC performance from an internal customer and am waiting for a detailed performance profile. That being said, I'd like to take a stab at this issue soon.
async-profiler flame graph for emptyNonBlocking(), retrieved from a beefy machine: https://gist.github.com/trustin/52c5014a04331cf4c6a452e2cd6880b3
On the other hand, I ran grpc.upstream.UpstreamSimpleBenchmark.emptyNonBlocking() on the same machine and got much lower numbers - 110k (upstream) vs 160k (downstream). I passed the -Xmx16384m -Xms16384m JVM options, which may have made the JVM happy. @anuraaga Can you reproduce the result?
Upstream seems to perform better at empty() though: it's 11k (upstream) vs 8k (downstream).
I'll try to run it too - I don't know if I've tried since disabling request timeout but that was a huge problem in my first runs.
Also, nonBlocking is basically a death test, so results varied quite a lot from run to run; that's why lately I've focused on the blocking version with many threads, which gives more stable results.
By the way, if it's not sensitive, what sort of report did you get from the internal user? I guess they may have migrated from upstream to Armeria and saw a difference in CPU usage - if so, it would be cool to hear roughly how much difference there was in a real-world migration.
They started with Armeria, so there's no comparison with upstream unfortunately. They observed high CPU usage, but another third-party library also stood out in the profile, so I'm not sure Armeria is really the bottleneck. I just want to make sure that Armeria performs close to the baseline performance.
cpu, alloc, lock and cache-misses profiles for grpc.downstream.DownstreamSimpleBenchmark.empty() with 16 threads: https://gist.github.com/trustin/395659be9d0eb3d5265cbff50a1581c5
Oh, just found that we're in TechEmpower Benchmarks: https://www.techempower.com/benchmarks/#section=data-r18&hw=ph&test=plaintext Not the best numbers but I hope it will get much better once we start optimizing seriously.
Mentioned it in #2295, but one aspect of having competitive gRPC performance will be to optimize as much of the functionality we have that is not used by RPC frameworks, for example routing.
One idea I haven't explored much is whether it's possible to have an especially optimized code path when we know the server does not support complex routing. For example, it should be possible for this object to have a smaller memory footprint when not using all the different features.
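One way this "simple routes only" fast path could look, as a rough sketch: if every registered route is an exact path (no patterns or path variables), routing degenerates to a single hash lookup instead of trying each route's matcher in turn. gRPC method paths are exact by construction, so they would always qualify. All names here are illustrative, not Armeria's actual routing API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical exact-path fast router: O(1) lookup for servers whose
// routes contain no patterns, with a fallback signal for everything else.
final class ExactPathRouter {
    private final Map<String, Function<String, String>> exactRoutes = new HashMap<>();

    /**
     * Registers a handler for an exact path. Returns false if the path
     * contains pattern syntax, so the caller can fall back to the
     * general-purpose router instead.
     */
    boolean add(String path, Function<String, String> handler) {
        if (path.indexOf('{') >= 0 || path.indexOf(':') >= 0 || path.indexOf('*') >= 0) {
            return false; // not eligible for the fast path
        }
        exactRoutes.put(path, handler);
        return true;
    }

    /** Single hash lookup instead of scanning every route's matcher. */
    Function<String, String> find(String path) {
        return exactRoutes.get(path);
    }
}
```

A real implementation would also have to account for path normalization and query stripping before the lookup, but the core saving - skipping per-route pattern matching entirely - stays the same.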
Just a brainstorm :) /cc @trustin @minwoox @ikhoon
That's a good idea. For example, we might be able to optimize a lot when there are only simple routes.
Oops, I clicked down(👎) by mistake. 😅
I noticed some reasonably large differences in performance between armeria-grpc and upstream grpc on the same business logic and did some profiling - want to capture them and maybe start an investigation. My initial suspicion is flushing on every write?
I've attached VisualVM profiles, and here's a screenshot of hot spots in the armeria version. I don't believe writevAddress, readAddress and invokeFlush0 are supposed to be so high here.
visualvm-profiles.tar.gz