anuraaga opened this issue 7 years ago
perfasm profiles (presumably I need to understand its flags to get better information, but recording these here for posterity)
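For reference, dumps like the ones below come from JMH's perfasm profiler. A typical invocation might look like the following (the jar name is a placeholder; `hotThreshold` is the option the output itself suggests for lowering the 10% disassembly filter):

```shell
# Run the JMH benchmarks with the perfasm profiler attached.
# "benchmarks.jar" stands in for the actual shaded benchmark jar.
java -jar benchmarks.jar DownstreamSimpleBenchmark \
  -prof perfasm:hotThreshold=0.05   # lower the 10% disassembly threshold to 5%
```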
armeria
....[Hottest Regions]...............................................................................
2.75% 1.29% libc-2.24.so malloc_consolidate (303 bytes)
2.26% 3.08% libc-2.24.so _int_free (251 bytes)
1.81% 1.30% libjvm.so PhaseIdealLoop::is_dominator (50 bytes)
1.60% 2.97% libc-2.24.so malloc (110 bytes)
1.18% 1.75% libc-2.24.so malloc (106 bytes)
1.15% 1.45% libfcml.so.0.0.0 fcml_fn_dt_dts_tree_element_free (102 bytes)
1.09% 1.41% libc-2.24.so _int_malloc (557 bytes)
0.99% 0.15% [kernel.kallsyms] entry_SYSCALL_64 (200 bytes)
0.82% 0.54% [kernel.kallsyms] _raw_spin_lock_irqsave (45 bytes)
0.79% 2.35% C2, level 4 java.util.regex.Pattern$Start::match, version 2533 (469 bytes)
0.71% 0.49% [kernel.kallsyms] copy_user_enhanced_fast_string (12 bytes)
0.67% 0.12% [kernel.kallsyms] mutex_unlock (24 bytes)
0.65% 0.27% [kernel.kallsyms] mutex_lock (48 bytes)
0.59% 0.40% [kernel.kallsyms] __fget (92 bytes)
0.56% 0.12% libc-2.24.so __memset_avx2_erms (286 bytes)
0.54% 0.49% [kernel.kallsyms] native_write_msr (11 bytes)
0.52% 0.10% interpreter method entry point (kind = zerolocals) (14 bytes)
0.48% 0.43% C1, level 3 java.lang.Long::toHexString, version 1714 (63 bytes)
0.47% 1.89% perf-14928.map [unknown] (272 bytes)
0.45% 0.11% libc-2.24.so free (82 bytes)
79.89% 79.27% <...other 8288 warm regions...>
....................................................................................................
100.00% 100.00% <totals>
....[Hottest Methods (after inlining)]..............................................................
2.78% 4.72% libc-2.24.so malloc
2.75% 1.29% libc-2.24.so malloc_consolidate
2.71% 5.63% perf-14928.map [unknown]
2.65% 3.53% libc-2.24.so _int_free
2.60% 3.05% libc-2.24.so _int_malloc
1.94% 1.45% libjvm.so PhaseIdealLoop::is_dominator
1.56% 2.03% C1, level 3 java.lang.Long::toHexString, version 1714
1.15% 1.45% libfcml.so.0.0.0 fcml_fn_dt_dts_tree_element_free
0.99% 0.15% [kernel.kallsyms] entry_SYSCALL_64
0.97% 0.30% interpreter method entry point (kind = zerolocals)
0.96% 1.01% C1, level 1 com.linecorp.armeria.common.stream.DefaultStreamMessage$SubscriptionImpl::subscriber, version 1848
0.82% 0.54% [kernel.kallsyms] _raw_spin_lock_irqsave
0.79% 2.35% C2, level 4 java.util.regex.Pattern$Start::match, version 2533
0.71% 0.49% [kernel.kallsyms] copy_user_enhanced_fast_string
0.67% 0.12% [kernel.kallsyms] mutex_unlock
0.65% 0.27% [kernel.kallsyms] mutex_lock
0.60% 0.63% [vdso] [unknown]
0.59% 0.40% [kernel.kallsyms] __fget
0.56% 0.12% libc-2.24.so __memset_avx2_erms
0.54% 0.49% [kernel.kallsyms] native_write_msr
72.99% 69.09% <...other 2652 warm methods...>
....................................................................................................
upstream
....[Hottest Regions]...............................................................................
1.37% 1.10% [kernel.kallsyms] _raw_spin_lock_irqsave (45 bytes)
1.14% 0.29% [kernel.kallsyms] entry_SYSCALL_64 (200 bytes)
1.12% 0.61% libc-2.24.so malloc_consolidate (303 bytes)
1.11% 0.88% [kernel.kallsyms] native_write_msr (11 bytes)
1.07% 0.88% C2, level 4 io.netty.buffer.PoolThreadCache$MemoryRegionCache::allocate, version 2676 (883 bytes)
0.96% 1.31% libc-2.24.so _int_free (249 bytes)
0.87% 0.68% [kernel.kallsyms] copy_user_enhanced_fast_string (12 bytes)
0.78% 0.43% [kernel.kallsyms] __fget (92 bytes)
0.68% 0.86% C2, level 4 io.netty.buffer.AbstractByteBufAllocator::directBuffer, version 1765 (706 bytes)
0.63% 0.55% [kernel.kallsyms] update_cfs_shares (238 bytes)
0.63% 1.28% libc-2.24.so malloc (108 bytes)
0.58% 0.55% C1, level 3 java.lang.StringBuilder::append, version 40 (63 bytes)
0.56% 1.13% [kernel.kallsyms] __lock_text_start (27 bytes)
0.55% 0.05% libjvm.so SpinPause (9 bytes)
0.54% 0.79% libc-2.24.so malloc (106 bytes)
0.51% 0.21% [kernel.kallsyms] mutex_unlock (24 bytes)
0.46% 0.46% [kernel.kallsyms] mutex_lock (48 bytes)
0.46% 0.50% [kernel.kallsyms] _raw_spin_lock (48 bytes)
0.45% 0.79% [vdso] [unknown] (28 bytes)
0.45% 0.82% C2, level 4 io.netty.channel.DefaultChannelPipeline$HeadContext::write, version 1964 (833 bytes)
85.08% 85.84% <...other 5611 warm regions...>
....................................................................................................
100.00% 100.00% <totals>
....[Hottest Methods (after inlining)]..............................................................
1.48% 1.35% C1, level 3 java.lang.StringBuilder::append, version 40
1.37% 1.10% [kernel.kallsyms] _raw_spin_lock_irqsave
1.26% 0.88% C2, level 4 io.netty.handler.codec.http2.DefaultHttp2RemoteFlowController$FlowState::writeAllocatedBytes, version 3077
1.21% 1.37% C2, level 4 io.netty.channel.nio.NioEventLoop::select, version 1825
1.17% 2.07% libc-2.24.so malloc
1.14% 0.29% [kernel.kallsyms] entry_SYSCALL_64
1.14% 0.95% C2, level 4 io.netty.buffer.PoolThreadCache$MemoryRegionCache::allocate, version 2676
1.13% 0.61% libc-2.24.so malloc_consolidate
1.11% 0.88% [kernel.kallsyms] native_write_msr
1.10% 1.49% libc-2.24.so _int_free
1.04% 1.35% libc-2.24.so _int_malloc
1.02% 1.51% C2, level 4 io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder$FrameReadListener::onHeadersRead, version 2918
0.95% 1.11% C2, level 4 io.grpc.stub.ClientCalls::blockingUnaryCall, version 3783
0.92% 1.71% C2, level 4 io.netty.handler.codec.http2.HpackDecoder::decode, version 2345
0.91% 1.16% C2, level 4 io.netty.buffer.AbstractReferenceCountedByteBuf::release, version 1480
0.87% 0.68% [kernel.kallsyms] copy_user_enhanced_fast_string
0.78% 0.43% [kernel.kallsyms] __fget
0.78% 1.01% [vdso] [unknown]
0.72% 0.98% C2, level 4 io.netty.buffer.AbstractByteBufAllocator::directBuffer, version 1765
0.70% 1.25% C2, level 4 io.netty.handler.codec.http2.HpackEncoder::encodeHeader, version 1716
79.18% 77.05% <...other 1692 warm methods...>
....................................................................................................
Armeria empty():
Result "com.linecorp.armeria.grpc.downstream.DownstreamSimpleBenchmark.empty":
13193.483 ±(99.9%) 239.014 ops/s [Average]
(min, avg, max) = (10836.195, 13193.483, 14135.097), stdev = 704.737
CI (99.9%): [12954.470, 13432.497] (assumes normal distribution)
Secondary result "com.linecorp.armeria.grpc.downstream.DownstreamSimpleBenchmark.empty:·asm":
PrintAssembly processed: 1427570 total address lines.
Perf output processed (skipped 40.098 seconds):
Column 1: cycles (258016 events)
Column 2: instructions (263820 events)
WARNING: No hottest code region above the threshold (10.00%) for disassembly.
Use "hotThreshold" profiler option to lower the filter threshold.
....[Hottest Regions]...............................................................................
1.96% 1.53% vmlinux _raw_spin_lock_irqsave (235 bytes)
1.54% 3.91% C2, level 4 java.util.regex.Pattern$Start::match, version 2455 (475 bytes)
1.22% 4.04% C2, level 4 java.util.regex.Pattern$Slice::match, version 1956 (403 bytes)
1.01% 0.27% C2, level 4 com.linecorp.armeria.client.HttpClientDelegate::execute, version 6360 (9 bytes)
1.00% 2.30% perf-25092.map [unknown] (216 bytes)
0.83% 0.89% C2, level 4 io.netty.buffer.PoolArena::allocate, version 6297 (1214 bytes)
0.80% 0.18% vmlinux entry_SYSCALL_64 (196 bytes)
0.69% 0.60% C1, level 1 com.linecorp.armeria.common.DefaultHttpData::offset, version 1975 (63 bytes)
0.68% 0.57% C2, level 4 java.lang.Class::getName, version 1214 (64 bytes)
0.67% 0.26% vmlinux __fget (98 bytes)
0.64% 0.99% vmlinux nf_iterate (146 bytes)
0.59% 1.28% C1, level 1 com.linecorp.armeria.common.DefaultHttpData::offset, version 1975 (14 bytes)
0.59% 1.58% perf-25092.map [unknown] (273 bytes)
0.56% 0.60% C2, level 4 com.linecorp.armeria.internal.ArmeriaHttpUtil::splitPathAndQuery, version 5240 (2765 bytes)
0.54% 0.36% vmlinux native_write_msr (11 bytes)
0.53% 0.50% C1, level 1 com.linecorp.armeria.common.DefaultHttpData::offset, version 1975 (64 bytes)
0.52% 0.28% vmlinux update_cfs_shares (237 bytes)
0.52% 0.66% ip_tables.ko.gz ipt_do_table (607 bytes)
0.49% 0.55% C2, level 4 io.netty.buffer.PooledByteBuf::deallocate, version 2613 (657 bytes)
0.48% 0.87% C2, level 4 io.netty.buffer.PooledByteBufAllocator::newDirectBuffer, version 3643 (752 bytes)
84.14% 77.77% <...other 4756 warm regions...>
....................................................................................................
100.00% 100.00% <totals>
....[Hottest Methods (after inlining)]..............................................................
5.28% 3.54% C2, level 4 com.linecorp.armeria.client.HttpClientDelegate::execute, version 6360
5.20% 10.13% perf-25092.map [unknown]
2.38% 3.22% C1, level 1 com.linecorp.armeria.common.DefaultHttpData::offset, version 1975
1.96% 1.53% vmlinux _raw_spin_lock_irqsave
1.54% 3.91% C2, level 4 java.util.regex.Pattern$Start::match, version 2455
1.49% 1.47% C2, level 4 com.linecorp.armeria.grpc.downstream.DownstreamSimpleBenchmark::empty, version 5931
1.47% 1.39% C2, level 4 java.lang.Class::getName, version 1214
1.22% 4.04% C2, level 4 java.util.regex.Pattern$Slice::match, version 1956
0.87% 0.95% [vdso] [unknown]
0.84% 0.90% C2, level 4 io.netty.buffer.PoolArena::allocate, version 6297
0.81% 0.72% C2, level 4 io.netty.channel.AbstractChannel$AbstractUnsafe::flush0, version 2881
0.80% 0.18% vmlinux entry_SYSCALL_64
0.75% 0.60% C2, level 4 io.netty.channel.DefaultChannelPromise::addListener, version 2472
0.70% 0.64% C2, level 4 io.netty.buffer.PooledByteBuf::deallocate, version 2613
0.67% 0.26% vmlinux __fget
0.65% 1.08% C2, level 4 io.netty.handler.codec.http2.HpackEncoder::encodeHeader, version 2940
0.64% 0.68% C2, level 4 io.netty.handler.codec.http2.DefaultHttp2RemoteFlowController$FlowState::writeAllocatedBytes, version 4510
0.64% 0.99% vmlinux nf_iterate
0.63% 0.93% ip_tables.ko.gz ipt_do_table
0.62% 0.71% C2, level 4 com.linecorp.armeria.client.HttpRequestSubscriber::write0, version 4511
70.84% 62.09% <...other 1285 warm methods...>
....................................................................................................
100.00% 99.97% <totals>
....[Distribution by Source]........................................................................
48.99% 50.73% C2, level 4
29.94% 24.27% vmlinux
5.20% 10.13% perf-25092.map
3.44% 3.86% C1, level 1
3.35% 2.69% libjvm.so
2.04% 2.27% nf_conntrack.ko.gz
1.29% 0.65% libc-2.26.so
1.03% 0.38% libpthread-2.26.so
0.87% 0.95% [vdso]
0.70% 0.41% Unknown, level 0
0.63% 0.93% ip_tables.ko.gz
0.59% 0.63% runtime stub
0.48% 0.47% C1, level 3
0.34% 0.15% libnetty_transport_native_epoll_x86_645657805159735482949.so
0.31% 0.63% nf_nat_ipv4.ko.gz
0.26% 0.26% nf_conntrack_ipv4.ko.gz
0.13% 0.26% nf_nat.ko.gz
0.12% 0.15% xt_multiport.ko.gz
0.08% 0.06% nf_defrag_ipv4.ko.gz
0.07% 0.04% iptable_filter.ko.gz
0.05% 0.09% iptable_nat.ko.gz
0.03% 0.01% nvidia.ko
0.02% 0.00% br_netfilter.ko.gz
0.02% 0.00% interpreter
0.01% 0.00% C1, level 2
0.00% xhci-hcd.ko.gz
0.00% e1000e.ko.gz
0.00% 0.00% usbcore.ko.gz
0.00% intel-cstate.ko.gz
0.00% bridge.ko.gz
0.00% ext4.ko.gz
0.00% usb-common.ko.gz
0.00% btusb.ko.gz
....................................................................................................
100.00% 100.00% <totals>
Armeria simple():
Result "com.linecorp.armeria.grpc.downstream.DownstreamSimpleBenchmark.simple":
1136.192 ±(99.9%) 15.340 ops/s [Average]
(min, avg, max) = (922.771, 1136.192, 1182.365), stdev = 45.231
CI (99.9%): [1120.852, 1151.532] (assumes normal distribution)
Secondary result "com.linecorp.armeria.grpc.downstream.DownstreamSimpleBenchmark.simple:·asm":
PrintAssembly processed: 1720336 total address lines.
Perf output processed (skipped 40.027 seconds):
Column 1: cycles (268791 events)
Column 2: instructions (265746 events)
WARNING: No hottest code region above the threshold (10.00%) for disassembly.
Use "hotThreshold" profiler option to lower the filter threshold.
....[Hottest Regions]...............................................................................
9.81% 10.12% runtime stub StubRoutines::jbyte_disjoint_arraycopy (205 bytes)
6.83% 16.91% C2, level 4 sun.nio.cs.UTF_8$Decoder::decode, version 669 (338 bytes)
5.75% 7.23% C2, level 4 com.google.protobuf.CodedInputStream$UnsafeDirectNioDecoder::readStringRequireUtf8, version 1923 (665 bytes)
5.51% 12.60% C2, level 4 com.google.protobuf.Utf8$UnsafeProcessor::partialIsValidUtf8, version 1911 (390 bytes)
4.31% 4.01% C2, level 4 com.google.protobuf.CodedInputStream$UnsafeDirectNioDecoder::readStringRequireUtf8, version 1923 (415 bytes)
2.66% 2.98% C2, level 4 com.linecorp.armeria.benchmarks.GithubApi$SearchResponse$Item::<init>, version 2120 (689 bytes)
2.58% 3.39% C2, level 4 com.linecorp.armeria.benchmarks.GithubApi$SearchResponse$Item::writeTo, version 3003 (3996 bytes)
2.09% 2.54% C2, level 4 com.linecorp.armeria.benchmarks.GithubApi$SearchResponse$Item::<init>, version 2120 (2624 bytes)
2.04% 1.97% runtime stub StubRoutines::jshort_disjoint_arraycopy (209 bytes)
1.85% 0.72% runtime stub StubRoutines::jlong_disjoint_arraycopy (145 bytes)
1.64% 0.03% vmlinux copy_user_enhanced_fast_string (16 bytes)
1.58% 1.28% libjvm.so TypeArrayKlass::allocate_common (66 bytes)
1.18% 2.14% runtime stub StubRoutines::unsafe_arraycopy (54 bytes)
1.04% 1.17% C2, level 4 com.google.protobuf.GeneratedMessageV3::writeString, version 1927 (192 bytes)
0.99% 0.28% C2, level 4 com.linecorp.armeria.client.grpc.ArmeriaClientCall::start, version 6439 (9 bytes)
0.98% 0.72% C2, level 4 com.google.protobuf.ByteString::isEmpty, version 1859 (57 bytes)
0.94% 1.64% C2, level 4 com.google.protobuf.GeneratedMessageV3::writeString, version 1927 (92 bytes)
0.65% 0.00% C2, level 4 com.linecorp.armeria.internal.grpc.ArmeriaMessageFramer::writePayload, version 6159 (111 bytes)
0.64% 0.65% C2, level 4 com.google.protobuf.GeneratedMessageV3::writeString, version 1927 (89 bytes)
0.61% 0.37% runtime stub StubRoutines::jint_disjoint_arraycopy (130 bytes)
46.32% 29.25% <...other 8152 warm regions...>
....................................................................................................
100.00% 100.00% <totals>
....[Hottest Methods (after inlining)]..............................................................
10.07% 11.24% C2, level 4 com.google.protobuf.CodedInputStream$UnsafeDirectNioDecoder::readStringRequireUtf8, version 1923
9.81% 10.12% runtime stub StubRoutines::jbyte_disjoint_arraycopy
6.83% 16.91% C2, level 4 sun.nio.cs.UTF_8$Decoder::decode, version 669
5.59% 6.62% C2, level 4 com.linecorp.armeria.benchmarks.GithubApi$SearchResponse$Item::<init>, version 2120
5.51% 12.60% C2, level 4 com.google.protobuf.Utf8$UnsafeProcessor::partialIsValidUtf8, version 1911
4.35% 2.81% C2, level 4 com.linecorp.armeria.client.grpc.ArmeriaClientCall::start, version 6439
3.73% 3.03% perf-27419.map [unknown]
3.56% 4.25% C2, level 4 com.linecorp.armeria.benchmarks.GithubApi$SearchResponse$Item::writeTo, version 3003
2.61% 3.45% C2, level 4 com.google.protobuf.GeneratedMessageV3::writeString, version 1927
2.04% 1.97% runtime stub StubRoutines::jshort_disjoint_arraycopy
1.85% 0.72% runtime stub StubRoutines::jlong_disjoint_arraycopy
1.64% 1.29% C2, level 4 com.linecorp.armeria.benchmarks.GithubApi$User::writeTo, version 3004
1.64% 0.03% vmlinux copy_user_enhanced_fast_string
1.61% 1.29% libjvm.so TypeArrayKlass::allocate_common
1.18% 2.14% runtime stub StubRoutines::unsafe_arraycopy
1.01% 0.94% C2, level 4 com.linecorp.armeria.benchmarks.GithubApi$User::<init>, version 2143
0.98% 0.72% C2, level 4 com.google.protobuf.ByteString::isEmpty, version 1859
0.83% 0.52% C1, level 3 io.netty.util.ResourceLeakDetector::track0, version 2166
0.70% 0.01% C2, level 4 com.linecorp.armeria.internal.grpc.ArmeriaMessageFramer::writePayload, version 6159
0.61% 0.37% runtime stub StubRoutines::jint_disjoint_arraycopy
33.84% 18.93% <...other 1681 warm methods...>
....................................................................................................
100.00% 99.96% <totals>
....[Distribution by Source]........................................................................
58.07% 70.31% C2, level 4
15.58% 15.42% runtime stub
12.43% 4.73% vmlinux
4.68% 3.82% libjvm.so
3.73% 3.03% perf-27419.map
1.79% 1.04% C1, level 3
0.95% 0.56% libc-2.26.so
0.54% 0.25% nf_conntrack.ko.gz
0.48% 0.15% C1, level 1
0.45% 0.16% Unknown, level 0
0.35% 0.12% libpthread-2.26.so
0.24% 0.11% [vdso]
0.16% 0.10% ip_tables.ko.gz
0.11% 0.01% interpreter
0.11% 0.03% libnetty_transport_native_epoll_x86_647019244700163940556.so
0.09% 0.07% nf_nat_ipv4.ko.gz
0.08% 0.02% nf_conntrack_ipv4.ko.gz
0.03% 0.03% nf_nat.ko.gz
0.03% 0.00% nvidia.ko
0.03% 0.00% nf_defrag_ipv4.ko.gz
0.02% 0.02% xt_multiport.ko.gz
0.02% 0.00% iptable_filter.ko.gz
0.02% 0.01% iptable_nat.ko.gz
0.01% 0.00% br_netfilter.ko.gz
0.00% xhci-hcd.ko.gz
0.00% usbcore.ko.gz
0.00% kvm.ko.gz
0.00% 0.00%
0.00% e1000e.ko.gz
0.00% libnet.so
0.00% vmnet.ko
0.00% libahci.ko.gz
0.00% bridge.ko.gz
0.00% ext4.ko.gz
0.00% bluetooth.ko.gz
0.00% 0.00% C1, level 2
0.00% aacraid.ko.gz
0.00% usb-common.ko.gz
....................................................................................................
100.00% 100.00% <totals>
Benchmark Mode Cnt Score Error Units
DownstreamSimpleBenchmark.empty thrpt 100 13193.483 ± 239.014 ops/s
DownstreamSimpleBenchmark.empty:·asm thrpt NaN ---
DownstreamSimpleBenchmark.simple thrpt 100 1136.192 ± 15.340 ops/s
DownstreamSimpleBenchmark.simple:·asm thrpt NaN ---
I'm curious what the numbers would be if we used the same client for both upstream and downstream.
Upstream empty():
Result "com.linecorp.armeria.grpc.upstream.UpstreamSimpleBenchmark.empty":
22322.742 ±(99.9%) 256.653 ops/s [Average]
(min, avg, max) = (19395.717, 22322.742, 23688.237), stdev = 756.747
CI (99.9%): [22066.089, 22579.396] (assumes normal distribution)
Secondary result "com.linecorp.armeria.grpc.upstream.UpstreamSimpleBenchmark.empty:·asm":
PrintAssembly processed: 1020749 total address lines.
Perf output processed (skipped 36.472 seconds):
Column 1: cycles (302596 events)
Column 2: instructions (312109 events)
WARNING: No hottest code region above the threshold (10.00%) for disassembly.
Use "hotThreshold" profiler option to lower the filter threshold.
....[Hottest Regions]...............................................................................
2.24% 2.46% vmlinux _raw_spin_lock_irqsave (235 bytes)
1.20% 0.48% vmlinux entry_SYSCALL_64 (196 bytes)
1.02% 1.04% C2, level 4 io.netty.buffer.PoolThreadCache$MemoryRegionCache::allocate, version 2422 (1083 bytes)
0.98% 0.66% vmlinux __fget (98 bytes)
0.90% 0.86% vmlinux update_cfs_shares (237 bytes)
0.84% 0.16% C2, level 4 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject::await, version 2565 (451 bytes)
0.81% 0.60% C1, level 3 java.util.concurrent.CountDownLatch$Sync::tryAcquireShared, version 4536 (63 bytes)
0.78% 0.99% C2, level 4 io.netty.buffer.AbstractByteBufAllocator::directBuffer, version 2993 (1030 bytes)
0.75% 0.20% C1, level 3 java.io.ObjectStreamClass::lookup, version 4506 (9 bytes)
0.73% 0.38% vmlinux native_write_msr (11 bytes)
0.73% 1.45% vmlinux _raw_spin_trylock (47 bytes)
0.67% 1.31% C2, level 4 io.netty.handler.codec.http2.HpackDecoder::indexHeader, version 1780 (1800 bytes)
0.67% 0.28% C2, level 4 io.netty.util.concurrent.SingleThreadEventExecutor::execute, version 2535 (416 bytes)
0.64% 1.42% [vdso] [unknown] (29 bytes)
0.63% 0.75% C2, level 4 io.netty.channel.nio.NioEventLoop::run, version 4399 (643 bytes)
0.59% 0.86% C2, level 4 io.grpc.stub.ClientCalls::blockingUnaryCall, version 3976 (4321 bytes)
0.57% 0.24% vmlinux cpuacct_charge (109 bytes)
0.56% 0.40% vmlinux mutex_unlock (29 bytes)
0.52% 0.64% C2, level 4 io.grpc.internal.TransportFrameUtil::toHttp2Headers, version 3221 (1483 bytes)
0.49% 0.78% C2, level 4 io.netty.handler.codec.http2.HpackDecoder::indexHeader, version 1780 (1587 bytes)
83.69% 84.04% <...other 3526 warm regions...>
....................................................................................................
100.00% 100.00% <totals>
....[Hottest Methods (after inlining)]..............................................................
3.05% 2.42% C2, level 4 sun.nio.ch.SelectorImpl::lockAndDoSelect, version 2160
2.24% 2.46% vmlinux _raw_spin_lock_irqsave
2.20% 1.17% C1, level 3 java.io.ObjectStreamClass::lookup, version 4506
1.84% 2.62% C2, level 4 io.grpc.internal.ClientCallImpl::start, version 4127
1.70% 3.21% C2, level 4 io.netty.handler.codec.http2.HpackDecoder::indexHeader, version 1780
1.63% 1.44% C1, level 3 java.util.concurrent.CountDownLatch$Sync::tryAcquireShared, version 4536
1.24% 1.51% C2, level 4 io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder$FrameReadListener::onHeadersRead, version 3367
1.20% 1.84% C2, level 4 io.grpc.stub.ClientCalls::blockingUnaryCall, version 3976
1.20% 0.48% vmlinux entry_SYSCALL_64
1.10% 1.72% [vdso] [unknown]
1.07% 1.11% C2, level 4 io.netty.buffer.PoolThreadCache$MemoryRegionCache::allocate, version 2422
1.00% 0.38% C2, level 4 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject::await, version 2565
0.98% 0.66% vmlinux __fget
0.92% 0.76% C2, level 4 io.grpc.stub.ClientCalls$ThreadlessExecutor::waitAndDrain, version 3963
0.91% 0.37% C2, level 4 io.netty.util.concurrent.SingleThreadEventExecutor::execute, version 2535
0.91% 1.07% C2, level 4 io.netty.buffer.AbstractReferenceCountedByteBuf::release, version 1631
0.90% 0.86% vmlinux update_cfs_shares
0.90% 0.71% C2, level 4 io.netty.util.concurrent.SingleThreadEventExecutor::runAllTasks, version 2478
0.89% 1.05% C2, level 4 io.netty.buffer.AbstractByteBufAllocator::directBuffer, version 2993
0.82% 1.06% C2, level 4 io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe::flush0, version 2343
73.29% 73.08% <...other 1076 warm methods...>
....................................................................................................
100.00% 99.97% <totals>
....[Distribution by Source]........................................................................
43.92% 49.12% C2, level 4
37.54% 35.51% vmlinux
3.98% 2.75% C1, level 3
3.53% 3.42% libjvm.so
2.68% 1.83% libpthread-2.26.so
1.23% 0.62% Unknown, level 0
1.17% 0.91% nf_conntrack.ko.gz
1.10% 1.72% [vdso]
1.01% 0.62% libc-2.26.so
0.75% 0.66% C1, level 1
0.74% 0.85% runtime stub
0.68% 0.72% perf-31646.map
0.64% 0.43%
0.34% 0.30% ip_tables.ko.gz
0.25% 0.25% nf_nat_ipv4.ko.gz
0.15% 0.05% nf_conntrack_ipv4.ko.gz
0.07% 0.09% xt_multiport.ko.gz
0.05% 0.10% nf_nat.ko.gz
0.05% 0.01% nf_defrag_ipv4.ko.gz
0.03% 0.02% iptable_filter.ko.gz
0.03% 0.00% nvidia.ko
0.02% 0.03% iptable_nat.ko.gz
0.02% 0.00% interpreter
0.01% 0.00% br_netfilter.ko.gz
0.00% xhci-hcd.ko.gz
0.00% e1000e.ko.gz
0.00% libahci.ko.gz
0.00% usbcore.ko.gz
0.00% 0.00% C1, level 2
0.00% ld-2.26.so
0.00% kvm.ko.gz
0.00% vmnet.ko
0.00% bridge.ko.gz
0.00% ext4.ko.gz
0.00% btusb.ko.gz
....................................................................................................
100.00% 100.00% <totals>
Upstream simple():
Result "com.linecorp.armeria.grpc.upstream.UpstreamSimpleBenchmark.simple":
1741.894 ±(99.9%) 11.321 ops/s [Average]
(min, avg, max) = (1606.254, 1741.894, 1788.062), stdev = 33.380
CI (99.9%): [1730.573, 1753.215] (assumes normal distribution)
Secondary result "com.linecorp.armeria.grpc.upstream.UpstreamSimpleBenchmark.simple:·asm":
PrintAssembly processed: 1328204 total address lines.
Perf output processed (skipped 36.927 seconds):
Column 1: cycles (268779 events)
Column 2: instructions (259372 events)
Hottest code regions (>10.00% "cycles" events):
....[Hottest Region 1]..............................................................................
C2, level 4, sun.nio.cs.UTF_8$Decoder::decode, version 671 (338 bytes)
...
....................................................................................................
11.43% 19.59% <total for region 1>
....[Hottest Regions]...............................................................................
11.43% 19.59% C2, level 4 sun.nio.cs.UTF_8$Decoder::decode, version 671 (338 bytes)
8.37% 16.02% C2, level 4 com.google.protobuf.Utf8$UnsafeProcessor::partialIsValidUtf8, version 1070 (390 bytes)
5.71% 5.10% runtime stub StubRoutines::jbyte_disjoint_arraycopy (205 bytes)
5.48% 5.54% C2, level 4 com.google.protobuf.CodedInputStream$ArrayDecoder::readStringRequireUtf8, version 1067 (379 bytes)
5.00% 0.21% vmlinux copy_user_enhanced_fast_string (16 bytes)
4.49% 4.01% C2, level 4 com.google.protobuf.CodedInputStream$ArrayDecoder::readStringRequireUtf8, version 1067 (407 bytes)
3.80% 3.45% C2, level 4 com.linecorp.armeria.benchmarks.GithubApi$SearchResponse$Item::<init>, version 1270 (2476 bytes)
3.30% 1.99% C2, level 4 com.linecorp.armeria.benchmarks.GithubApi$SearchResponse$Item::<init>, version 1270 (559 bytes)
1.90% 2.15% C2, level 4 com.google.protobuf.GeneratedMessageV3::writeString, version 1085 (338 bytes)
1.28% 1.83% C2, level 4 com.google.protobuf.GeneratedMessageV3::writeString, version 1085 (517 bytes)
1.15% 0.89% C2, level 4 com.linecorp.armeria.benchmarks.GithubApi$SearchResponse$Item::writeTo, version 1758 (4788 bytes)
1.05% 0.28% libjvm.so SpinPause (9 bytes)
1.00% 0.53% C2, level 4 com.google.protobuf.CodedInputStream$ArrayDecoder::readStringRequireUtf8, version 1067 (120 bytes)
0.72% 0.50% C2, level 4 com.google.protobuf.ByteString::isEmpty, version 1039 (57 bytes)
0.66% 0.68% perf-1033.map [unknown] (773 bytes)
0.51% 0.40% C2, level 4 com.linecorp.armeria.benchmarks.GithubApi$SearchResponse$Item::<init>, version 1270 (466 bytes)
0.50% 0.43% vmlinux _raw_spin_lock_irqsave (46 bytes)
0.46% 0.42% vmlinux native_write_msr (11 bytes)
0.41% 0.32% libjvm.so ParallelTaskTerminator::offer_termination (111 bytes)
0.34% 0.13% vmlinux __fget (98 bytes)
42.42% 35.52% <...other 5264 warm regions...>
....................................................................................................
100.00% 100.00% <totals>
....[Hottest Methods (after inlining)]..............................................................
11.43% 19.59% C2, level 4 sun.nio.cs.UTF_8$Decoder::decode, version 671
10.98% 10.08% C2, level 4 com.google.protobuf.CodedInputStream$ArrayDecoder::readStringRequireUtf8, version 1067
8.57% 6.79% C2, level 4 com.linecorp.armeria.benchmarks.GithubApi$SearchResponse$Item::<init>, version 1270
8.37% 16.02% C2, level 4 com.google.protobuf.Utf8$UnsafeProcessor::partialIsValidUtf8, version 1070
5.71% 5.10% runtime stub StubRoutines::jbyte_disjoint_arraycopy
5.00% 0.21% vmlinux copy_user_enhanced_fast_string
3.92% 3.21% perf-1033.map [unknown]
3.18% 3.98% C2, level 4 com.google.protobuf.GeneratedMessageV3::writeString, version 1085
1.51% 1.24% C2, level 4 com.linecorp.armeria.benchmarks.GithubApi$SearchResponse$Item::writeTo, version 1758
1.05% 0.28% libjvm.so SpinPause
0.85% 0.61% C2, level 4 com.linecorp.armeria.benchmarks.GithubApi$User::writeTo, version 1759
0.83% 0.57% C2, level 4 sun.nio.ch.SelectorImpl::lockAndDoSelect, version 3061
0.72% 0.50% C2, level 4 com.google.protobuf.ByteString::isEmpty, version 1039
0.54% 0.75% C2, level 4 io.netty.channel.DefaultChannelPipeline$HeadContext::flush, version 3065
0.50% 0.43% vmlinux _raw_spin_lock_irqsave
0.47% 0.44% C1, level 3 java.io.UnixFileSystem::prefixLength, version 34
0.46% 0.31% C2, level 4 io.netty.buffer.AbstractReferenceCountedByteBuf::release0, version 4297
0.46% 0.42% vmlinux native_write_msr
0.41% 0.32% libjvm.so ParallelTaskTerminator::offer_termination
0.35% 0.48% C2, level 4 io.netty.handler.codec.http2.HpackDecoder::indexHeader, version 2728
34.66% 28.59% <...other 1574 warm methods...>
....................................................................................................
100.00% 99.92% <totals>
....[Distribution by Source]........................................................................
61.44% 73.04% C2, level 4
18.64% 11.48% vmlinux
6.25% 5.32% runtime stub
4.68% 3.12% libjvm.so
3.92% 3.21% perf-1033.map
1.04% 0.93% C1, level 3
0.78% 0.57% C1, level 1
0.59% 0.29% libpthread-2.26.so
0.59% 0.54% nf_conntrack.ko.gz
0.44% 0.21% libc-2.26.so
0.38% 0.13% Unknown, level 0
0.34% 0.40% [vdso]
0.22% 0.11% libnio.so
0.18% 0.22% ip_tables.ko.gz
0.09% 0.14% nf_nat_ipv4.ko.gz
0.08% 0.05% nf_conntrack_ipv4.ko.gz
0.07% 0.04% C1, level 2
0.04% 0.08% nf_nat.ko.gz
0.04% 0.05% xt_multiport.ko.gz
0.04% 0.00% interpreter
0.03% 0.01% nvidia.ko
0.03% 0.03% iptable_nat.ko.gz
0.02% 0.01% nf_defrag_ipv4.ko.gz
0.01% 0.01% iptable_filter.ko.gz
0.01% 0.01% hsdis-amd64.so
0.01% 0.00% br_netfilter.ko.gz
0.01% libjava.so
0.00% e1000e.ko.gz
0.00% kvm.ko.gz
0.00%
0.00% vmnet.ko
0.00% xhci-hcd.ko.gz
0.00% libmanagement.so
....................................................................................................
100.00% 100.00% <totals>
Benchmark Mode Cnt Score Error Units
UpstreamSimpleBenchmark.empty thrpt 100 22322.742 ± 256.653 ops/s
UpstreamSimpleBenchmark.empty:·asm thrpt NaN ---
UpstreamSimpleBenchmark.simple thrpt 100 1741.894 ± 11.321 ops/s
UpstreamSimpleBenchmark.simple:·asm thrpt NaN ---
If https://github.com/google/protobuf/pull/3824 is accepted, performance of simple() will become much closer to upstream, but it has no effect on empty().
Ran emptyNonBlocking on a MacBook Pro from #815.
The results are complex. UpstreamSimpleBenchmark.emptyNonBlocking completes 0 requests during the benchmark run. I debugged the execution and noticed that the same event loop thread was used by both the client and the server - the continuous stream of client requests starves the server, which cannot return anything until the requests stop coming at the end of the benchmark. As this is only possible in a benchmark where client and server share one JVM, I'll try to see if I can work around it.
In the meantime, the okhttp benchmarks do not have this problem and are probably more useful. Both servers end up basically being DoS-ed by the client, but poor armeria-grpc seems to handle this situation much worse.
Benchmark (clientType) Mode Cnt Score Error Units
DownstreamSimpleBenchmark.emptyNonBlocking NORMAL thrpt 20 32838.177 ± 1469.516 ops/s
DownstreamSimpleBenchmark.emptyNonBlocking:currentRequests NORMAL thrpt 20 ≈ 0 ops/s
DownstreamSimpleBenchmark.emptyNonBlocking:numFailures NORMAL thrpt 20 ≈ 0 ops/s
DownstreamSimpleBenchmark.emptyNonBlocking:numRequests NORMAL thrpt 20 11556.174 ± 1000.864 ops/s
DownstreamSimpleBenchmark.emptyNonBlocking OKHTTP thrpt 20 114116.049 ± 6711.687 ops/s
DownstreamSimpleBenchmark.emptyNonBlocking:currentRequests OKHTTP thrpt 20 ≈ 0 ops/s
DownstreamSimpleBenchmark.emptyNonBlocking:numFailures OKHTTP thrpt 20 ≈ 0 ops/s
DownstreamSimpleBenchmark.emptyNonBlocking:numRequests OKHTTP thrpt 20 910.325 ± 752.228 ops/s
UpstreamSimpleBenchmark.emptyNonBlocking NORMAL thrpt 20 104237.020 ± 25718.438 ops/s
UpstreamSimpleBenchmark.emptyNonBlocking:currentRequests NORMAL thrpt 20 ≈ 0 ops/s
UpstreamSimpleBenchmark.emptyNonBlocking:numFailures NORMAL thrpt 20 ≈ 0 ops/s
UpstreamSimpleBenchmark.emptyNonBlocking:numRequests NORMAL thrpt 20 ≈ 0 ops/s
UpstreamSimpleBenchmark.emptyNonBlocking OKHTTP thrpt 20 111407.822 ± 6739.010 ops/s
UpstreamSimpleBenchmark.emptyNonBlocking:currentRequests OKHTTP thrpt 20 ≈ 0 ops/s
UpstreamSimpleBenchmark.emptyNonBlocking:numFailures OKHTTP thrpt 20 ≈ 0 ops/s
UpstreamSimpleBenchmark.emptyNonBlocking:numRequests OKHTTP thrpt 20 11053.295 ± 1844.917 ops/s
I found in #816 that the number of active threads is completely different for upstream and downstream, so comparing results for NORMAL will be quite difficult until that is resolved.
JFR profiles of DownstreamSimpleBenchmark.empty / emptyNonBlocking. Tried taking them with the jmh-profilers package, but ...
Possible hotspots (may be more, just a skim):

- PriorityQueue.remove - with many concurrent requests, removing from a priority queue is very slow. Possibly better to just not remove and let the DONE check skip the code when the callback is called, if there isn't a better solution.
- Stream message processing (ConcurrentLinkedQueue) - possibly due to poor JIT or poor caching. As most streams will just have a few messages (headers, body, trailers, flush), I wonder if it's possible to inline the first few nodes of the queue into StreamMessage itself.
- Media type parsing / matching (MediaType.parse) - as gRPC media types are well defined, I'll replace the parsing with string matches.
- ArmeriaHttpUtil.splitPathAndQuery (Matcher.find) - PROHIBITED_PATH_PATTERN is pretty intense; I wonder if there's a faster implementation than regex.
- HttpHeaders.get (AsciiString.contentEquals) - Netty and Armeria both intern headers, but they are different interns. Netty's may not be an explicit intern but rather a result of HPACK returning the same values for a stream. It would be nice if we could take advantage of this, though.
When going through Netty PRs, I randomly found this one, which could potentially help with gRPC streams by reducing flushing. It seems to be in limbo though.
@anuraaga We already have something similar at the beginning of the pipeline: com.linecorp.armeria.internal.FlushConsolidationHandler.
Still lacking the time to tackle this, sadly.
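For reference, the flush-consolidation idea discussed here can be sketched roughly as follows. This is a hypothetical standalone class, not Netty's io.netty.handler.flush.FlushConsolidationHandler or Armeria's internal copy: the point is just that flushes are counted and propagated at most once per batch of writes, plus once at batch end, rather than once per write.

```java
// Conceptual sketch of flush consolidation (hypothetical class):
// the application calls writeAndFlush() per message, but the expensive
// downstream flush (ultimately a syscall) only happens once per batch.
final class FlushBatcher {
    private final int explicitFlushAfter; // flush at most once per N writes
    private int unflushedWrites;
    private int flushCount; // stands in for the real downstream flush

    FlushBatcher(int explicitFlushAfter) {
        this.explicitFlushAfter = explicitFlushAfter;
    }

    /** Called for every write+flush pair the application issues. */
    void writeAndFlush() {
        unflushedWrites++;
        if (unflushedWrites >= explicitFlushAfter) {
            doFlush();
        }
    }

    /** Called when the event loop goes idle; flushes any remainder. */
    void batchEnd() {
        if (unflushedWrites > 0) {
            doFlush();
        }
    }

    private void doFlush() {
        unflushedWrites = 0;
        flushCount++;
    }

    int flushCount() {
        return flushCount;
    }
}
```

For example, 10 writes with a batch size of 4 produce 3 downstream flushes (after writes 4 and 8, plus one at batch end) instead of 10.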
Ah didn't know about that class - in that case my hypothesis about over-flushing is probably not correct :)
Ideas from @demongaorui - https://github.com/line/armeria/pull/1309#discussion_r206025752
Lately I've been looking into this again, but haven't made much progress - there isn't anything obvious in profiling. I'm hoping in the end much of the difference is still some misconfiguration of the benchmark, but I have no leads. I'd really like to be able to say armeria gRPC performance is similar to upstream - @trustin would you happen to have some time to help look into the performance?
I recently got a report about gRPC performance from an internal customer and am waiting for a detailed performance profile. That being said, I'd like to take a stab at this issue soon.
async-profiler flame graph for emptyNonBlocking(), retrieved from a beefy machine: https://gist.github.com/trustin/52c5014a04331cf4c6a452e2cd6880b3
On the other hand, I ran grpc.upstream.UpstreamSimpleBenchmark.emptyNonBlocking() on the same machine and got much lower numbers - 110k (upstream) vs 160k (downstream). I passed the -Xmx16384m -Xms16384m JVM options, which may have made the JVM happy. @anuraaga Can you reproduce the result?
Upstream seems to perform better at empty() though: it's 11k (upstream) vs 8k (downstream).
I'll try to run it too - I don't know if I've tried since disabling request timeout but that was a huge problem in my first runs.
Also, nonBlocking is basically a death test, so results varied quite a lot from run to run; that's why lately I've focused on the blocking version with many threads, which gives more stable results.
By the way, if it's not sensitive, what sort of report did you get from the internal user? I guess they may have migrated from upstream to Armeria and saw a difference in CPU usage - if so, it would be cool to hear roughly how much difference there was in a real-world migration.
They started with Armeria, so there's no comparison with upstream unfortunately. They observed high CPU usage, but another third-party library also stood out in the profile, so I'm not sure Armeria is really the bottleneck. I just want to make sure that Armeria performs close to the baseline performance.
cpu, alloc, lock and cache-misses profiles for grpc.downstream.DownstreamSimpleBenchmark.empty() with 16 threads: https://gist.github.com/trustin/395659be9d0eb3d5265cbff50a1581c5
Oh, just found that we're in TechEmpower Benchmarks: https://www.techempower.com/benchmarks/#section=data-r18&hw=ph&test=plaintext Not the best numbers but I hope it will get much better once we start optimizing seriously.
Mentioned it in #2295, but one aspect of having competitive gRPC performance will be to optimize as much of the functionality we have that is not used by RPC frameworks, for example routing.
One idea I haven't explored much is whether it's possible to have an especially optimized code path when we know the server does not support complex routing. For example, it should be possible for this object to have a smaller memory footprint when not using all the different features.
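One way this "simple routes only" fast path could look, as a rough sketch: if every registered route is an exact path (no patterns or path variables), routing degenerates to a single hash lookup instead of trying each route's matcher in turn. gRPC method paths are exact by construction, so they would always qualify. All names here are illustrative, not Armeria's actual routing API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical exact-path fast router: O(1) lookup for servers whose
// routes contain no patterns, with a fallback signal for everything else.
final class ExactPathRouter {
    private final Map<String, Function<String, String>> exactRoutes = new HashMap<>();

    /**
     * Registers a handler for an exact path. Returns false if the path
     * contains pattern syntax, so the caller can fall back to the
     * general-purpose router instead.
     */
    boolean add(String path, Function<String, String> handler) {
        if (path.indexOf('{') >= 0 || path.indexOf(':') >= 0 || path.indexOf('*') >= 0) {
            return false; // not eligible for the fast path
        }
        exactRoutes.put(path, handler);
        return true;
    }

    /** Single hash lookup instead of scanning every route's matcher. */
    Function<String, String> find(String path) {
        return exactRoutes.get(path);
    }
}
```

A real implementation would also have to account for path normalization and query stripping before the lookup, but the core saving - skipping per-route pattern matching entirely - stays the same.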
Just a brainstorm :) /cc @trustin @minwoox @ikhoon
That's a good idea. For example, we might be able to optimize a lot when there are only simple routes.
Oops, I clicked down(👎) by mistake. 😅
I noticed some reasonably large differences in performance between armeria-grpc and upstream grpc on the same business logic and did some profiling - want to capture them and maybe start an investigation. My initial suspicion is flushing on every write?
I've attached VisualVM profiles, and here's a screenshot of hot spots in the armeria version. I don't believe writevAddress, readAddress and invokeFlush0 are supposed to be so high here.
visualvm-profiles.tar.gz