hyperledger / besu

An enterprise-grade Java-based, Apache 2.0 licensed Ethereum client https://wiki.hyperledger.org/display/besu
https://www.hyperledger.org/projects/besu
Apache License 2.0
1.52k stars 839 forks source link

High Fork Rate since v22.7.3 #4500

Closed gitmewai closed 2 years ago

gitmewai commented 2 years ago

Description

High fork rate detected on besu since v22.7.3 (and the same at v22.7.5)

Acceptance Criteria

High fork rate is not detected before besu v22.7.2 or earlier

Steps to Reproduce (Bug)

  1. Upgrade node to besu v22.7.3 v22.7.4 or v22.7.5
  2. collectd reports high fork rate
  3. Fallback to v22.7.2 and high fork rate will not be observed

Expected behavior: Fork rate should not be high

Actual behavior: Fork rate is high with the recent releases Goerli testnet running v22.7.3 since 26-Sep-2022: image

Mainnet just upgraded to v22.7.3 on 27-Sep-2022, then fallback to v22.7.2 on the same day, upgraded to v22.7.5 since 7-Oct-2022: image

Frequency: Occurs on all releases since v22.7.3

Versions (Add all that apply)

Smart contract information (If you're reporting an issue arising from deploying or calling a smart contract, please supply related information)

Additional Information (Add any of the following or anything else that may be relevant)

Besu config:

data-path="/home/besu/besu-data" # Path
logging="INFO"
color-enabled=false
p2p-enabled=true
p2p-host="x.x.x.x"
p2p-port=30303
max-peers=30
host-allowlist=["all"]
engine-host-allowlist=["all"]
engine-jwt-enabled=true
engine-jwt-secret="/etc/eth2/jwtsecret"
network="goerli"
sync-mode="full"
rpc-http-enabled=true
rpc-http-host="x.x.x.x"
rpc-http-port=8545
engine-rpc-enabled=true
engine-rpc-port=8551
rpc-http-api=["ADMIN","ETH","NET","PRIV","TRACE","WEB3","TXPOOL"]
rpc-http-apis=["ADMIN","ETH","NET","PRIV","TRACE","WEB3","TXPOOL"]
rpc-http-cors-origins=["*"]
rpc-http-authentication-enabled=false
privacy-tls-enabled=false
graphql-http-enabled=false
rpc-ws-enabled=false
metrics-enabled=false
miner-enabled=false
pruning-enabled=false
permissions-nodes-config-file-enabled=false
privacy-enabled=false

Teku config:

network: "prater"
p2p-enabled: True
p2p-interface: "x.x.x.x"
p2p-port: 9000
p2p-discovery-enabled: True
p2p-advertised-ip: "x.x.x.x"
p2p-advertised-port: 9000
initial-state: /var/lib/teku/beacon/genesis.ssz
eth1-endpoints: ["http://x.x.x.x:8545"]
ee-endpoint: "http://x.x.x.x:8551"
ee-jwt-secret-file: "/etc/eth2/jwtsecret"
log-color-enabled: False
log-include-events-enabled: True
log-destination: "both"
log-file: "/var/lib/teku/logs/teku.log"
log-file-name-pattern: "teku_%d{yyyy-MM-dd}.log"
metrics-enabled: True
metrics-categories: ["BEACON", "LIBP2P", "NETWORK", "EVENTBUS", "JVM", "PROCESS"]
metrics-interface: "127.0.0.1"
metrics-port: 8008
data-base-path: "/var/lib/teku"
data-beacon-path: "/var/lib/teku/beacon"
data-storage-archive-frequency: 1
data-storage-mode: "archive"
rest-api-port: 5051
rest-api-docs-enabled: False
rest-api-enabled: True
rest-api-interface: "127.0.0.1"

pstree -glpt ### for besu and teku processes

             ├─java(10580,10580)─┬─{C1 CompilerThre}(10608,10580)
             │                   ├─{C2 CompilerThre}(10607,10580)
             │                   ├─{Common-Cleaner}(10610,10580)
             │                   ├─{EthScheduler-Ti}(10673,10580)
             │                   ├─{Finalizer}(10604,10580)
             │                   ├─{ForkJoinPool.co}(10711,10580)
             │                   ├─{ForkJoinPool.co}(12993,10580)
             │                   ├─{ForkJoinPool.co}(16401,10580)
             │                   ├─{G1 Conc#0}(10599,10580)
             │                   ├─{G1 Main Marker}(10598,10580)
             │                   ├─{G1 Refine#0}(10600,10580)
             │                   ├─{G1 Young RemSet}(10601,10580)
             │                   ├─{GC Thread#0}(10597,10580)
             │                   ├─{GC Thread#1}(10613,10580)
             │                   ├─{GC Thread#2}(10614,10580)
             │                   ├─{GC Thread#3}(10615,10580)
             │                   ├─{JNA Cleaner}(10618,10580)
             │                   ├─{Reference Handl}(10603,10580)
             │                   ├─{Service Thread}(10606,10580)
             │                   ├─{Signal Dispatch}(10605,10580)
             │                   ├─{Sweeper thread}(10609,10580)
             │                   ├─{Timer-0}(10674,10580)
             │                   ├─{VM Periodic Tas}(10611,10580)
             │                   ├─{VM Thread}(10602,10580)
             │                   ├─{dnsjava NIO sel}(10675,10580)
             │                   ├─{java}(10596,10580)
             │                   ├─{java}(10664,10580)
             │                   ├─{nioEventLoopGro}(10676,10580)
             │                   ├─{nioEventLoopGro}(10687,10580)
             │                   ├─{nioEventLoopGro}(10690,10580)
             │                   ├─{nioEventLoopGro}(10702,10580)
             │                   ├─{nioEventLoopGro}(10703,10580)
             │                   ├─{nioEventLoopGro}(10704,10580)
             │                   ├─{nioEventLoopGro}(10705,10580)
             │                   ├─{nioEventLoopGro}(10706,10580)
             │                   ├─{nioEventLoopGro}(10707,10580)
             │                   ├─{nioEventLoopGro}(10708,10580)
             │                   ├─{nioEventLoopGro}(10709,10580)
             │                   ├─{pool-8-thread-1}(10685,10580)
             │                   ├─{rocksdb:high}(10633,10580)
             │                   ├─{rocksdb:low}(10629,10580)
             │                   ├─{rocksdb:low}(10630,10580)
             │                   ├─{rocksdb:low}(10631,10580)
             │                   ├─{rocksdb:low}(10632,10580)
             │                   ├─{vert.x-acceptor}(10672,10580)
             │                   ├─{vert.x-eventloo}(10666,10580)
             │                   ├─{vert.x-eventloo}(10667,10580)
             │                   ├─{vert.x-eventloo}(10669,10580)
             │                   ├─{vert.x-eventloo}(10671,10580)
             │                   ├─{vert.x-eventloo}(10692,10580)
             │                   ├─{vert.x-eventloo}(10713,10580)
             │                   ├─{vert.x-eventloo}(10730,10580)
             │                   ├─{vert.x-eventloo}(10742,10580)
             │                   ├─{vert.x-eventloo}(10747,10580)
             │                   ├─{vert.x-eventloo}(10781,10580)
             │                   ├─{vert.x-eventloo}(10910,10580)
             │                   ├─{vert.x-eventloo}(10926,10580)
             │                   ├─{vert.x-eventloo}(26942,10580)
             │                   ├─{vert.x-worker-t}(10678,10580)
             │                   ├─{vert.x-worker-t}(10679,10580)
             │                   ├─{vert.x-worker-t}(10680,10580)
             │                   ├─{vert.x-worker-t}(10681,10580)
             │                   ├─{vert.x-worker-t}(10682,10580)
             │                   ├─{vert.x-worker-t}(10683,10580)
             │                   ├─{vert.x-worker-t}(10684,10580)
             │                   ├─{vert.x-worker-t}(10686,10580)
             │                   ├─{vert.x-worker-t}(10688,10580)
             │                   ├─{vert.x-worker-t}(10689,10580)
             │                   ├─{vert.x-worker-t}(10691,10580)
             │                   ├─{vert.x-worker-t}(10693,10580)
             │                   ├─{vert.x-worker-t}(10694,10580)
             │                   ├─{vert.x-worker-t}(10695,10580)
             │                   ├─{vert.x-worker-t}(10696,10580)
             │                   ├─{vert.x-worker-t}(10697,10580)
             │                   ├─{vert.x-worker-t}(10698,10580)
             │                   ├─{vert.x-worker-t}(10699,10580)
             │                   ├─{vert.x-worker-t}(10700,10580)
             │                   ├─{vert.x-worker-t}(10701,10580)
             │                   ├─{vert.x-worker-t}(10712,10580)
             │                   ├─{vertx-blocked-t}(10619,10580)
             │                   ├─{vertx-blocked-t}(10668,10580)
             │                   └─{vertx-blocked-t}(10670,10580)
             ├─java(747,747)─┬─{AggregatingSign}(3122,747)
             │               ├─{AggregatingSign}(3123,747)
             │               ├─{BlockGossipChan}(3182,747)
             │               ├─{BlockImportChan}(2792,747)
             │               ├─{BlockImportNoti}(2789,747)
             │               ├─{BlockImportNoti}(2793,747)
             │               ├─{C1 CompilerThre}(1157,747)
             │               ├─{C2 CompilerThre}(1156,747)
             │               ├─{ChainHeadChanne}(2849,747)
             │               ├─{ChainHeadChanne}(2871,747)
             │               ├─{ChainHeadChanne}(3120,747)
             │               ├─{ChainHeadChanne}(3202,747)
             │               ├─{CombinedStorage}(2213,747)
             │               ├─{Common-Cleaner}(1160,747)
             │               ├─{CompletableFutu}(2231,747)
             │               ├─{Connector-Sched}(4436,747)
             │               ├─{Eth1DepositStor}(2214,747)
             │               ├─{Eth1EventsChann}(2216,747)
             │               ├─{Eth1EventsChann}(2780,747)
             │               ├─{ExecutionLayerC}(2219,747)
             │               ├─{FinalizedCheckp}(2781,747)
             │               ├─{FinalizedCheckp}(2788,747)
             │               ├─{FinalizedCheckp}(2790,747)
             │               ├─{FinalizedCheckp}(2872,747)
             │               ├─{Finalizer}(1152,747)
             │               ├─{ForkJoinPool.co}(2651,747)
             │               ├─{ForkJoinPool.co}(2652,747)
             │               ├─{ForkJoinPool.co}(2653,747)
             │               ├─{G1 Conc#0}(1147,747)
             │               ├─{G1 Main Marker}(1146,747)
             │               ├─{G1 Refine#0}(1148,747)
             │               ├─{G1 Refine#1}(1181,747)
             │               ├─{G1 Refine#2}(1182,747)
             │               ├─{G1 Refine#3}(2458,747)
             │               ├─{G1 Young RemSet}(1149,747)
             │               ├─{GC Thread#0}(1145,747)
             │               ├─{GC Thread#1}(1178,747)
             │               ├─{GC Thread#2}(1179,747)
             │               ├─{GC Thread#3}(1180,747)
             │               ├─{GossipRouter-ev}(2827,747)
             │               ├─{OkHttp TaskRunn}(20012,747)
             │               ├─{OkHttp TaskRunn}(28340,747)
             │               ├─{Okio Watchdog}(2794,747)
             │               ├─{QuartzScheduler}(2225,747)
             │               ├─{Reference Handl}(1151,747)
             │               ├─{RejectedExecuti}(3206,747)
             │               ├─{RejectedExecuti}(3337,747)
             │               ├─{RxCachedWorkerP}(1400,747)
             │               ├─{RxSchedulerPurg}(1399,747)
             │               ├─{Service Thread}(1155,747)
             │               ├─{Session-HouseKe}(3187,747)
             │               ├─{Signal Dispatch}(1154,747)
             │               ├─{SlotEventsChann}(2218,747)
             │               ├─{SlotEventsChann}(2459,747)
             │               ├─{SlotEventsChann}(2775,747)
             │               ├─{SlotEventsChann}(2785,747)
             │               ├─{SlotEventsChann}(2786,747)
             │               ├─{SlotEventsChann}(2791,747)
             │               ├─{SlotEventsChann}(2797,747)
             │               ├─{SlotEventsChann}(2798,747)
             │               ├─{SlotEventsChann}(2847,747)
             │               ├─{SlotEventsChann}(2848,747)
             │               ├─{SlotEventsChann}(2851,747)
             │               ├─{SlotEventsChann}(2853,747)
             │               ├─{SlotEventsChann}(2855,747)
             │               ├─{SlotEventsChann}(2857,747)
             │               ├─{Sweeper thread}(1158,747)
             │               ├─{TimeTick-1}(2224,747)
             │               ├─{VM Periodic Tas}(1159,747)
             │               ├─{VM Thread}(1150,747)
             │               ├─{ValidatorApiCha}(2854,747)
             │               ├─{ValidatorTiming}(3200,747)
             │               ├─{VoteUpdateChann}(2217,747)
             │               ├─{beaconchain-asy}(2227,747)
             │               ├─{beaconchain-asy}(2457,747)
             │               ├─{beaconchain-asy}(2462,747)
             │               ├─{beaconchain-asy}(2464,747)
             │               ├─{beaconchain-asy}(2472,747)
             │               ├─{beaconchain-asy}(3141,747)
             │               ├─{discovery-clien}(3191,747)
             │               ├─{discovery-expir}(3151,747)
             │               ├─{discovery-tasks}(3149,747)
             │               ├─{forkChoiceNotif}(2771,747)
             │               ├─{forkchoice-asyn}(2765,747)
             │               ├─{java}(1143,747)
             │               ├─{java}(2199,747)
             │               ├─{nioEventLoopGro}(3136,747)
             │               ├─{nioEventLoopGro}(3140,747)
             │               ├─{nioEventLoopGro}(3152,747)
             │               ├─{nioEventLoopGro}(3153,747)
             │               ├─{nioEventLoopGro}(3154,747)
             │               ├─{nioEventLoopGro}(3195,747)
             │               ├─{nioEventLoopGro}(3235,747)
             │               ├─{nioEventLoopGro}(3298,747)
             │               ├─{nioEventLoopGro}(3344,747)
             │               ├─{nioEventLoopGro}(3381,747)
             │               ├─{p2p-async-0}(3183,747)
             │               ├─{p2p-async-1}(3220,747)
             │               ├─{p2p-async-2}(3234,747)
             │               ├─{p2p-async-3}(3238,747)
             │               ├─{p2p-async-4}(3247,747)
             │               ├─{p2p-async-5}(3248,747)
             │               ├─{p2p-async-6}(3252,747)
             │               ├─{p2p-async-7}(3255,747)
             │               ├─{p2p-async-8}(3284,747)
             │               ├─{p2p-async-9}(3288,747)
             │               ├─{p2p-async-sched}(3170,747)
             │               ├─{pool-3-thread-2}(32120,747)
             │               ├─{pool-3-thread-2}(5570,747)
             │               ├─{pool-3-thread-2}(3128,747)
             │               ├─{powchain-async-}(3198,747)
             │               ├─{powchain-async-}(3205,747)
             │               ├─{powchain-async-}(3219,747)
             │               ├─{powchain-async-}(3223,747)
             │               ├─{powchain-async-}(31914,747)
             │               ├─{powchain-async-}(30222,747)
             │               ├─{qtp1543664490-9}(3168,747)
             │               ├─{qtp1543664490-9}(3171,747)
             │               ├─{qtp1543664490-9}(3172,747)
             │               ├─{qtp1543664490-9}(3175,747)
             │               ├─{qtp1543664490-9}(3176,747)
             │               ├─{qtp1543664490-9}(3177,747)
             │               ├─{qtp1543664490-9}(3178,747)
             │               ├─{qtp1543664490-9}(3179,747)
             │               ├─{storage_query-a}(6094,747)
             │               ├─{storage_query-a}(6310,747)
             │               ├─{storage_query-a}(8570,747)
             │               ├─{storage_query-a}(8608,747)
             │               ├─{storage_query-a}(11179,747)
             │               ├─{storage_query-a}(11296,747)
             │               ├─{storage_query-a}(18059,747)
             │               ├─{storage_query-a}(18930,747)
             │               ├─{storage_query-a}(18957,747)
             │               ├─{storage_query-a}(19624,747)
             │               ├─{sync-async-0}(3236,747)
             │               ├─{validator-async}(3201,747)
             │               ├─{validator-async}(3204,747)
             │               ├─{validator-async}(3456,747)
             │               ├─{validator-async}(3461,747)
             │               ├─{validator-async}(3552,747)
             │               ├─{validator-async}(3632,747)
             │               ├─{vert.x-acceptor}(1407,747)
             │               ├─{vert.x-eventloo}(1408,747)
             │               └─{vertx-blocked-t}(1349,747)
gitmewai commented 2 years ago

Please note that the issue persists at the latest v22.7.6 release

ahamlat commented 2 years ago

Thanks for reporting @gitmewai. The metric displayed by Collectd is the one in /proc/stat, that we can also display with vmstat -f. I was able to reproduce, I'm looking into this to understand what causes this metric to grow that fast.

gitmewai commented 2 years ago

This is what we observed from the system log @ahamlat :

Oct 11 18:15:09 besu-node besu[847]: 2022-10-11 18:15:09.656+00:00 | vertx-blocked-thread-checker | WARN  | BlockedThreadChecker | Thread Thread[vert.x-worker-thread-14,5,main] has been blocked for 20195242 ms, time limit is 60000 ms
Oct 11 18:15:09 besu-node besu[847]: io.vertx.core.VertxException: Thread blocked
Oct 11 18:15:09 besu-node besu[847]:         at java.base@11.0.16/jdk.internal.misc.Unsafe.park(Native Method)
Oct 11 18:15:09 besu-node besu[847]:         at java.base@11.0.16/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
Oct 11 18:15:09 besu-node besu[847]:         at java.base@11.0.16/java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1796)
Oct 11 18:15:09 besu-node besu[847]:         at java.base@11.0.16/java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3128)
Oct 11 18:15:09 besu-node besu[847]:         at java.base@11.0.16/java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1823)
Oct 11 18:15:09 besu-node besu[847]:         at java.base@11.0.16/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1998)
Oct 11 18:15:09 besu-node besu[847]:         at app//org.hyperledger.besu.ethereum.api.jsonrpc.internal.methods.ExecutionEngineJsonRpcMethod.response(ExecutionEngineJsonRpcMethod.java:89)
Oct 11 18:15:09 besu-node besu[847]:         at app//org.hyperledger.besu.ethereum.api.jsonrpc.execution.BaseJsonRpcProcessor.process(BaseJsonRpcProcessor.java:42)
Oct 11 18:15:09 besu-node besu[847]:         at app//org.hyperledger.besu.ethereum.api.jsonrpc.execution.TracedJsonRpcProcessor.process(TracedJsonRpcProcessor.java:41)
Oct 11 18:15:09 besu-node besu[847]:         at app//org.hyperledger.besu.ethereum.api.jsonrpc.execution.TimedJsonRpcProcessor.process(TimedJsonRpcProcessor.java:45)
Oct 11 18:15:09 besu-node besu[847]:         at app//org.hyperledger.besu.ethereum.api.jsonrpc.execution.AuthenticatedJsonRpcProcessor.process(AuthenticatedJsonRpcProcessor.java:51)
Oct 11 18:15:09 besu-node besu[847]:         at app//org.hyperledger.besu.ethereum.api.jsonrpc.execution.JsonRpcExecutor.execute(JsonRpcExecutor.java:91)
Oct 11 18:15:09 besu-node besu[847]:         at app//org.hyperledger.besu.ethereum.api.handlers.JsonRpcExecutorHandler.lambda$handler$8(JsonRpcExecutorHandler.java:79)
Oct 11 18:15:09 besu-node besu[847]:         at app//org.hyperledger.besu.ethereum.api.handlers.JsonRpcExecutorHandler$$Lambda$760/0x0000000840509c40.handle(Unknown Source)
Oct 11 18:15:09 besu-node besu[847]:         at app//io.vertx.ext.web.impl.BlockingHandlerDecorator.lambda$handle$0(BlockingHandlerDecorator.java:48)
Oct 11 18:15:09 besu-node besu[847]:         at app//io.vertx.ext.web.impl.BlockingHandlerDecorator$$Lambda$1098/0x000000084069c040.handle(Unknown Source)
Oct 11 18:15:09 besu-node besu[847]:         at app//io.vertx.core.impl.ContextImpl.lambda$null$0(ContextImpl.java:159)
Oct 11 18:15:09 besu-node besu[847]:         at app//io.vertx.core.impl.ContextImpl$$Lambda$975/0x0000000840612c40.handle(Unknown Source)
Oct 11 18:15:09 besu-node besu[847]:         at app//io.vertx.core.impl.AbstractContext.dispatch(AbstractContext.java:100)
Oct 11 18:15:09 besu-node besu[847]:         at app//io.vertx.core.impl.ContextImpl.lambda$executeBlocking$1(ContextImpl.java:157)
Oct 11 18:15:09 besu-node besu[847]:         at app//io.vertx.core.impl.ContextImpl$$Lambda$972/0x0000000840613840.run(Unknown Source)
Oct 11 18:15:09 besu-node besu[847]:         at java.base@11.0.16/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
Oct 11 18:15:09 besu-node besu[847]:         at java.base@11.0.16/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
Oct 11 18:15:09 besu-node besu[847]:         at app//io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
Oct 11 18:15:09 besu-node besu[847]:         at java.base@11.0.16/java.lang.Thread.run(Thread.java:829)
ahamlat commented 2 years ago

I don't think it is related @gitmewai. The log you showed just tells us that there is an engine RPC call taking a very long time, so the CompletableFuture is blocked on get method waiting for the Async call to complete. I think we should improve that and add a timeout to avoid blocking for a very long time. The only impact I was able to see with these forks was on context switches and interrupts which makes sens.

image image
ahamlat commented 2 years ago

@gitmewai We found the root cause of these fork processes, they're actually very short living threads created in our transactions and workers ThreadPoolExecutors. By doing Wall clock profiling with Async Profiler, I was able to see very short lived threads, with this name pattern : EthScheduler-Transactions-id.

image

In this code line, We're creating a ThreadPoolExecutor with coreSize=0 and keepAliveTime=0, this means for each execution, a new thread is created. By changing coreSize parameter, we can see less fork processes (threads in this case) :

image