pokt-network / poktroll

The official Shannon upgrade of the Pocket Network Protocol, implemented using the Cosmos SDK
MIT License

[Demand Scalability] Permissionless demand load testing & validation #742

Open Olshansk opened 2 months ago

Olshansk commented 2 months ago

Objective

Ensure the network can manage permissionless gateways, applications, services and other types of demand.

Origin Document

Goals

Deliverables

Non-goals / Non-deliverables

General deliverables


Creator: @Olshansk Co-Owners: @okdas

Olshansk commented 2 months ago

Update from @okdas

Hey, I wanted to share a quick update on the permissionless demand load testing effort.
- I decided to do all testing on our TestNet. We've got gateway and supplier infrastructure deployed there, and it currently handles just hundreds of requests.
- I don't have any interesting visuals yet, but there are some findings:
    - The validator does consume a lot of resources, but that may be a result of the large number of RPC requests hitting the validator endpoint.
        - I'm going to point that endpoint at the full node so the validator will only validate.
        - There might also be some room for improvement in how the gateway/relayminer queries the data. Will check.
    - Gateways crash often. It might be a resource constraint, but since we are going to move to a different gateway (PATH), I'll throw more resources at them instead of performing a deep troubleshooting investigation.
    - Some of the blocks were pretty large for the amount of traffic (2.5 MiB). Will investigate and post findings tomorrow. (Recent example block: https://shannon.testnet.pokt.network/poktroll/block/10297)
- Currently in the process of deploying an indexer so we can also get more insight.
- I had issues creating a lot of services from one address: the same `account sequence mismatch, expected *, got *: incorrect account sequence` issue.
    - For some reason our CLI ignores the `--sequence=` argument.
    - Cosmos SDK 0.51 will have unordered transactions, rendering this a non-issue in the future.
    - A current workaround is creating many addresses, funding them with a multi-send, and adding services from many accounts at the same time (see the sketch below).
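For illustration, here is the shape of that workaround as a small Go driver that shells out to `poktrolld`. The exact subcommands, flags, and argument order (`keys add`, `tx bank multi-send`, `tx service add-service`) are assumptions for the sketch, not the verified CLI surface; the point is only that each concurrent transaction is signed by a different account, so account sequences never collide.

```go
// Sketch only: fund N throwaway accounts with a single multi-send, then add
// services from each account concurrently so every tx uses an independent
// account sequence. Subcommands and flags below are assumed for illustration.
package main

import (
	"fmt"
	"os/exec"
	"sync"
)

// run shells out to poktrolld and wraps any failure with the command output.
func run(args ...string) error {
	out, err := exec.Command("poktrolld", args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("poktrolld %v: %w\n%s", args, err, out)
	}
	return nil
}

func main() {
	const numAccounts = 50
	accounts := make([]string, numAccounts)
	for i := range accounts {
		name := fmt.Sprintf("loadtest-%d", i)
		// Assumed key-management subcommand; adjust to the real CLI.
		_ = run("keys", "add", name)
		accounts[i] = name // in practice, resolve each key's bech32 address
	}

	// Fund all throwaway accounts in one transaction. Assumed to mirror the
	// Cosmos SDK's `tx bank multi-send [from] [to...] [amount]` shape.
	fundArgs := append([]string{"tx", "bank", "multi-send", "faucet"}, accounts...)
	fundArgs = append(fundArgs, "1000000upokt", "--yes")
	_ = run(fundArgs...)

	// Add one service per account in parallel; sequences no longer collide
	// because every tx is signed by a different account.
	var wg sync.WaitGroup
	for i, from := range accounts {
		wg.Add(1)
		go func(i int, from string) {
			defer wg.Done()
			// Hypothetical service-creation subcommand, shown only to
			// illustrate signing each tx with a different account.
			_ = run("tx", "service", "add-service", fmt.Sprintf("svc-%d", i),
				"--from", from, "--yes")
		}(i, from)
	}
	wg.Wait()
}
```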
okdas commented 1 month ago

Performed more testing last week and ended up breaking the infrastructure around the validator's RPC.

To mitigate, I deployed and staked two more validators. I will rerun the largest test yet with relayminers pointed directly at a different node (without the load balancer and ingress-nginx).

okdas commented 1 month ago

The last time we synced on this, we made a decision to:

As any somewhat large load test currently breaks the network (#841), I'll be focusing on secondary goals: observability (lots of changes in #832) and deploying PATH on TestNet.

okdas commented 1 day ago

I have been running into this issue during load testing lately; I'll see if this is low-hanging fruit.


{"level":"info","session_end_height":30,"claim_window_open_height":32,"message":"waiting & blocking until the earliest claim commit height offset seed block height"}
{"level":"info","session_end_height":30,"claim_window_open_height":32,"claim_window_open_block_hash":"e36c39f113e47da8e0b1b3417bd05e823086c63a04f6f296382dd00d3b03877a","message":"observed earliest claim commit height offset seed block height"}
{"level":"info","session_end_height":30,"claim_window_open_height":32,"claim_window_open_block_hash":"e36c39f113e47da8e0b1b3417bd05e823086c63a04f6f296382dd00d3b03877a","earliest_claim_commit_height":32,"message":"waiting & blocking until the earliest claim commit height for this supplier"}
{"level":"info","session_end_height":30,"claim_window_open_height":32,"claim_window_open_block_hash":"e36c39f113e47da8e0b1b3417bd05e823086c63a04f6f296382dd00d3b03877a","earliest_claim_commit_height":32,"message":"observed earliest claim commit height"}
{"level":"info","app_addr":"pokt1mrqt5f7qh8uxs27cjm9t7v9e74a9vvdnq5jva4","service_id":"anvil","session_id":"cb5157c91af08f0d126765b9279f2b0891ef5a56e64d50f396b2273a9464240b","supplier_operator_addr":"pokt19a3t4yunp0dlpfjrp7qwnzwlrzd5fzs2gjaaaj","message":"created a new claim"}
{"level":"error","error":"with hash 0d82ff8b8e65935dae1ed423c9f4e8aa29b2036df3de78d0aea43d07b1e8a1f2: failed to execute message; message index: 0: rpc error: code = FailedPrecondition desc = current block height (37) is greater than session claim window close height (36): claim attempted outside of the session's claim window: tx timed out","message":"failed to create claims"}
{"level":"error","error":"with hash 0d82ff8b8e65935dae1ed423c9f4e8aa29b2036df3de78d0aea43d07b1e8a1f2: failed to execute message; message index: 0: rpc error: code = FailedPrecondition desc = current block height (37) is greater than session claim window close height (36): claim attempted outside of the session's claim window: tx timed out"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x3c419a4]

goroutine 276 [running]:
github.com/pokt-network/poktroll/pkg/relayer/session.(*sessionTree).Delete(0x400192a6e0)
    /Users/dk/pocket/poktroll/pkg/relayer/session/sessiontree.go:285 +0x344
github.com/pokt-network/poktroll/pkg/relayer/session.(*relayerSessionsManager).deleteExpiredSessionTreesFn.func1({0x52d5290, 0x400197fd40}, {0x4000e76a00, 0x1, 0x1})
    /Users/dk/pocket/poktroll/pkg/relayer/session/session.go:478 +0x278
github.com/pokt-network/poktroll/pkg/observable/channel.ForEach[...].func1({0x4000e76a00, 0x1, 0x1})
    /Users/dk/pocket/poktroll/pkg/observable/channel/map.go:103 +0x6c
github.com/pokt-network/poktroll/pkg/observable/channel.goMapTransformNotification[...]({0x52d5290, 0x400197fd40}, {0x52ce590, 0x4000b71ec0}, 0x400013f860, 0x400013f8c0, 0x4000b9c9a0)
    /Users/dk/pocket/poktroll/pkg/observable/channel/map.go:125 +0xc4
created by github.com/pokt-network/poktroll/pkg/observable/channel.Map[...] in goroutine 1
    /Users/dk/pocket/poktroll/pkg/observable/channel/map.go:24 +0x318
[event: pod relayminer1-687547c69f-lvc5h] Container image "poktrolld:tilt-c8d80bb2e7daf0e1" already present on machine
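In the log above, the claim window opens at height 32 and closes at height 36, but the claim tx only lands at height 37; claim creation fails, and the subsequent expired-session cleanup then panics on a nil pointer inside `(*sessionTree).Delete`. A nil-guard in that delete path is one plausible low-hanging fix. The sketch below is illustrative only, using hypothetical struct and interface names rather than the actual `pkg/relayer/session` types:

```go
// Minimal sketch of the kind of nil-guard that would avoid the SIGSEGV when a
// session is deleted after its claim was never committed (e.g. the claim tx
// timed out past the claim window) and the backing store was never opened or
// was already released. Field and interface names here are hypothetical.
package session

import "sync"

type kvStore interface {
	ClearAll() error
	Stop() error
}

type sessionTree struct {
	mu        sync.Mutex
	treeStore kvStore // may be nil if the tree was flushed or never built
}

// Delete releases the session tree's resources, tolerating a missing store
// instead of dereferencing a nil pointer.
func (st *sessionTree) Delete() error {
	st.mu.Lock()
	defer st.mu.Unlock()

	if st.treeStore == nil {
		// Nothing to clean up; the tree was never persisted or was already
		// deleted by an earlier expiration pass.
		return nil
	}
	if err := st.treeStore.ClearAll(); err != nil {
		return err
	}
	err := st.treeStore.Stop()
	st.treeStore = nil
	return err
}
```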
okdas commented 1 day ago

Okaaay, it seems like there's another issue that breaks the network, which we'll need to address before the upgrade. Looking into this as well:


12:34AM INF Timed out dur=14979.481981 height=60 module=consensus round=0 step=RoundStepNewHeight
12:34AM INF received proposal module=consensus proposal="Proposal{60/0 (E8DDDC9B7FD3B5622492459BCCF5B768577045437033E911194DD10B015DC918:1:7CE673CD6F5A, -1) 3376100465F5 @ 2024-10-29T00:34:58.805851558Z}" proposer=A6B0BAD7039843C118CFC588D5A6D38C459B9C25
12:34AM INF received complete proposal block hash=E8DDDC9B7FD3B5622492459BCCF5B768577045437033E911194DD10B015DC918 height=60 module=consensus
12:34AM INF finalizing commit of block hash=E8DDDC9B7FD3B5622492459BCCF5B768577045437033E911194DD10B015DC918 height=60 module=consensus num_txs=0 root=8CF58F38B7F1DC22E6E227E7F74885A80B061E11ED20CA106E2E513553BF7113
12:34AM INF Stored block hash at height 60 EndBlock=SessionModuleEndBlock module=x/session
12:34AM INF found 1 expiring claims at block height 60 method=SettlePendingClaims module=x/tokenomics
12:34AM INF claim does not require proof due to claimed amount (1048950upokt) being less than the threshold (20000000upokt) and random sample (0.35) being greater than probability (0.25) method=proofRequirementForClaim module=server
12:34AM INF Claim by supplier pokt19a3t4yunp0dlpfjrp7qwnzwlrzd5fzs2gjaaaj IS WITHIN LIMITS of servicing application pokt1mrqt5f7qh8uxs27cjm9t7v9e74a9vvdnq5jva4. Max claimable amount >= Claim amount: 6663868upokt >= 1048950 application=pokt1mrqt5f7qh8uxs27cjm9t7v9e74a9vvdnq5jva4 claim_settlement_upokt=1048950 helper=ensureClaimAmountLimits method=ProcessTokenLogicModules module=x/tokenomics num_claim_compute_units=24975 num_relays=24975 service_id=anvil session_id=77eb6177946bf35f3a00c5932d94b64ff47224ad30971063da73405275bda49c supplier_operator=pokt19a3t4yunp0dlpfjrp7qwnzwlrzd5fzs2gjaaaj
12:34AM INF About to start processing TLMs for (24975) compute units, equal to (1048950upokt) claimed actual_settlement_upokt=1048950upokt application=pokt1mrqt5f7qh8uxs27cjm9t7v9e74a9vvdnq5jva4 claim_settlement_upokt=1048950 method=ProcessTokenLogicModules module=x/tokenomics num_claim_compute_units=24975 num_relays=24975 service_id=anvil session_id=77eb6177946bf35f3a00c5932d94b64ff47224ad30971063da73405275bda49c supplier_operator=pokt19a3t4yunp0dlpfjrp7qwnzwlrzd5fzs2gjaaaj
12:34AM INF Starting TLM processing: "TLMRelayBurnEqualsMint" actual_settlement_upokt=1048950upokt application=pokt1mrqt5f7qh8uxs27cjm9t7v9e74a9vvdnq5jva4 claim_settlement_upokt=1048950 method=ProcessTokenLogicModules module=x/tokenomics num_claim_compute_units=24975 num_relays=24975 service_id=anvil session_id=77eb6177946bf35f3a00c5932d94b64ff47224ad30971063da73405275bda49c supplier_operator=pokt19a3t4yunp0dlpfjrp7qwnzwlrzd5fzs2gjaaaj
12:34AM INF sent 1048950upokt from the supplier module to the supplier shareholder with address "pokt19a3t4yunp0dlpfjrp7qwnzwlrzd5fzs2gjaaaj" method=distributeSupplierRewardsToShareHolders module=x/tokenomics
12:34AM INF distributed 1048950 uPOKT to supplier "pokt19a3t4yunp0dlpfjrp7qwnzwlrzd5fzs2gjaaaj" shareholders method=distributeSupplierRewardsToShareHolders module=x/tokenomics
12:34AM ERR error processing token logic modules for claim "77eb6177946bf35f3a00c5932d94b64ff47224ad30971063da73405275bda49c": TLM "TLMRelayBurnEqualsMint": burning 1048950upokt from the application module account: spendable balance 958026upokt is smaller than 1048950upokt: insufficient funds [cosmos/cosmos-sdk@v0.50.9/x/bank/keeper/send.go:278]: failed to burn uPOKT from application module account [/Users/dk/go/pkg/mod/cosmossdk.io/errors@v1.0.1/errors.go:155]: failed to process TLM [/Users/dk/go/pkg/mod/cosmossdk.io/errors@v1.0.1/errors.go:155] claimed_upokt=1048950upokt module=server num_claim_compute_units=24975 num_estimated_compute_units=24975 num_relays_in_session_tree=24975 proof_requirement=NOT_REQUIRED session_id=77eb6177946bf35f3a00c5932d94b64ff47224ad30971063da73405275bda49c supplier_operator_address=pokt19a3t4yunp0dlpfjrp7qwnzwlrzd5fzs2gjaaaj
12:34AM ERR could not settle pending claims due to error TLM "TLMRelayBurnEqualsMint": burning 1048950upokt from the application module account: spendable balance 958026upokt is smaller than 1048950upokt: insufficient funds [cosmos/cosmos-sdk@v0.50.9/x/bank/keeper/send.go:278]: failed to burn uPOKT from application module account [/Users/dk/go/pkg/mod/cosmossdk.io/errors@v1.0.1/errors.go:155]: failed to process TLM [/Users/dk/go/pkg/mod/cosmossdk.io/errors@v1.0.1/errors.go:155] method=EndBlocker module=x/tokenomics
12:34AM ERR CONSENSUS FAILURE!!! err="runtime error: invalid memory address or nil pointer dereference" module=consensus stack="goroutine 180 [running]:\nruntime/debug.Stack()\n\t/opt/homebrew/Cellar/go/1.23.2/libexec/src/runtime/debug/stack.go:26 +0x64\ngithub.com/cometbft/cometbft/consensus.(*State).receiveRoutine.func2()\n\t/Users/dk/go/pkg/mod/github.com/cometbft/cometbft@v0.38.10/consensus/state.go:801 +0x4c\npanic({0x3f299c0?, 0x713b210?})\n\t/opt/homebrew/Cellar/go/1.23.2/libexec/src/runtime/panic.go:785 +0xf0\ngithub.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).FinalizeBlock.func1()\n\t/Users/dk/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.50.9/baseapp/abci.go:860 +0x124\ngithub.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).FinalizeBlock(0x4000223208, 0x4004f50480)\n\t/Users/dk/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.50.9/baseapp/abci.go:892 +0x374\ngithub.com/cosmos/cosmos-sdk/server.cometABCIWrapper.FinalizeBlock({{0xffff74564168, 0x4001081308}}, {0x52d53a8, 0x7202380}, 0x4004f50480)\n\t/Users/dk/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.50.9/server/cmt_abci.go:44 +0x54\ngithub.com/cometbft/cometbft/abci/client.(*localClient).FinalizeBlock(0x400185df20, {0x52d53a8, 0x7202380}, 0x4004f50480)\n\t/Users/dk/go/pkg/mod/github.com/cometbft/cometbft@v0.38.10/abci/client/local_client.go:185 +0xf8\ngithub.com/cometbft/cometbft/proxy.(*appConnConsensus).FinalizeBlock(0x40015806a8, {0x52d53a8, 0x7202380}, 0x4004f50480)\n\t/Users/dk/go/pkg/mod/github.com/cometbft/cometbft@v0.38.10/proxy/app_conn.go:104 +0x1d0\ngithub.com/cometbft/cometbft/state.(*BlockExecutor).applyBlock(_, {{{0xb, 0x0}, {0x40013a2cb9, 0x7}}, {0x40013a2ce0, 0x8}, 0x1, 0x3b, {{0x400534e5a0, ...}, ...}, ...}, ...)\n\t/Users/dk/go/pkg/mod/github.com/cometbft/cometbft@v0.38.10/state/execution.go:224 +0x3c0\ngithub.com/cometbft/cometbft/state.(*BlockExecutor).ApplyVerifiedBlock(_, {{{0xb, 0x0}, {0x40013a2cb9, 0x7}}, {0x40013a2ce0, 0x8}, 0x1, 0x3b, {{0x400534e5a0, ...}, ...}, ...}, ...)\n\t/Users/dk/go/pkg/mod/github.com/cometbft/cometbft@v0.38.10/state/execution.go:202 +0xd8\ngithub.com/cometbft/cometbft/consensus.(*State).finalizeCommit(0x4001729188, 0x3c)\n\t/Users/dk/go/pkg/mod/github.com/cometbft/cometbft@v0.38.10/consensus/state.go:1772 +0xd50\ngithub.com/cometbft/cometbft/consensus.(*State).tryFinalizeCommit(0x4001729188, 0x3c)\n\t/Users/dk/go/pkg/mod/github.com/cometbft/cometbft@v0.38.10/consensus/state.go:1682 +0x2c0\ngithub.com/cometbft/cometbft/consensus.(*State).enterCommit.func1()\n\t/Users/dk/go/pkg/mod/github.com/cometbft/cometbft@v0.38.10/consensus/state.go:1617 +0xb8\ngithub.com/cometbft/cometbft/consensus.(*State).enterCommit(0x4001729188, 0x3c, 0x0)\n\t/Users/dk/go/pkg/mod/github.com/cometbft/cometbft@v0.38.10/consensus/state.go:1655 +0xd90\ngithub.com/cometbft/cometbft/consensus.(*State).addVote(0x4001729188, 0x4002e89a00, {0x0, 0x0})\n\t/Users/dk/go/pkg/mod/github.com/cometbft/cometbft@v0.38.10/consensus/state.go:2335 +0x26c0\ngithub.com/cometbft/cometbft/consensus.(*State).tryAddVote(0x4001729188, 0x4002e89a00, {0x0, 0x0})\n\t/Users/dk/go/pkg/mod/github.com/cometbft/cometbft@v0.38.10/consensus/state.go:2067 +0x50\ngithub.com/cometbft/cometbft/consensus.(*State).handleMsg(0x4001729188, {{0x529e7c0, 0x40016261d8}, {0x0, 0x0}})\n\t/Users/dk/go/pkg/mod/github.com/cometbft/cometbft@v0.38.10/consensus/state.go:929 +0x5c0\ngithub.com/cometbft/cometbft/consensus.(*State).receiveRoutine(0x4001729188, 0x0)\n\t/Users/dk/go/pkg/mod/github.com/cometbft/cometbft@v0.38.10/consensus/state.go:856 +0x5fc\ncreated by 
github.com/cometbft/cometbft/consensus.(*State).OnStart in goroutine 1\n\t/Users/dk/go/pkg/mod/github.com/cometbft/cometbft@v0.38.10/consensus/state.go:398 +0x1e4\n"
12:34AM INF service stop impl=baseWAL module=consensus msg="Stopping baseWAL service" wal=/root/.poktroll/data/cs.wal/wal
12:34AM INF service stop impl=Group module=consensus msg="Stopping Group service" wal=/root/.poktroll/data/cs.wal/wal
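The failing step here is simple arithmetic: `TLMRelayBurnEqualsMint` tries to burn 1048950upokt from the application module account, whose spendable balance is only 958026upokt, and the resulting error path ends in a nil pointer dereference inside the tokenomics EndBlocker, halting consensus. Below is a minimal sketch of the kind of pre-burn guard that would keep this a handled (skippable) error rather than a chain halt; the keeper interface and names are hypothetical and not the actual `x/tokenomics` code:

```go
// Sketch only: surface an application module account shortfall as a normal
// error before burning, so the EndBlocker can log and skip the claim instead
// of panicking and stopping consensus.
package tokenomics

import (
	"errors"
	"fmt"
)

// bankKeeper is a hypothetical, minimal view of the bank keeper used below;
// it is not the real cosmos-sdk interface.
type bankKeeper interface {
	SpendableBalanceUpokt(moduleName string) (uint64, error)
	BurnUpokt(moduleName string, amount uint64) error
}

var ErrInsufficientAppModuleBalance = errors.New(
	"application module account balance below claim amount")

// settleClaimBurn checks the spendable balance before burning the claimed
// amount from the application module account.
func settleClaimBurn(bk bankKeeper, claimedUpokt uint64) error {
	spendable, err := bk.SpendableBalanceUpokt("application")
	if err != nil {
		return err
	}
	if spendable < claimedUpokt {
		// e.g. spendable=958026, claimed=1048950 in the log above.
		return fmt.Errorf("%w: spendable %dupokt < claimed %dupokt",
			ErrInsufficientAppModuleBalance, spendable, claimedUpokt)
	}
	return bk.BurnUpokt("application", claimedUpokt)
}
```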
Olshansk commented 23 hours ago

@red-0ne Can you soft-confirm whether the last one should be solved by the PRs you have open right now?

If so:

  1. Which one?
  2. Can you double-check that there's on-chain safety against this?