Tracking issue: Benchmark TPS of NEAR Protocol

jakmeier commented 1 year ago

This tracks work related to the Pagoda 2023Q2 initiative ROAD-202: Establish a benchmark for 20m transactions per day with 5m from social.near

The high-level goal is to establish a meaningful benchmark that we can use to verify scaling capabilities of NEAR Protocol. Basically, if social.near become more popular and other traffic increases as well, can we handle the workload? At which scale will it stop working?

The way we want to answer these questions is with a benchmark that runs typical mainnet transactions. This same benchmark should also be useful to evaluate future optimizations as we add them.

A rough sketch of tasks to be done in this order:

[x] #9095
[ ] #9097
[x] #9118
[x] #9149
[x] Iterate on the above to ensure we get meaningful results
[ ] #9239
[ ] (stretch) Explore different scaling points (e.g. 50M + 50M) and find out the limits we can support today

Lower priority work to be done at any time:

[ ] #9134

Related issues

There are several ongoing initiatives around what I would roughly describe as "quantify performance". Here is how I think they are related to this issue.

Previous efforts: #8488
- We reuse the framework created. (PR #8847)
- Main difference: We no longer have the p99 latency as a target and we want traffic to be tailored towards a social DB contract and general mainnet traffic (instead of only an FT contract).
Congestion evaluation: #8920
- Probably that issue will use the same framework but it tests correctness rather than performance.
Storage performance metrics: #8897
- That issue looks at storage performance specifically, while the TPS benchmark here is end-to-end. There is a good chance that TPS throughput is also limited by storage performance. Hence, the two projects should inform each other in both directions.
- Storage metrics we care about will be defined => The TPS benchmark should track these metrics and tell us where storage performance is today
- The TPS benchmark defines typical load and what storage limits it reaches => ensure storage metrics are defined in a way that helps measuring this bottleneck

Current status (last updated: 21 June 2023)

Load generating framework is established and tested, only minor improvements planned
Basic load types are complete. We want to add more load type (aurora, storage heavy, ...) and define specific benchamark indices.
Network setup: Manual fork of testnet requires too much work, automation is needed. Localnet testing works well and easy.

jakmeier commented 1 year ago

@bowenwang1996 @akhi3030

I have created this tracking issue for this quarter's benchmarking effort that I want to drive. I wrote down what I think is the high level goal and what steps we want to take to get there.

Can you please review it and let me know if my understanding of our goal is correct? Please point out anything you didn't expect like this. I have to admin, my recollections of our discussions that lead up to this initiative are a bit vague. (And the finalized Jira issue is a bit sparse on description.) Plus, last quarter we had a bit of an alignment problem so I would like to avoid it this time around. :)

bowenwang1996 commented 1 year ago

@jakmeier thanks for writing down all the details in this issue! I think it is very well written and I don't have much to add 😃

akhi3030 commented 1 year ago

@jakmeier: this looks good to me and is aligned with my understanding. Thanks for writing it down!

jakmeier commented 1 year ago

cc @pugachAG @Longarithm regarding storage metrics: Just an FYI that this is the kind of benchmarking I plan to do. As described in the "related issues" section, I see some overlap with storage work and we might want to further align our efforts in the coming days and weeks. Also, please let me know if you have made significant progress in the realm of storage benchmarking that I should be aware of. :)

jakmeier commented 1 year ago

The current load generator has some limitations that I wanted to extend. For example, sending requests at a constant rate, rather then sending as many as possible.

Right now I'm thinking I will use a load generator library such as https://locust.io/ instead of adding the same features one-by-one to the current system. But I am still going to reuse as much as I can, just maybe not the load scheduling part. (As suggested by @akashin when we talked about the load generator today)

jakmeier commented 1 year ago

So with locust, we can define a class NearUser that knows how to send a Near transaction, and then reuse the FT transaction sending framework from the existing loadtest. The FT load definition for locust is then simply a class FtTransferUser(NearUser) like this:

class FtTransferUser(NearUser):
    @task
    def ft_transfer(self):
        receiver = ...
        self.send_tx(
            TransferFT(self.contract_account, self.account, receiver, how_much=1)
        )

Locust will then manage the scheduling and queuing of as many concurrent users as we specify, potentially distributed over many load generating machines.

And the setup code to initilaize an FT user is a method on the same class:

    def on_start(self):
        ft_account = self.contract_account
        # Register account on the FT contract
        self.send_tx(InitFTAccount(ft_account, self.account))
        # Send initial FT balance to user
        self.send_tx(TransferFT(ft_account, ft_account, self.account_id, how_much=1E8))

But some details are still unclear to me.

One question is how to do retrying:

We used a "fire-and-forget" send mechanism so far and then polled on the result, retrying after TTL seconds.
I'm now playing with send_tx_and_wait that would block on the RPC node and then I'd just mark it as failure if after TTL it was not successful.

Another thing I am figuring out right now is what's the best way to defined class NearUser.

In one PoC I have inherited from HttpUser and updated our cluster.py setup to share a requests session with locust. That automatically tracks all HTTP calls and their timing. That seems to be how locust wants to be used. (See docs) But this makes it harder to measure the latency of chained HTTP calls. (Especially if we use polling)
Another approach is to inherit from User without the HTTP and glue events together such that we measure the time between exact point in code that we care about. (E.g. "tx sent to RPC" until "polling for tx gave positive result") This seems a bit more complicated but probably not too bad anyway.

I'll try to make some pragmatic decisions and get a PR ready asap. For now, I just wanted to document the design process.

bowenwang1996 commented 1 year ago

@marcelo-gonzalez what do you use to replay mainnet traffic? Could we do something similar here?

marcelo-gonzalez commented 1 year ago

@marcelo-gonzalez what do you use to replay mainnet traffic? Could we do something similar here?

The script is here: https://github.com/near/nearcore/blob/master/pytest/tests/mocknet/mirror.py (and I'm working on making some changes that will make it easier to use), which sets up a traffic generator that uses the neard mirror command.

Basically what it allows is to run a network of custom size (N validators, M extra rpc nodes, etc) and send it traffic starting from some point in the past on mainnet or testnet. So a first step could be to identify a window in the past where there was interesting/meaningful social DB traffic and prepare that test to start from there. Sadly there's no good script to prepare the images at the moment, so it would be manual, but that's not a huge deal for now. Then after starting the test, we'd get traffic that's pretty close to the traffic that was actually observed (I've run it a few times now, most recently to test the 1.34 release, and it looks like usually <1% of the transactions fail).

So maybe what we could do is to run that, plus another traffic generator sending extra transactions like @jakmeier described above. @jakmeier is there a recent traffic window you can think of that would be nice to replay (in terms of interesting social DB load)? I believe there are saved snapshots of RPC node databases going back quite far, so we could probably replay any 5-epoch-long period

aborg-dev commented 1 year ago

One question is how to do retrying:

We used a "fire-and-forget" send mechanism so far and then polled on the result, retrying after TTL seconds.

I'm now playing with send_tx_and_wait that would block on the RPC node and then I'd just mark it as failure if after TTL it was not successful.

My understanding from the discussion with @bowenwang1996 is that real clients are using broadcast_tx_commit with retries under the hood. This would be equivalent to send_tx_and_wait wrapped in a retry loop. So at least send_tx_and_wait seems like a logical choice for the test that tries to see how the system will behave under real load (e.g. congestion test). I agree that at first, it makes sense to skip retries and return an error after the timeout, though I think we'll eventually want to introduce retries with exponential backoff at this level to be consistent with real clients.

Another thing I am figuring out right now is what's the best way to defined class NearUser.

In one PoC I have inherited from HttpUser and updated our cluster.py setup to share a requests session with locust. That automatically tracks all HTTP calls and their timing. That seems to be how locust wants to be used. (See docs) But this makes it harder to measure the latency of chained HTTP calls. (Especially if we use polling)

I'm not sure I understand why this is the case - shouldn't the latency be measured for a whole @task (and not a single HTTP call issued within a task)? For polling, I would still expect it to happen within the same task, as that is how the real client will behave.

jakmeier commented 1 year ago

@jakmeier is there a recent traffic window you can think of that would be nice to replay (in terms of interesting social DB load)?

No idea, sorry. Maybe there was meaningful traffic during consensus? Maybe the Pagoda hackathon provided some meaningful traffic? But nothing would be close to the type of test I want to do, as I want to reach the network's capacity and go even beyond to test our behavior under congestion.

I'm not sure I understand why this is the case - shouldn't the latency be measured for a whole @task (and not a single HTTP call issued within a task)? For polling, I would still expect it to happen within the same task, as that is how the real client will behave.

Yes, I have written it all in the same task and want the per task latency. However, locust does not gather statistics per task, it gathers them per HTTP call and groups by endpoint. To get per task statistics, I think I need to trigger some custom events. Which isn't too bad, it's just not built into locust, which surprised me a bit.

aborg-dev commented 1 year ago

Yes, I have written it all in the same task and want the per task latency. However, locust does not gather statistics per task, it gathers them per HTTP call and groups by endpoint. To get per task statistics, I think I need to trigger some custom events. Which isn't too bad, it's just not built into locust, which surprised me a bit.

Indeed, that's surprising. We can try using TransactionManager plugin for this: https://github.com/SvenskaSpel/locust-plugins/blob/master/examples/transaction_example.py

jakmeier commented 1 year ago

Yeah I've seen that, seems almost too powerful as it tracks across tasks, not only for a single task. Firing custom events, seems simpler but I will see...

jakmeier commented 1 year ago

So far I am quite happy with locust, and I have a PoC ready in a branch. Here is the link to the Readme describing the setup, in my branch: https://github.com/near/nearcore/tree/tps-benchmark-2023q2/pytest/tests/loadtest/locust

(Not quite ready to submit a PR but not too far off either)

What can the PoC do right now?

It can set up an FT contract and spawn N users that each register themselves and start randomly sending FTs to other users. Then it continuously collects information about requests per second and latency. And it keeps track of all failures. Here some screenshot to give you an idea:

Limitations:

The load generator doesn't properly work when running with multiple workers, so we are limited by a single CPU core (should be easy to fix since locust is all about distributed load generation)
Right now only one FT contract is used, so we are limited to one shard
Workload other than FT contracts have not been defined yet
The load generator assumes an empty database and there are some Nonce mismatches if re-running on the same network without resetting it

Despite all these limits, running a 2-node localnet on my laptop, I observed a stable 200-250 successful transactions per second. So that's on one hand confirming the results from previous quarter. But on the other hand, I didn't run it long enough to see the problems seen there.

Next steps?

I want to define some SocialDB workload
We should run this on a proper network setup and see what it spits out (@marcelo-gonzalez maybe you are interested in running something like this?)

bowenwang1996 commented 1 year ago

I observed a stable 200-250 successful transactions per second

@jakmeier this feels a bit suspicious to me. Each ft_transfer is a function call and therefore consumes at least 4.8Tgas. I am not sure how we can achieve 200-250 of those unless we changed the chunk gas limit in the benchmark.

jakmeier commented 1 year ago

Hm, I don't see the problem. While I ran with a single contract, I did have a multi-shard layout. So the signing half happily happens on a different shard from the execution.

Very rough gas calculation; 250 2.4 Tgas for signing => 600 Tgas per second => 780 Tgas per block if we assume 1.3 block time 250 3 Tgas for execution => 750 Tgas per second => 975 Tgas per block if we assume 1.3 block time

At least I wouldn't predict it any other way based on gas limits. But please do let me know if I missed something.

jakmeier commented 1 year ago

Just a quick update:

I have SocialDB intialization and random follow transactions working. (Details in #9095)
Running a mix of 500 users sending FT transfers and 500 SocialDB users following random other users, I still get 200+ TPS and no stability issues at all. (200TPS ~ 17M daily transactions) (Running with more users causes throttling on the load generator because it is still limited by single core python performance.)

This confirms that gas is not going to be the limiting factor. Nor are there hidden inefficiencies elsewhere in the runtime.

However, it will become more interesting once we increase the state and try to bring RocksDB performance to collapse.

near / nearcore