icon-project / goloop

A blockchain node written in Go
Apache License 2.0

Recommendations for how to increase goloop RPC performance #100

Closed · robcxyz closed this issue 2 years ago

robcxyz commented 2 years ago

I am loading goloop up with RPC calls and am seeing major performance degradation as the number of requests goes up, to the point that goloop slows down to serving only ~10 requests per second. This has nothing to do with CPU / system resources / network: we are testing goloop on a single large dedicated node with 48 threads and watching the CPU, and it drops to near 0 when the stalls happen.

We have tried changing the batch limit and modifying the number of requests we send at a time. While we are able to send tens of thousands of RPS, goloop initially seems able to serve >1k RPS, but over the course of a couple of minutes the RPS slows down to ~10.
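
For reference, the batched requests we send look roughly like the following (a minimal sketch rather than our actual tooling; the local endpoint URL, the /api/v3 path, and the hex-encoded height parameter are assumptions about a typical setup):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// rpcRequest is a standard JSON-RPC 2.0 request envelope.
type rpcRequest struct {
	Jsonrpc string      `json:"jsonrpc"`
	Method  string      `json:"method"`
	Params  interface{} `json:"params"`
	ID      int         `json:"id"`
}

func main() {
	// Assumed local goloop JSON-RPC endpoint; adjust host, port, and path for your node.
	endpoint := "http://localhost:9080/api/v3"
	batchSize := 100
	startHeight := int64(50_000_000)

	// A JSON-RPC 2.0 batch is simply an array of individual requests.
	batch := make([]rpcRequest, 0, batchSize)
	for i := 0; i < batchSize; i++ {
		batch = append(batch, rpcRequest{
			Jsonrpc: "2.0",
			Method:  "icx_getBlockByHeight",
			Params:  map[string]string{"height": fmt.Sprintf("0x%x", startHeight+int64(i))},
			ID:      i,
		})
	}

	body, err := json.Marshal(batch)
	if err != nil {
		panic(err)
	}

	resp, err := http.Post(endpoint, "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("batch response status:", resp.Status)
}
```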

The question is: is there anything that can be set to increase performance? Looking through the goloop code, I see references to a CPU profile, max_rps, max_workers, and concurrency, but I am not able to decipher (1) how to set these parameters and (2) which ones I could modify to see if they have an impact on performance.

While I am still trying to diagnose this, it definitely seems like a bug with goloop, because CPU usage should not drop to near 0 at the same time that throughput collapses.

For more context, this is to support a new version of the tracker backend / indexer, along with figuring out how to properly deploy the community API nodes. We have access to large bare-metal nodes, but given that CPU doesn't seem to be the limiting factor, we might be better off load balancing across a larger set of smaller nodes than a few very powerful ones. That would be far less efficient, though, than actually being able to use the resources of a large node. This is a very important item for us to flesh out as it has a direct impact on the performance of our stack.

robcxyz commented 2 years ago

As an update, I tried changing the batch size down to 1, which brought the goloop_jsonrpc_failure_avg metric down to 1M. In the logs I am also seeing far fewer errors, and CPU usage is much higher.

This does seem indicative of some kind of issue with batching: if batching is enabled, there should also be some other parameter that allows goloop to use more of the system's resources. Any recommendations there would be welcome. Right now it looks like we are getting about 500 RPS, which is reasonable for our use case and for community usage, but it could definitely improve if goloop had access to more of the system's resources.

jspark-icon commented 2 years ago

We have tried changing the batch limit and modifying the number of requests we send at a time. While we are able to send tens of thousands of RPS, goloop initially seems able to serve >1k RPS, but over the course of a couple of minutes the RPS slows down to ~10.

Can I get the exact details to reproduce your situation?

The question is: is there anything that can be set to increase performance?

Yes, some options have to do with performance improvements. But before applying an option, you should understand how it affects performance and consider your purpose.

the goloop_jsonrpc_failure_avg metric

is the moving average of response time in nanoseconds (failure cases only). If you want to see the trend of the failure count, watch the goloop_jsonrpc_failure_cnt metric instead.
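
If the node's metrics are scraped by Prometheus, you can watch the failure trend with a rate() query over goloop_jsonrpc_failure_cnt. A rough sketch of checking both metrics through the Prometheus HTTP API (the Prometheus address is an assumption about your monitoring setup, not something goloop provides):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// Assumed Prometheus server that scrapes the goloop node's metrics.
	prom := "http://localhost:9090/api/v1/query"

	queries := []string{
		// Failed JSON-RPC calls per second over the last minute (failure trend).
		"rate(goloop_jsonrpc_failure_cnt[1m])",
		// Moving average of response time in nanoseconds, failure cases only.
		"goloop_jsonrpc_failure_avg",
	}

	for _, q := range queries {
		resp, err := http.Get(prom + "?query=" + url.QueryEscape(q))
		if err != nil {
			panic(err)
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		fmt.Printf("%s => %s\n", q, body)
	}
}
```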

robcxyz commented 2 years ago

Can I get the exact details to reproduce your situation?

I am using this tool we built to send RPC requests to nodes for syncing the tracker backend, and I am running lots of experiments around batching to figure out how to sync as fast as possible, since these RPC calls are the bottleneck in my pipeline. A few observations to point out:

Can you recommend some settings to increase performance / allow goloop to use more CPU?

jspark-icon commented 2 years ago

sudoblockio/icon-extractor seems to use icx_getBlockByHeight and icx_getTransactionResult only. In that case the bottleneck could be rocks-db (or disk IO), because those APIs only retrieve data and every request is an uncacheable read. However, I don't understand the result of your test where RPS slows down to ~10: if icon-extractor has not yet reached the latest height of the ICON2 mainnet and there are enough blocks left to retrieve, that shouldn't happen. And to measure performance, use the jsonrpc_retrieve_cnt metric instead of goloop_jsonrpc_failure_avg.

Can you recommend some settings to increase performance / allow goloop to use more CPU?

The CPU usage ratio doesn't matter, as described above. To increase the performance of icon-extractor, you may consider using multiple endpoints of physically distributed goloop nodes. Or you can build a tool that reads from rocks-db directly if you can reuse the goloop source.
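
For example, spreading the block-height requests over several nodes could look roughly like this (a minimal sketch with placeholder node URLs; the /api/v3 path, the port, and the hex height encoding are assumptions, not confirmed goloop defaults):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"sync"
)

// fetchBlock sends a single icx_getBlockByHeight request to one endpoint.
func fetchBlock(endpoint string, height int64) error {
	req := map[string]interface{}{
		"jsonrpc": "2.0",
		"method":  "icx_getBlockByHeight",
		"params":  map[string]string{"height": fmt.Sprintf("0x%x", height)},
		"id":      height,
	}
	body, err := json.Marshal(req)
	if err != nil {
		return err
	}
	resp, err := http.Post(endpoint, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return nil
}

func main() {
	// Placeholder endpoints for physically distributed goloop nodes.
	endpoints := []string{
		"http://node-a:9080/api/v3",
		"http://node-b:9080/api/v3",
		"http://node-c:9080/api/v3",
	}

	heights := make(chan int64)
	var wg sync.WaitGroup

	// One worker per endpoint; each pulls heights from the shared queue,
	// so faster or less-loaded nodes naturally take more of the work.
	for _, ep := range endpoints {
		wg.Add(1)
		go func(ep string) {
			defer wg.Done()
			for h := range heights {
				if err := fetchBlock(ep, h); err != nil {
					fmt.Println(ep, "height", h, "error:", err)
				}
			}
		}(ep)
	}

	for h := int64(0); h < 1000; h++ {
		heights <- h
	}
	close(heights)
	wg.Wait()
}
```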

robcxyz commented 2 years ago

Hi @jspark-icon - Thank you for your reply and help.

RE disk usage: this was the first thing I looked at after CPU usage, and I noticed the disk throughput wasn't very high and IOPS were steady at 20,000 tops. goloop was run on a server with direct-attached NVMe drives in RAID 0, with an IOPS limit in the hundreds of thousands at least.

RE the result of the low ~10 RPS: this was when we had the batch size set to 100, but the number still represents the throughput of actual requests (i.e. not 10 * 100). We are now not using any batching and are getting more consistent results. The performance degradation did not seem immediate and was only observed over time under high load. I will try to replicate this again in the future so I can give concrete settings that reproduce the behavior.

RE measuring performance: I don't have jsonrpc_retrieve_cnt, but I have been monitoring rate(goloop_jsonrpc_retrieve_cnt[1m]) to see trends. Mostly we are looking at throughput as the number of blocks per second we are able to extract, which is load balanced over several nodes as you suggested.

RE rebuilding the icon-extractor to read from rocks-db: that is something we are definitely interested in doing long term, but for now we are indexing over RPC. That approach works for indexing chains like Ethereum and should be suitable for ICON, and it also gives the community a solution they can use without running a full node, though as shown above it presents other challenges.

Happy to look into this more later but for now I am planning on running more instances of goloop on smaller nodes and load balancing across them.

jspark-icon commented 2 years ago

jsonrpc_retrieve_cnt is a typo (missing the goloop_ prefix); goloop_jsonrpc_retrieve_cnt is correct.

robcxyz commented 2 years ago

@jspark-icon - Ahh my mistake. Was just reading metrics off the dashboard...

robcxyz commented 2 years ago

Closing this for now; I will open a new issue when we both have more bandwidth, and we can discuss offline.