near / read-rpc

Read-only NEAR RPC centralized-like performant solution

Performance problem ft. `near-vm` #236

Closed khorolets closed 1 month ago

khorolets commented 6 months ago

After upgrading the underlying near-vm dependency (to nearcore 1.39.x), which we use to support query.call_function (view calls), we observed a performance issue.

The ReadRPC node eats up all the RAM under load (250-300 req/s) in about 17 minutes. We checked the previous version for comparison and found that it also dies eventually, just later (at a load of 500-800 req/s it lasts roughly 12 hours).

We are currently narrowing down the problem and trying to find a solution. Current assumptions:

  1. We use near-vm in the wrong way
  2. The custom-ish caches we pass to near-vm have a bug
  3. near-vm has a bug

We bet on the first one and are trying to figure out the correct way of using it and what we are doing wrong.

Some tech details about what we are checking

We invoke near-vm here using tokio::task::spawn_blocking() since near_vm_runner::run is synchronous.

We tried removing the spawn_blocking, and it helped, but the throughput dropped to about 20 requests per second. That is not an option for us.

So, right now, we are trying to optimize that call to achieve the best performance and resource consumption.
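For illustration, here is a minimal sketch of that call pattern, assuming anyhow for error handling; the actual arguments to near_vm_runner::run and the handling of its outcome are elided:

```rust
use tokio::task;

// Sketch only: the real call site builds the VMContext and passes the contract
// code, configs, and caches. The point is that the synchronous, CPU-bound VM
// invocation runs on tokio's blocking thread pool instead of an async worker.
async fn run_view_call() -> anyhow::Result<()> {
    task::spawn_blocking(move || {
        // let outcome = near_vm_runner::run(/* code, method, context, caches, ... */)?;
        // ...inspect the outcome and return the call result...
        Ok(())
    })
    .await? // a panic inside the blocking task surfaces here as an error
}
```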

khorolets commented 6 months ago

I have a lead.

The with_compiled_and_loaded method that gets called under the hood accepts a code: Option<&ContractCode> parameter.

ref: https://github.com/near/nearcore/blob/14334b9e4e2bff198537130324f2bb9702f54d6d/runtime/near-vm-runner/src/near_vm_runner.rs#L348-L359

Following it down the road, it appears that passing the code here is a signal for near-vm-runner to compile the contract, even if we already have the compiled artifact in the cache and the code hasn't changed since.

That looks like the place where we use the near-vm-runner crate incorrectly 🤦‍♂️ We naively pass the code on every call, hoping near-vm-runner handles the caching automatically, but it appears it does not.

I've hacked around a bit and checked one contract with our synthetic test, and heaptrack profiling shows we don't leak.

@kobayurii will implement logic to decide whether to pass the code when using the run method. After that, we will be able to stress test it and make sure the fix helped.
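A rough sketch of the decision we have in mind, under the assumption above that passing the code triggers compilation while None lets the runner reuse the cached artifact (the names are illustrative, not the actual read-rpc API):

```rust
// Illustrative helper, not the real read-rpc code: `C` stands in for ContractCode,
// and `already_compiled` for a lookup by code hash in our compiled-contract cache.
fn code_to_pass<C>(already_compiled: bool, contract_code: &C) -> Option<&C> {
    if already_compiled {
        // The compiled artifact is already cached: pass None so near-vm-runner
        // reuses it instead of recompiling the wasm on every view call.
        None
    } else {
        // First time we see this code hash: pass the code so it gets compiled
        // and stored in the cache.
        Some(contract_code)
    }
}
```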

kobayurii commented 6 months ago

https://github.com/near/read-rpc/pull/245

khorolets commented 6 months ago

The performance issue still exists.

In addition to that, when we deployed the optimistic block finalities support #186 to production, we observed an increase in latency because we were overloading ScyllaDB with read requests.

We need to investigate how we read the state data during query.call_function, because I suspect we are doing something wrong there and it has to be either optimized or redesigned.

khorolets commented 6 months ago

Quick update: the memory leak problem seems to be hidden behind the load balancer and autoscaler, so basically, as long as we don't push a single ReadRPC server instance to its limits, the instances behave well.

Now we are facing another problem with how we use ScyllaDB to retrieve the state data: we issue too many queries.

So when we serve more requests (with the support of optimistic block finalities #186), we issue even more queries to ScyllaDB and overload it, which leads to increased latency and timeouts.

Meanwhile, we have turned the feature off in production while we investigate the issue and look for solutions.

Right now, we are adding more metrics to measure how bad the situation is and setting up an experimental environment to work on the issue; a sketch of the kind of metric we have in mind is below.

Probably related to #150 and #193
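For the metrics mentioned above, here is a minimal sketch of the kind of counter we are adding, assuming the prometheus and once_cell crates; the metric name and labels are illustrative, not the actual read-rpc metric set:

```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_counter_vec, IntCounterVec};

// Counts ScyllaDB queries issued, labeled by the RPC method that triggered them,
// so we can see how many state reads a single query.call_function fans out into.
static SCYLLA_QUERIES: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!(
        "readrpc_scylla_queries_total",
        "Number of ScyllaDB queries issued, labeled by RPC method",
        &["method"]
    )
    .expect("failed to register metric")
});

// Call this next to every state-read query we send to ScyllaDB.
fn record_state_read(method: &str) {
    SCYLLA_QUERIES.with_label_values(&[method]).inc();
}
```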