Metrics that show gas weight multiplied gas consumption vs compute time

jakmeier commented 1 year ago

Prerequisite: #8033

Once this is done, our gas profile updates and our parameters are aligned. This allows us to show metrics that compare execution time with the gas spent for most parameters.

The goal is that whenever we charge a parameter, we also measure the time it takes to finish the work paid for by that amount of gas. Then we can keep histogram counters and display them in Grafana. That way, we could immediately see which parameters are under- or overcharged in practice. This information would be REALLY valuable, both as an additional data point for the gas parameter estimator and for live-debugging observed slowness of block production on mainnet or testnet.

However, this sounds easier than it is. We charge gas parameters in a nested fashion, which can be tricky to match with execution time. Consider the following example.

pay(A);
{
  workload_a_0();
  pay(B);
  {
    workload_b();
  }
  workload_a_b();
}

Here we want to measure the time it took to execute workload_b() and compare it to the cost in pay(B). Plus, we want to measure the time used for A and compare it with its cost. But for A, we would need to subtract the time and cost of B to get a meaningful output.

My high-level idea here would be to make GasCounter or ProfileData smarter, such that whenever we charge a parameter, it pushes it onto an internal stack that keeps track of which parameter is currently active, i.e. which one has paid for the current execution. It could even return a guard type that automatically pops the stack again once it goes out of scope.
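
A minimal sketch of what such a stack with a guard type could look like. Everything here is hypothetical (`GasStack`, `ChargeGuard`, `Frame`, and the parameter names "A" and "B" are made up for illustration and are not nearcore APIs). It assumes guards are dropped in reverse order of creation, which plain lexical scoping gives for free. When a guard is dropped, its elapsed time is added to the parent frame's `child_time`, so the parent only gets credited with its own time.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::rc::Rc;
use std::thread::sleep;
use std::time::{Duration, Instant};

/// One active charge: which parameter paid, when it started, and how much
/// time nested charges have already claimed.
struct Frame {
    param: &'static str,
    start: Instant,
    child_time: Duration,
}

#[derive(Default)]
struct Inner {
    stack: Vec<Frame>,
    /// Accumulated (gas, self time) per parameter.
    totals: HashMap<&'static str, (u64, Duration)>,
}

/// Hypothetical stand-in for a smarter GasCounter / ProfileData.
#[derive(Default)]
pub struct GasStack(Rc<RefCell<Inner>>);

/// Pops its frame from the stack when it goes out of scope.
pub struct ChargeGuard {
    inner: Rc<RefCell<Inner>>,
}

impl GasStack {
    /// Records the gas and pushes a frame for the paying parameter.
    pub fn pay(&self, param: &'static str, gas: u64) -> ChargeGuard {
        let mut inner = self.0.borrow_mut();
        inner.totals.entry(param).or_default().0 += gas;
        inner.stack.push(Frame { param, start: Instant::now(), child_time: Duration::ZERO });
        ChargeGuard { inner: Rc::clone(&self.0) }
    }

    pub fn totals(&self, param: &str) -> Option<(u64, Duration)> {
        self.0.borrow().totals.get(param).copied()
    }
}

impl Drop for ChargeGuard {
    fn drop(&mut self) {
        let mut inner = self.inner.borrow_mut();
        let frame = inner.stack.pop().expect("guard without a frame");
        let elapsed = frame.start.elapsed();
        // Self time excludes whatever nested charges already accounted for.
        let self_time = elapsed.saturating_sub(frame.child_time);
        inner.totals.entry(frame.param).or_default().1 += self_time;
        // Report the full elapsed time to the parent so it can subtract it.
        if let Some(parent) = inner.stack.last_mut() {
            parent.child_time += elapsed;
        }
    }
}

/// Mirrors the nested example above, with sleeps standing in for workloads.
pub fn nested_example() -> GasStack {
    let gas = GasStack::default();
    {
        let _a = gas.pay("A", 10);
        sleep(Duration::from_millis(5)); // first part of A's workload
        {
            let _b = gas.pay("B", 3);
            sleep(Duration::from_millis(20)); // B's workload
        }
        sleep(Duration::from_millis(5)); // second part of A's workload
    }
    gas
}
```

After `nested_example()`, the time recorded for "A" covers only the two short sleeps, while the long sleep is attributed to "B".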

The above high-level sketch works for the nested example. But it will struggle in places where we charge multiple costs together, such as the base cost and the per-byte cost for a deployment action. I suggest we add a method for such cases that pays the two atomically and generates only a single entry on the gas accounting stack.
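
A rough sketch of such a combined method, under the assumption that the accounting stack is a simple `Vec` of frames. The names `pay_many`, `Frame`, `deploy_base`, and `deploy_per_byte`, as well as the gas numbers, are placeholders and not actual nearcore parameters or APIs. Timing is left out to keep the focus on the single stack entry.

```rust
/// Hypothetical frame: one entry on the accounting stack, possibly covering
/// several parameters that were charged together.
#[derive(Debug, PartialEq)]
pub struct Frame {
    pub charges: Vec<(&'static str, u64)>,
}

/// Hypothetical, simplified gas accounting state.
#[derive(Default)]
pub struct GasStack {
    pub burnt: u64,
    pub stack: Vec<Frame>,
}

impl GasStack {
    /// Charging a single parameter is the one-element case of `pay_many`.
    pub fn pay(&mut self, param: &'static str, gas: u64) -> Option<()> {
        self.pay_many(&[(param, gas)])
    }

    /// Charges all listed costs or none of them, and pushes exactly one frame.
    /// Returns `None` on arithmetic overflow without modifying any state.
    pub fn pay_many(&mut self, charges: &[(&'static str, u64)]) -> Option<()> {
        let mut total = 0u64;
        for (_, gas) in charges {
            total = total.checked_add(*gas)?;
        }
        self.burnt = self.burnt.checked_add(total)?;
        self.stack.push(Frame { charges: charges.to_vec() });
        Some(())
    }
}

/// Example: base and per-byte cost of a deployment, paid as one entry.
pub fn deploy_example(code_len: u64) -> GasStack {
    let mut gas = GasStack::default();
    gas.pay_many(&[("deploy_base", 100), ("deploy_per_byte", 2 * code_len)])
        .expect("no overflow for small inputs");
    gas
}
```

With one frame per combined charge, the time measured for that frame can be compared against the sum of the gas in `charges`, without having to split the time between base and per-byte cost.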

aborg-dev commented 1 year ago

I've started looking into this by adding time measurements for the sha256 host function: https://github.com/near/nearcore/commit/2def4f986ede57bc5e25854ffd9603a724dc68f4.

I think your plan to extend ProfileData and GasCounter with some stack structure to make taking these measurements simple makes sense, and I'll look into it after I figure out the end-to-end story. For now, I'm scoping this to a simple host function without nesting.

To get a useful comparison for sha256 in Grafana, I think we need another layer of pre-processing that merges the sha256_base and sha256_byte costs so they can be compared with the corresponding time measurement (decoupling the two seems tricky and probably does not yield much value). I don't yet know whether this pre-processing should be done in Grafana or on the profile export side.
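
As a sketch of what this merging step could compute, assuming it is done before export: the base and per-byte costs are summed into one gas amount, which is then compared against the measured time. The helper names are hypothetical, and the conversion constant assumes the commonly quoted target of roughly 1 Tgas per millisecond of compute, which is an assumption of this sketch rather than something taken from this thread.

```rust
use std::time::Duration;

/// Assumed calibration target: 1 Tgas (10^12 gas) pays for roughly 1 ms,
/// i.e. 10^6 gas per nanosecond.
const GAS_PER_NANOSECOND: f64 = 1_000_000.0;

/// Merges a base cost and a per-byte cost into one gas amount.
pub fn combined_gas(base: u64, per_byte: u64, num_bytes: u64) -> u64 {
    base.saturating_add(per_byte.saturating_mul(num_bytes))
}

/// Ratio of measured time to the time the gas paid for.
/// Above 1.0 the work took longer than paid for (undercharged),
/// below 1.0 it was faster (overcharged). Infinite if `gas` is zero.
pub fn time_to_gas_ratio(elapsed: Duration, gas: u64) -> f64 {
    let paid_ns = gas as f64 / GAS_PER_NANOSECOND;
    elapsed.as_nanos() as f64 / paid_ns
}
```

For example, 1 Tgas charged for work that took 2 ms gives a ratio of 2.0, which would show up as an undercharged operation.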

Some questions:

Is there any guidance on how to deploy and test this feature in some dev environment? I'm thinking about running a node with these profiles and then showing them on a local or remote Grafana instance.
Is sha256 a good starting point, or should I pick something different? I'm asking from the point of view of seeing interesting data in Grafana.

jakmeier commented 1 year ago

Regarding testing, I would recommend a GCP setup. I'll DM you on Zulip.
sha256 seems quite interesting to look at for a start. If you want something as simple as possible, validator_total_stake might be best, though it is probably not very interesting to look at in Grafana.

And yeah, I think merging base and per-byte costs will be necessary, but I don't really know how to do that in a nice way. Some extra food for thought: you should also include the gas costs that are hidden inside get_memory_or_register! and self.registers.set. They charge wasm_read_memory_base, wasm_read_memory_byte, wasm_read_register_base, and wasm_read_register_byte.

Hm, I wonder if it might be easier to read the burned gas counter when entering the host function and again when returning from it. It wouldn't be per parameter, but that's ultimately not really a requirement.
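
A sketch of that simpler approach, assuming a stripped-down counter. The `GasCounter` below is a hypothetical stand-in and not the real nearcore type, and the gas amounts are placeholders. The wrapper reads the burnt gas before and after the host function and pairs the difference with the elapsed time, so hidden charges (such as memory and register reads) are included automatically.

```rust
use std::time::{Duration, Instant};

/// Hypothetical, stripped-down gas counter.
#[derive(Default)]
pub struct GasCounter {
    burnt_gas: u64,
}

impl GasCounter {
    pub fn burn(&mut self, gas: u64) {
        self.burnt_gas += gas;
    }
    pub fn burnt_gas(&self) -> u64 {
        self.burnt_gas
    }
}

/// One measurement: all gas burnt during a host call and the time it took.
pub struct HostCallSample {
    pub name: &'static str,
    pub gas: u64,
    pub elapsed: Duration,
}

/// Runs `host_fn` and records the burnt-gas delta and wall-clock time.
pub fn measure_host_call<R>(
    name: &'static str,
    counter: &mut GasCounter,
    host_fn: impl FnOnce(&mut GasCounter) -> R,
) -> (R, HostCallSample) {
    let gas_before = counter.burnt_gas();
    let start = Instant::now();
    let result = host_fn(&mut *counter);
    let elapsed = start.elapsed();
    let gas = counter.burnt_gas() - gas_before;
    (result, HostCallSample { name, gas, elapsed })
}

/// Example with a fake host function that charges several costs internally.
pub fn example() -> HostCallSample {
    let mut counter = GasCounter::default();
    counter.burn(7); // burnt before the call, so not attributed to it
    let input = [0u8; 32];
    let (_len, sample) = measure_host_call("sha256", &mut counter, |c| {
        c.burn(100); // placeholder for hidden read-memory costs
        c.burn(50 + 2 * input.len() as u64); // placeholder base + per-byte cost
        input.len()
    });
    sample
}
```

The resulting sample is per host function call rather than per parameter, which matches the trade-off described above.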

Starting with just one host function and keeping it as simple as possible seems like the best option for now. But once you want to generalize it, you could even do it in the add_import! macro: https://github.com/near/nearcore/blob/83aeb4a6f66f1b5414ada6055718a4e89a507544/runtime/near-vm-runner/src/imports.rs#L270

We already have some tracing support there, which you can observe with RUST_LOG=host-function=trace when you run a node.

jakmeier commented 1 year ago

It's been a few months without updates on this issue, so I'll write down the status.

I believe we are no longer working on this at the moment. But we already have metrics and dashboards that show gas or compute cost vs wall-clock time on a per-chunk level. We just don't have them on a per-parameter basis. It's unclear if we want to spend the effort any time soon to add such fine-grained metrics.

Metrics that show gas weight multiplied gas consumption vs compute time #8258