
Proposal: Revamp parameter estimation #6445

Open jakmeier opened 2 years ago

jakmeier commented 2 years ago

Secure gas cost estimation is hard and not well explored. The closest well-known thing is benchmarking systems to compare performance. That itself is already non-trivial. But people have studied it, created tools for it, and written a vast literature that describes how to do it scientifically. To prove that A is faster than B in doing X, it is generally accepted to measure both many times and show that the confidence intervals of the two distributions do not overlap. This disproves the null hypothesis that both perform the same, making the result statistically significant.

For gas cost estimation, we face a more difficult problem. Statistical differences are almost irrelevant for us. Instead, we have to assume a byzantine model for most of our workloads. That means, for our estimation to be correct, we would have to prove that in all possible circumstances, even maliciously crafted ones, the observed time to execute is not above our value. Empirical benchmarks, by definition, are never able to cover all possible circumstances. Thus, we have to come up with ideas for how to get something that has a high probability of hitting the worst cases.

For the initial values of all parameters, the strategy used looks to me like a benchmarking setup: find an average and multiply all values by 3 to be "on the safe side". This is a valid approach to get things started. But it does not give us the robustness we seek for a security-critical system.

This is not just a theoretical problem. We have repeatedly found cases where estimations that initially seemed reasonable turned out to be undercharging by large factors. We would also not be the first blockchain project to fall victim to resource exhaustion attacks based on miscalculated costs. (EIP-150, for example.)

This document describes a proposal to address these fundamental issues in gas cost estimation. I argue we can do so by only changing the estimation strategies, without touching the gas charging model. Thus, it would have a minimal footprint on user experience, as the only thing blockchain users could observe are parameter adjustments.

Why exactly am I unhappy with today's estimation model?

Let me put a finger on what exactly goes wrong today. The estimation model has three important parts: setup, workload, and measurement. For our estimations today, we use one setup (a clean blockchain state with a fixed number of accounts inserted), two metrics (time and icount), and a different workload for each parameter. The setup and the measurement parts especially are broken, each in its own way.

Using the wrong setup

The setup we use is decently standardized to the degree that it gives us reproducible results. Unfortunately, that says nothing about its accuracy. With a large enough number of accounts inserted and caches turned off where possible, I believe it is close to one of many realistic scenarios. But it does not simulate the worst-case.

Essentially, we pretend that we can run a workload once in a lab setting and conclude from it how it will perform in all possible scenarios of the real world.

One more problem here is that the lab setting runs a single shard, while the real world runs multiple shards that compete over shared resources, such as IO. We could easily have near-livelocks in production and never catch them in the lab.

Measuring the wrong thing

The first metric, measuring wall-clock time, is completely dependent on the hardware it is running on. And even then, the results can have high variation. As far as I know, these were the reasons we created the other metric in the first place.

The second metric counts x86 CPU instructions, read IO bytes, and write IO bytes. The counts are multiplied by three different factors. These factors were estimated, too, and have since been fixed for all future estimations.

Similar to the setup, this is great at producing reproducible results. But the accuracy is questionable at best. The total number of IO bytes and CPU operations is not a good measure to describe a workload, if we know nothing about the IO patterns and what the operations are. (I will get into further details and explanations below.)

Both of these metrics are acceptable for benchmarking. But for estimating limits, they are fundamentally not able to give us the confidence we need to operate NEAR Protocol securely.

The suggested model in a nutshell

For example, an estimation of writing to the blockchain store from within a smart contract could be described by this protocol-layer pattern.

storage_set(ABC, "MyValue")

One possible DB pattern (depending on DB state) corresponding to this:

    Database(ColState) get(trie-node A)
    Database(ColState) get(trie-node AB)
    Database(ColState) get(trie-node ABC)
    Database(ColState) transaction
    {
        DBOp::Insert(trie-node A)
        DBOp::Insert(trie-node AB)
        DBOp::Insert(trie-node ABC)
    }

    Database(ColTrieChanges) transaction
    {
        DBOp::Insert(block0|shard0|insert:trie-node A,trie-node AB,trie-node ABC)
    }

The replay tool can now digest both patterns and run them in a number of settings. For example, with a huge RocksDB as database that is also running compaction in the background. Or it can run it while other threads are also accessing the database, simulating multiple shards being tracked on one node.
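To make the replayer a bit more concrete, here is a rough Rust sketch of what a captured DB trace and a naive replay loop could look like. All type and trait names here are made up for illustration; this is not existing nearcore code.

```rust
use std::time::{Duration, Instant};

/// One recorded low-level database operation (hypothetical shape).
enum DbOp {
    Get { col: &'static str, key: Vec<u8> },
    Insert { col: &'static str, key: Vec<u8>, value: Vec<u8> },
}

/// A captured trace is just the ordered list of operations.
struct DbTrace {
    ops: Vec<DbOp>,
}

/// Stand-in for whatever storage backend the replayer targets, e.g. a large
/// pre-populated RocksDB instance with compaction running in the background.
trait Db {
    fn get(&self, col: &str, key: &[u8]) -> Option<Vec<u8>>;
    fn insert(&mut self, col: &str, key: &[u8], value: &[u8]);
}

/// Replay a trace against some database handle and return the elapsed time.
fn replay(trace: &DbTrace, db: &mut dyn Db) -> Duration {
    let start = Instant::now();
    for op in &trace.ops {
        match op {
            DbOp::Get { col, key } => {
                let _ = db.get(col, key);
            }
            DbOp::Insert { col, key, value } => {
                db.insert(col, key, value);
            }
        }
    }
    start.elapsed()
}
```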

We can also have the algorithm "know" that "MyValue" can be changed to strings of certain lengths and try a couple of extreme cases. Potentially in the future, a genetic search algorithm can assist in finding the absolute worst values. (Same for finding bad DB states)

The replay tool will then output an estimated time for executing all reads in the trace and one for all writes. In combination with the time-based measurement, we can derive a total cost that fully captures CPU and IO costs.
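How exactly the components are combined into gas is an open design question. Purely as an illustration, it could look like the sketch below; the conversion factor and the simple sum of components are assumptions, not a proposal for the real formula.

```rust
// Placeholder conversion factor from nanoseconds to gas, for illustration only.
const GAS_PER_NANOSECOND: u64 = 1_000_000;

struct EstimatedCost {
    cpu_time_ns: u64,      // from the time-based measurement
    read_io_time_ns: u64,  // from replaying the read trace
    write_io_time_ns: u64, // from replaying the write trace
}

impl EstimatedCost {
    fn to_gas(&self) -> u64 {
        // Charge each component separately so that neither CPU nor IO is
        // hidden behind the other, as it would be in a single wall-clock
        // measurement of the full workload.
        (self.cpu_time_ns + self.read_io_time_ns + self.write_io_time_ns)
            * GAS_PER_NANOSECOND
    }
}
```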

Intermediate results, mapping equivalent DB-layer patterns to an estimated maximum throughput, can be cached in a persistent database, to avoid unnecessary recomputation.
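A minimal sketch of such a cache, assuming a made-up notion of trace equivalence (here: same columns, operation kinds and value lengths):

```rust
use std::collections::HashMap;
use std::time::Duration;

// (column, operation kind, value length) per recorded DB operation.
#[derive(Hash, PartialEq, Eq)]
struct TraceFingerprint(Vec<(String, String, usize)>);

struct ReplayCache {
    results: HashMap<TraceFingerprint, Duration>,
}

impl ReplayCache {
    /// Return the cached replay time for an equivalent trace, or run the
    /// (expensive) replay closure and remember its result.
    fn get_or_replay(
        &mut self,
        fingerprint: TraceFingerprint,
        replay: impl FnOnce() -> Duration,
    ) -> Duration {
        *self.results.entry(fingerprint).or_insert_with(replay)
    }
}
```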

The work items we need to implement, and how they could be split into a few milestones that each add incremental value, are described at the bottom of this document. But first, let me explain how exactly this proposed framework solves many of our problems. I also list alternative solutions that I considered for each problem.

Problems that the new estimation model fixes

CPU ops cost curve is not linear

Icount-estimation assumes every x86-64 instruction requires the same, fixed amount of time. We apply the same idea to WASM operations, which currently cost a fixed amount of gas, no matter which operation it is. But there are two crucial differences between x86-64 and WASM.

  1. It is easy to get an overview over all WASM operations. The complexity of each operation is low and bounded. x86-64, in comparison, has a gigantic range of complexity between operations.
  2. The cost per WASM operation needs to be simple because we charge it dynamically at runtime. The estimation can afford to be more complex.

Given these points, it seems unreasonable to me to estimate gas cost based on Op count * X.

Proposed solution : Use time-based measurements for CPU cost to capture true execution cost without the need to understand every detail of CPUs.
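As a sketch of what "time-based" means here, the estimator could repeat a workload and take a pessimistic quantile of the observed times. The repetition count and the choice of the 99th percentile are placeholders, not a fixed policy.

```rust
use std::time::{Duration, Instant};

/// Run a workload many times and report a high-percentile wall-clock time
/// instead of trying to model the CPU instruction by instruction.
fn estimate_cpu_time(mut workload: impl FnMut(), iterations: usize) -> Duration {
    let mut samples: Vec<Duration> = (0..iterations)
        .map(|_| {
            let start = Instant::now();
            workload();
            start.elapsed()
        })
        .collect();
    samples.sort();
    samples[(iterations * 99) / 100] // e.g. the 99th percentile
}
```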

Alternative 1 : Instead of counting x86-64 ops, add up cost based on instruction scheduling tables (as used in public compilers) for a better estimate of the cost. This is much better than equal cost for all operations but still fails to account for the reality of how out-of-order, super-scalar CPUs execute x86-64.

Alternative 2 : Static analysis of x86-64 operations during estimation, to find theoretical bounds including knowledge about CPU internals. As accurate as this would be, it is also much more involved, requires access to a decent x86-64 performance model, and is highly local to a specific CPU architecture.

Time-based measurements are not additive

Reducing the cost of a transaction to a single dimension, time, has its own problems. A system is always constrained by a number of subsystems, for example CPU and disk. These subsystems run in parallel, subject to constraints from data dependencies. A time measurement is often dominated by one bottleneck, and the other subsystems play only a small role in the final measurement.

To show why this is problematic, here is an example. Assume that we want to know the cost of CreateAccountAction but we cannot measure it directly. Instead, we measure SomeMeasurementOverhead+CreateAccountAction. Then, we measure SomeMeasurementOverhead and subtract it from what was measured before. This can yield very wrong results, if SomeMeasurementOverhead and CreateAccountAction are bottlenecked by different subsystems.

The problem is easiest to demonstrate with a numeric example:

| | CPU | Read IO | Write IO | Total measured time |
|---|---|---|---|---|
| `SomeMeasurementOverhead` | 1s | 0.1s | 0.1s | 1.1s |
| `CreateAccountAction` | 0.5s | 1s | 0.1s | 1.1s |
| `CreateAccountAction` + `SomeMeasurementOverhead` | 1.5s | 1.1s | 0.2s | 1.7s |

If the table above shows the real costs but we can only measure SomeMeasurementOverhead = 1.1s and CreateAccountAction +SomeMeasurementOverhead = 1.7s, we might conclude that CreateAccountAction = 0.6s. But it should be 1.1s!

This anomaly occurs because the execution time has a non-linear dependency (the max function in this example) on its subsystems. The reality is probably not perfectly described by the max of all subsystems, but it is almost certainly not perfectly linear.
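The same anomaly in code form, assuming (only for this toy example, consistent with the table above) that the measured total is max(CPU, read IO) + write IO:

```rust
// Toy model of a measured execution time: CPU and read IO overlap, writes add.
fn measured(cpu: f64, read: f64, write: f64) -> f64 {
    cpu.max(read) + write
}

fn main() {
    let overhead = measured(1.0, 0.1, 0.1); // 1.1s
    let combined = measured(1.0 + 0.5, 0.1 + 1.0, 0.1 + 0.1); // 1.7s
    let naive_estimate = combined - overhead; // 0.6s
    let real_cost = measured(0.5, 1.0, 0.1); // 1.1s
    println!("naive: {naive_estimate}s, real: {real_cost}s");
}
```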

Proposed solution I : Measure directly whenever possible.

Proposed solution II : If subtraction is absolutely necessary, make sure to subtract the time measured for individual components, rather than final gas values.

Alternative 1 : Never subtract measurements, just measure the closest upper bound we can.

Alternative 2 : Use the icount-based estimation. I already described why it does not work for CPU. The next couple of points tackle the IO problems.

IO cost curve is not linear

Representing IO costs as x bytes read and y bytes written is not a good summary of IO costs. Reading 4kB could mean everything is contained in a single page and only causes one IO operation. Or, it could mean 4k synchronous IO operations.

Proposed solution : Capture traces of the exact addresses/keys and values that are read and written. Replay those traces on one or more reference systems to receive a gas cost.

Alternative :

Write IO is asynchronous

After execution of a block has finished, the writes are still pending in the background. They might be sitting in the write buffers of RocksDB or the OS. Maybe a background thread is already working on getting the writes through, or maybe it will only be triggered after some more data is written.

It is possible to force flush all buffers and observe the total data written in this way. But this has a lot of side-effects on CPU and read IO. This makes it difficult to measure write IO of a specific workload alongside CPU and read IO costs.

Proposed solution : Using the proposed trace replayer, estimate read IO and write IO in separate runs.
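A rough sketch of such a write-only replay run, using the `rocksdb` crate. The options and the final flush are simplifications; a real run would want a large pre-populated database and realistic write options.

```rust
use rocksdb::{Options, DB};
use std::time::{Duration, Instant};

/// Replay only the writes of a trace against a RocksDB instance and time it,
/// including a forced flush so the buffered writes are part of the result.
fn replay_writes(db_path: &str, writes: &[(Vec<u8>, Vec<u8>)]) -> Duration {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    let db = DB::open(&opts, db_path).expect("failed to open replay DB");

    let start = Instant::now();
    for (key, value) in writes {
        db.put(key, value).expect("write failed");
    }
    // Force buffered writes to disk so they are included in the measurement.
    db.flush().expect("flush failed");
    start.elapsed()
}
```

Doing this in a dedicated run avoids the side effects that a forced flush would have on the CPU and read IO measurements of the same workload.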

Multiple shards per machine share the same DB

Today, a validator tracks 4 shards. In the future, it could be more or it could be less. In any case, we have to include in our gas cost model that multiple shards are validated by the same machine. This requires an estimation strategy that respects the multi-shard setup.

CPU cores and memory requirements can be scaled up linearly with the number of shards. As these resources are mostly independent between shards, this is unlikely to cause problems. For IO, however, this is not so clear. Linear scaling of required IOPS might sound reasonable, too. But if the database uses any kind of locking, this will quickly lead to stalls no matter how many IOPS are available.

Proposed solution : The replayer runs on multiple shards at once to measure IO timings in a shared-DB scenario.
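A sketch of what the multi-shard replay could look like: one thread per tracked shard, all hammering the same shared database handle. `SharedDb` is a stand-in trait, not an existing interface.

```rust
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Stand-in for a thread-safe handle to the shared database.
trait SharedDb: Send + Sync {
    fn get(&self, key: &[u8]) -> Option<Vec<u8>>;
}

/// Replay one read trace per shard, all at the same time, and return the
/// per-shard wall-clock times. Contention on the shared DB shows up here
/// in a way a single-shard run cannot reveal.
fn replay_shards(db: Arc<dyn SharedDb>, shard_traces: Vec<Vec<Vec<u8>>>) -> Vec<Duration> {
    let handles: Vec<_> = shard_traces
        .into_iter()
        .map(|keys| {
            let db = Arc::clone(&db);
            thread::spawn(move || {
                let start = Instant::now();
                for key in &keys {
                    let _ = db.get(key);
                }
                start.elapsed()
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```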

Alternative : Change the client such that each shard has its own database instance.

Hardware is unique and, at best, only partially ordered

One computer is not strictly faster than another; it always depends on the workload. Therefore, it is dangerous to do estimations on only one machine.

Proposed solution : Once this replayer is implemented, we could add an extra layer that combines estimations run on different machines and picks the global worst case. On top of that, we could rerun the cheap time-based estimation daily, but only re-estimate IO costs if there are new traces that are not equivalent to existing traces.
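The aggregation layer itself could be very simple, something along these lines (representing parameters as strings is just for illustration):

```rust
use std::collections::HashMap;

/// Combine per-machine estimation results (parameter name -> gas value) by
/// keeping the largest observed value for every parameter.
fn worst_case(per_machine: &[HashMap<String, u64>]) -> HashMap<String, u64> {
    let mut combined: HashMap<String, u64> = HashMap::new();
    for machine in per_machine {
        for (param, &gas) in machine {
            let entry = combined.entry(param.clone()).or_insert(0);
            *entry = (*entry).max(gas);
        }
    }
    combined
}
```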

Systems are designed to be best-case on average

There is a general problem with the experimental/empirical approach to estimating costs. If we pick one specific execution time as our estimation, it will either be too optimistic or too pessimistic.

Too optimistic: Even if we try to think of the absolute worst-case scenario, we can never be sure there is not a different case where things would be even slower. If a malicious actor finds this case, they might be able to use it as an attack vector.

Too pessimistic: The difference between average time to execute a workload and its 99th percentile is huge. Taking the CPU as an example, the average case probably has >50% hits in L1 cache and predicts most branches correctly. The 99th percentile probably has to go to main memory and deals with constant branch mispredictions. This translates to orders of magnitude in wall-clock time. If we charge for the 99th percentile, the servers will be idle most of the time when not running malicious code.

We need to strike a balance between charging the expected case and the worst case.

Proposed solution : Keep the current empirical strategy but use more than one measurement. The number of measurements can be increased over time: starting with manually crafted settings, adding more as we encounter more cases of undercharging. But ideally, we eventually turn to an automated system that searches for DB states and WASM codes that are particularly slow to execute, for example genetic algorithms, which have been demonstrated to successfully find slow-running contracts in the EVM (Perez, Daniel & Livshits, Benjamin. (2019). Broken Metre: Attacking Resource Metering in EVM). (Note that this becomes much more feasible once we have split the execution into smaller parts that can run independently. Today's full-system runs are presumably too slow to effectively perform such an automated search. This changes a lot if we can see that two inputs produce an equivalent DB trace and thus can use cached estimation results.)
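To sketch the shape of such an automated search: keep a population of inputs, mutate the slowest ones, and repeat. The mutation operator and the fitness function here are placeholders; a real implementation would follow the cited paper much more closely.

```rust
use std::time::{Duration, Instant};

/// Evolve a population of candidate inputs towards slower execution times and
/// return the slowest input found together with its measured time.
fn search_worst_case(
    mut population: Vec<Vec<u8>>,
    mut execute: impl FnMut(&[u8]),
    mutate: impl Fn(&[u8]) -> Vec<u8>,
    generations: usize,
) -> (Vec<u8>, Duration) {
    let mut best = (population[0].clone(), Duration::ZERO);
    for _ in 0..generations {
        // Measure every candidate and remember the slowest one seen so far.
        let mut timed: Vec<(Duration, Vec<u8>)> = population
            .iter()
            .map(|input| {
                let start = Instant::now();
                execute(input);
                (start.elapsed(), input.clone())
            })
            .collect();
        timed.sort_by(|a, b| b.0.cmp(&a.0)); // slowest first
        if timed[0].0 > best.1 {
            best = (timed[0].1.clone(), timed[0].0);
        }
        // Next generation: keep the slowest half and add mutated variants.
        population = timed
            .iter()
            .take((timed.len() + 1) / 2)
            .flat_map(|(_, input)| vec![input.clone(), mutate(input)])
            .collect();
    }
    best
}
```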

Some nice side-effects

Negatives

Remaining estimator problems not solved directly

Network not accounted for

This proposal does not include network resources. I believe we need to add that at some point and I could see it easily fit into the overall suggested strategy, as one more throughput bottleneck to measure.


Work Items

Possible Milestones

Implementing all of this at once is too risky. So here are some milestones that we can tackle one by one, each adding incremental benefit.

  1. DB trace capturing and a simple replayer. This can already be useful for storage performance debugging and can help us better understand how good/bad all our storage-related gas parameters are today with respect to different RocksDB states.
  2. Improved DB replayer that separates reads from writes. (Same benefit as milestone 1 but more accuracy.)
  3. Increase efficiency of DB replayer by adding a cache for "equivalent traces"
  4. Capture and replay protocol-layer traces. This is the last puzzle piece to create a new estimator metric that estimates the time of CPU and IO separately.

Non-essential steps for the future:

  1. Manually add more DB states to find more slow cases
  2. Iterate over all combinations of extreme values for the variable inputs of the protocol layer
  3. Explore automated smart search for worst-case state and inputs. (e.g. using genetic algorithm) (this could also be a student project)
Ekleog commented 2 years ago

An additional drawback is that the costs would then become non-reproducible, while AFAIU they currently are reproducible. In a way, this puts us more in a position of a trusted third party than with a reproducible evaluation model.

This being said, I don't think this drawback is significant enough to outweigh the gains listed here :)