solana-labs / rbpf

Rust virtual machine and JIT compiler for eBPF programs
Apache License 2.0

Address Translation & Memory Model #193

Closed: Lichtso closed this issue 3 years ago

Lichtso commented 3 years ago

The address translation is known to be the bottleneck of most workloads in the BPF VM at the moment.
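For context, every guest load and store currently goes through a region lookup roughly like the following (a minimal sketch with assumed names and layout, not the actual rbpf implementation):

```rust
// Illustrative only: region-based address translation, i.e. the work every
// guest load/store has to pay for before it can touch host memory.
struct MemoryRegion {
    vm_addr: u64,   // start of the region in the VM's address space
    host_addr: u64, // start of the backing allocation on the host
    len: u64,       // region length in bytes
}

// Translate a VM address to a host address, or fail if the access does not
// fit entirely inside one known region.
fn translate(regions: &[MemoryRegion], vm_addr: u64, size: u64) -> Option<u64> {
    for region in regions {
        if vm_addr >= region.vm_addr
            && vm_addr.saturating_add(size) <= region.vm_addr.saturating_add(region.len)
        {
            return Some(region.host_addr + (vm_addr - region.vm_addr));
        }
    }
    None // access violation: the VM must abort
}
```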

Ideas to improve the situation:

dmakarov commented 3 years ago

Has there been any profiling done on what fraction of time RBPF spends in address translation for some benchmark? (Not the percentage of memory access instructions in a specific program, but the actual wall-clock time spent in RBPF's address translation code relative to the whole program execution time.)

What would be the maximum theoretical improvement if address translation overhead were 0?

Lichtso commented 3 years ago

bench_jit_vs_interpreter_address_translation: 1016299 ns per loop block
bench_jit_vs_interpreter_empty_for_loop: 96902 ns per loop block

Each loop block is 65536 iterations. So we are talking about 10x possible improvement.

Timing according to this recent CI benchmark run.

dmakarov commented 3 years ago

I think this is not exactly a correct comparison, because the first loop does a load from memory plus address translation, while the second loop does neither the address translation nor the load. It would be better to measure the time it takes to do the address translation and then compute the percentage of the whole execution time spent in address translation; that would show how much running time would improve if the address translation overhead were 0. As it stands, your 10x figure also includes the time of the load operation on every loop iteration, not just the translation.

Lichtso commented 3 years ago

It does not matter that much if you subtract the rest of the loop:
1016299 / 96902 ~ 10.5
(1016299 - 96902) / 96902 ~ 9.5

I disabled the emission of x86 load instructions in the JIT, so that only the address translation happens. I tested it a few times with and without load instructions and the result remains the same (within the variance of the benchmark). Probably because it stays in the cache.

In other words: (1016299 - 96902) / 1016299 ~ 90% of the time is spent in address translation in that particular example.

dmakarov commented 3 years ago

This comparison is not very useful. If you compare a program that does only load instructions in this way, your speedup will be infinite, because without address translation the program's running time will approach 0. This is why I think it is important to take characteristic benchmarks representative of on-chain programs and estimate the expected speedup on such benchmarks.

dmakarov commented 3 years ago

except maybe it can be used as a baseline for the time it takes to execute a single address translation: roughly 1 ms / 65536, or less than 16 ns.

Lichtso commented 3 years ago

This is why I think it is important to take characteristic benchmarks representative of on-chain programs

Sure, but how do you measure the time spent in address translation and nothing else, without changing the timing of what you measure? These benchmarks are far from perfect but still allow a rough estimate of the ballpark we are talking about.

dmakarov commented 3 years ago

This is why I think it is important to take characteristic benchmarks representative of on-chain programs

Sure, but how do you measure the time spent in address translation and nothing else, without changing the timing of what you measure?

You instrument the address translation code and measure the time it takes. You also measure the overhead of taking the time measurement itself and subtract that, if you want to be utterly precise, although in this case the instrumentation overhead may be negligible.
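In practice that instrumentation could look roughly like this (a sketch; `translate`, `TRANSLATION_NANOS` and `timer_overhead_nanos` are hypothetical names for illustration, not anything rbpf exposes):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;

// Accumulated wall-clock time spent inside address translation across a run.
static TRANSLATION_NANOS: AtomicU64 = AtomicU64::new(0);

fn translate_instrumented(vm_addr: u64) -> u64 {
    let start = Instant::now();
    let host_addr = translate(vm_addr); // the routine actually being measured
    TRANSLATION_NANOS.fetch_add(start.elapsed().as_nanos() as u64, Ordering::Relaxed);
    host_addr
}

// Estimate the per-call cost of the time measurement itself, so it can be
// subtracted from TRANSLATION_NANOS afterwards.
fn timer_overhead_nanos(samples: u32) -> u64 {
    let start = Instant::now();
    let mut sink = 0u128;
    for _ in 0..samples {
        let t = Instant::now();
        sink += t.elapsed().as_nanos(); // keep the inner measurement from being optimized out
    }
    let total = start.elapsed().as_nanos() as u64;
    let _ = sink;
    total / u64::from(samples)
}

fn translate(vm_addr: u64) -> u64 {
    vm_addr // placeholder for the real region lookup
}
```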

These benchmarks are far from perfect but still allow a rough estimate of the ballpark we are talking about.

A rough estimate of what exactly? Of one load per four arithmetic instructions? Yes, like I said, it may be taken as a baseline for the time it takes to do a single address translation. If, for example, 90% of your address translations happen for addresses that were previously translated, then simple software caching of translated addresses could speed up the address translation dramatically. It's important to know what you're optimizing and what the potential benefits are...
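A caching scheme along those lines could look roughly like the following (a sketch only; `translate_slow`, the page granularity, and the single-entry policy are assumptions for illustration, and a real cache would also have to re-validate bounds and permissions on every hit):

```rust
// A tiny software "TLB" that remembers the last translated page so repeated
// accesses to the same page skip the full region lookup.
const PAGE_SHIFT: u64 = 12; // assume 4 KiB granularity
const PAGE_MASK: u64 = (1 << PAGE_SHIFT) - 1;

#[derive(Default)]
struct TranslationCache {
    entry: Option<(u64, u64)>, // (VM page number, host page base address)
}

impl TranslationCache {
    fn translate(&mut self, vm_addr: u64) -> u64 {
        let page = vm_addr >> PAGE_SHIFT;
        if let Some((cached_page, host_base)) = self.entry {
            if cached_page == page {
                // Hit: the expensive lookup is skipped entirely.
                return host_base + (vm_addr & PAGE_MASK);
            }
        }
        // Miss: do the full lookup once and remember the page mapping.
        let host_base = translate_slow(page << PAGE_SHIFT);
        self.entry = Some((page, host_base));
        host_base + (vm_addr & PAGE_MASK)
    }
}

fn translate_slow(vm_addr: u64) -> u64 {
    vm_addr // placeholder for the full region-by-region lookup
}
```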

jon-chuang commented 3 years ago

My worry is that the measured addr translation overhead of 12-16ns (I've measured this locally too) doesn't explain the measured median CU/us of Serum on mainnet, corresponding to about 150ns/opcode.

The median across programs is about 20 ns per opcode (50 CU/us), and those programs are probably not dominated by loads and stores...

Lichtso commented 3 years ago

@jon-chuang Keep in mind that this was measuring a tight loop with all loads always staying in cache. This is why I am working on a better benchmark solution here: https://github.com/solana-labs/rbpf/pull/197

Actually, you could also run some of your own by checking out that branch and adding: emit_stopwatch(jit, true)?; here and ?; emit_stopwatch(jit, false) here

jon-chuang commented 3 years ago

@Lichtso To follow up on the results produced here: https://github.com/solana-labs/solana/pull/17393#issuecomment-875262123 Here is the result of tracing and counting the opcodes in trace.out from bpf-ristretto with the following inputs:

../../solana/target/release/rbpf-cli --use jit  --input input.json target/deploy/ec_math.so --trace
Program input:
accounts []
insndata [1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 236, 255, 255, 255, 255, 255, 7, 0, 255, 255, 255, 255, 255, 255, 7, 0, 255, 255, 255, 255, 255, 255, 7, 0, 255, 255, 255, 255, 255, 255, 7, 0, 255, 255, 255, 255, 255, 255, 7, 0]

where the insndata corresponds to instruction::field_mul(two, minus_one), performing the multiplication 1000 times.

trace:
and 6016
call 26039
stx 97494
ldx 116495
lsh 130027
mul 155007
rsh 231033
add 249108
mov 361187

Total: 1340351

Total insn: 1455581

The time taken is 2839670 ns (2839 ns per field mul). This is a slowdown over native (20 ns per field mul) of approximately 140x.

Assuming a 21 ns ldx/stx time and a 4.0 GHz CPU, we obtain the following slowdown over zero-overhead ldx/stx, assuming IPC = 2 (for a predominantly ALU workload):

(97494 + 116495) * 21 * 4 * 2 / 1455581 ≈ 24.

The slowdown from a lightweight ldx/stx of 8 cycles is

((97494 + 116495) * 8 * 2 + (1455581 - 97494 + 116495)) / 1455581 ≈ 3.36.

This suggests a 7.1x performance boost is possible (24 / 3.36 ≈ 7.1), depending on how fast the lightweight ldx/stx can be made.

This still leaves a roughly 20x performance gap (140 / 7.1) that cannot be closed by faster loads and stores alone...

jon-chuang commented 3 years ago

Btw, this is the CU/us for the above run, which is unbelievably high compared to mainnet's 50CU/us median: 546.389264264

Here is the CU/us for:
edwards_add: 444.989488437
edwards_mul: 432.879176471

This leads me to believe that somehow, the VM running on mainnet is not particularly optimised... perhaps the VM there is somehow built as debug rather than release... Or, mainnet programs are dominated by loads and stores...

But it still doesn't explain why Serum has such poor CU/us...

Lichtso commented 3 years ago

What does the time measurement of mainnet include? Just VM execution or account serialization & deserialization as well? Because the copying of accounts is also known to be a huge time sink.

jon-chuang commented 3 years ago

The data is obtained from: https://github.com/solana-labs/solana/pull/16984/files

Here is what is stated:

These numbers only include program execution (message processor, vm setup/takedown, parameter ser/de) but do not include account loading/storing

So it may not be an accurate reflection, as you say

Still, according to several sources, memcpy does about 20GB/s. So even a Serum memcpy with 1MB data would take 50us. That's a small fraction of the total wallclock time.

I measured this on an 8KiB load. It takes about 120us/MB.

Lichtso commented 3 years ago

If you want to compare the standalone VM with the one deployed on mainnet, then you should only use the execution time and explicitly exclude parameter ser/de. That might also explain where the "missing" performance gap comes from.

jon-chuang commented 3 years ago

Here is the result of using the flattened, non-rust-call version of address translation, alongside turning off stack frame gaps, immediate sanitisation, and the compute meter: 865838 ns, or 865 ns per field mul. That's a 3.28x improvement.

Turning off environment register encryption: 559782ns, or 560ns per field mul.

That's a 5x improvement from turning off the security features.

Still about 25x slower than native, but hey, we're making progress discovering sources of slowdown.