Run cpu instruction calibration on a variety of hardware

What

Calibrate the cpu instructions on a variety of hardware that the validators run on.

Why

The metering model is deterministic across all nodes. The model is currently calibrated on a single machine (M1), which may differ from the actual hardware that validators use. As a result, the actual compute time for the same number of CPU instructions can vary, which could affect ledger close time. The network resource limits need to be set conservatively w.r.t. the worst case. We need to calibrate on various hardware architectures in order to figure out the correct bounds.

I talked to @anupsdf about this and we concluded two points:

The key question isn't how many "model CPU instructions" a contract takes, it's how much time, and so we really will want to set our network limits by reasoning backwards from the observed virtual-instructions-per-unit-real-time value we see in the network, empirically, on the nodes we're running (which, as you mention, are likely to be different enough from our workstations to warrant empirical observation). This is fine, but it means that the actual instruction count values are mostly irrelevant. They're just a term in an equation we divide out to get the number we set the limit to. E.g. if the network says it's processing 20 virtual instructions per nanosecond and we want to limit contracts to 1ms, then we set the instruction limit to 20M instructions. But if it says 5 instructions per ns, we set the limit to 5M instructions. The actual "instructions number" doesn't matter to setting a "time target".
That said, it's a little confusing to think about and might be misleading to users to see "virtual instruction counts" that are much higher than the (unknown but plausibly estimable) true instruction count for the machine they're on. So for the sake of not confusing people, it'd be good to calibrate the model instruction counts to their values as measured on x86-64 machines, because we expect most validators to be on that arch.
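
To make the arithmetic in the first point concrete, here is a minimal sketch. The function name `instruction_limit` and its parameters are hypothetical, made up for illustration; nothing like this is taken from the codebase.

```python
NS_PER_MS = 1_000_000  # nanoseconds in one millisecond


def instruction_limit(instructions_per_ns: float, time_budget_ms: float) -> int:
    """Instruction limit matching a wall-clock budget at an observed rate.

    instructions_per_ns: virtual instructions per nanosecond, as observed
        empirically on the network (hypothetical input).
    time_budget_ms: the time target we want a contract to stay under.
    """
    return round(instructions_per_ns * time_budget_ms * NS_PER_MS)


# The two cases from the comment above: same 1ms target, different observed rates.
print(instruction_limit(20, 1))  # 20000000
print(instruction_limit(5, 1))   # 5000000
```

The limit scales linearly with whatever rate is observed, which is why the absolute scale of the "instruction" unit drops out once a time target is fixed.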

So .. I'm going to take this and just run calibration on the x86-64 machine I have here. Doesn't matter what its clock frequency is, we're only talking instruction counts of the cost centers.

Some investigation and results here (I meant to discuss this with @jayz22 but I'll make a note here for future reference too):

x64 actually gives very incorrect-seeming numbers when I try to calibrate
I think part of what's happening is that for memcpy (and memcpy-like cost centers, of which there are a few) we pass large buffers through to rust's slice-copying code which will engage a throughput-oriented fast path for memory copies -- using AVX2 instructions but with a relatively high setup/teardown overhead for the process.
I think we actually don't want to calibrate against this path at all; it's misleading since almost all calls to memcpy-like costs will be for much smaller single-struct or small-buffer sort of chunks of memory.
I think it might just make sense to define the cost for memcpy-like costs analytically, from reason, rather than from measurement. I think we can assume that a memory-copy can move, for example, 8 bytes per instruction on a machine with 8-byte (64-bit) words, and then just define the cost of moving N bytes as N/8 instructions (or perhaps 1 + N/8 so we never charge zero).
I also noticed while I was thinking about this and exploring the cost runners that there are multiple cost types that all probably cost the same thing. I think HostMemCpy, HostMemCmp, ValSet, ValDeser, MapEntry, VecEntry, VmMemRead and VmMemWrite should all logically be the same cost -- the cost of N bytes of main-memory access -- and we might consider merging them for simplicity.
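
A minimal sketch of the analytical memcpy-like cost proposed above, assuming 8-byte words and integer (floor) division. The name `mem_access_cost` is hypothetical, and the rounding choice is an assumption; the comment only says N/8 or 1 + N/8.

```python
WORD_BYTES = 8  # assumption from the comment: 64-bit machine, 8 bytes moved per instruction


def mem_access_cost(n_bytes: int, word_bytes: int = WORD_BYTES) -> int:
    """Analytical cost, in model instructions, of touching n_bytes of memory.

    Uses the 1 + N/8 variant so that a zero-length access is never free.
    Floor division is an assumption made for this sketch.
    """
    if n_bytes < 0:
        raise ValueError("n_bytes must be non-negative")
    return 1 + n_bytes // word_bytes


for n in (0, 8, 64, 1024):
    print(n, mem_access_cost(n))
```

If the cost types listed above were merged, they could all share one function of this shape, differing at most in a per-type constant.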

Posting my calibration results on m1 and x86 (- m1, + x86, full outputs attached below):

-                cost_type     cpu_model_const_param     cpu_model_lin_param     mem_model_const_param     mem_model_lin_param
-             HostMemAlloc                      1123                       1                        16     128
-               HostMemCpy                        32                      24                         0     0
-               HostMemCmp                        24                      64                         0     0
-     DispatchHostFunction                       262                       0                         0     0
-              VisitObject                       158                       0                         0     0
-                   ValSer                       646                      66                        18     384
-                 ValDeser                      1127                      34                        16     128
-        ComputeSha256Hash                      2877                    4125                        40     0
-     ComputeEd25519PubKey                     25640                       0                         0     0
-                 MapEntry                        84                       0                         0     0
-                 VecEntry                        35                       0                         0     0
-         VerifyEd25519Sig                    400983                    2685                         0     0
-                VmMemRead                       182                      24                         0     0
-               VmMemWrite                       178                      25                         0     0
-          VmInstantiation                    916377                   68226                    129471     5080
-         InvokeVmFunction                      1128                       0                        14     0
-     ComputeKeccak256Hash                      2882                    3561                        40     0
- ComputeEcdsaSecp256k1Key                     37899                       0                         0     0
- ComputeEcdsaSecp256k1Sig                       224                       0                         0     0
- RecoverEcdsaSecp256k1Key                   1667731                       0                       201     0
-             Int256AddSub                      1714                       0                       119     0
-                Int256Mul                      2226                       0                       119     0
-                Int256Div                      2332                       0                       119     0
-                Int256Pow                      5223                       0                       119     0
-              Int256Shift                       415                       0                       119     0
-        ChaCha20DrawBytes                      4857                    2461                         0     0

+                cost_type     cpu_model_const_param     cpu_model_lin_param     mem_model_const_param     mem_model_lin_param
+             HostMemAlloc                       310                       0                        16     128
+               HostMemCpy                        52                       0                         0     0
+               HostMemCmp                        55                      36                         0     0
+     DispatchHostFunction                       239                       0                         0     0
+              VisitObject                        34                       0                         0     0
+                   ValSer                       564                       0                        18     384
+                 ValDeser                      1104                       0                        16     128
+        ComputeSha256Hash                      3943                    6812                        40     0
+     ComputeEd25519PubKey                     40356                       0                         0     0
+                 MapEntry                        55                       0                         0     0
+                 VecEntry                         0                       0                         0     0
+         VerifyEd25519Sig                    654651                    4288                         0     0
+                VmMemRead                       210                       0                         0     0
+               VmMemWrite                       209                       0                         0     0
+          VmInstantiation                    459816                   49469                    129471     5080
+         InvokeVmFunction                      1189                       0                        14     0
+     ComputeKeccak256Hash                      4076                    5962                        40     0
+ ComputeEcdsaSecp256k1Key                     58314                       0                         0     0
+ ComputeEcdsaSecp256k1Sig                       249                       0                         0     0
+ RecoverEcdsaSecp256k1Key                   2323402                       0                       181     0
+             Int256AddSub                      1620                       0                        99     0
+                Int256Mul                      2209                       0                        99     0
+                Int256Div                      2150                       0                        99     0
+                Int256Pow                      3925                       0                        99     0
+              Int256Shift                       379                       0                        99     0
+        ChaCha20DrawBytes                      2155                    1051                         0     0

The main difference is, as @graydon pointed out, that the memory-related operations appear to be constant costs (with a larger const factor) on x86. I believe this is what you are talking about?

I think the analytical approach makes sense. I've noticed some of those memory-related calibration results are pretty sensitive to the size of the sample (e.g. VecEntry #1051) and haven't found a good way to get around that.

Re: cost type consolidation, I think it makes sense to consolidate some of those types, especially the {host, vm} mem-cmp/cpy/read/write ones. I will look into it further.

output_m1.txt output_x86.txt

(A bit of extra information: my x86 CPU is an Intel 2012Q2 model, with the AVX (not AVX2) extension.)

Re: cost type consolidation and using analytical model

HostMemCpy vs HostMemCmp: from what I understand (also from calibrated results), memcmp requires loading values from two memory locations and comparing them (2 MOV + 1 CMP). memcpy is logically just 1 MOV. So they should probably be two different analytical models, with the linear coefficient of memcmp being 3x larger (which also somewhat matches calibration results)?
VmMemRead and VmMemWrite: I think these can be consolidated into HostMemCpy, since underneath it is just doing `copy_from_slice` (plus some small overhead of resolving the memory entity).
VecEntry and MapEntry: these two are just memory accesses and can probably be consolidated into HostMemCpy, although I'm not sure the coefficients should be the same, since there is a bit of extra container logic like index bounds checking. (Calibration numbers do not provide good guidance here. See https://github.com/stellar/rs-soroban-env/issues/1051)
ValSer and ValDeser: this is the one I'm least sure about. Logically they are also just doing memory copying. However, there can be a fair amount of overhead due to XDR structuring and recursion. Looking at the M1 results, at least the linear coefficients are comparable to HostMemCpy/Cmp.

This is a very crude analysis and is stretching my low-level knowledge a bit. @graydon let me know what you think.
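
As a rough check of the memcmp-vs-memcpy point, the sketch below compares the 3:1 ratio implied by the "2 MOV + 1 CMP vs 1 MOV" reasoning against the linear coefficients from the M1 calibration table posted earlier. The per-word op counts are the simplifying assumption stated in the comment, not measured values, and the variable names are made up for illustration.

```python
# Assumed per-word operation counts from the reasoning above.
MEMCPY_OPS_PER_WORD = 1  # 1 MOV
MEMCMP_OPS_PER_WORD = 3  # 2 MOV + 1 CMP

# cpu_model_lin_param values copied from the M1 table in this thread.
M1_LIN = {"HostMemCpy": 24, "HostMemCmp": 64}

analytical_ratio = MEMCMP_OPS_PER_WORD / MEMCPY_OPS_PER_WORD
measured_ratio = M1_LIN["HostMemCmp"] / M1_LIN["HostMemCpy"]

print(f"analytical: {analytical_ratio:.2f}, measured on M1: {measured_ratio:.2f}")
```

The x86 table is not usable for the same comparison, since its HostMemCpy linear coefficient came out as 0.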

ValSer and ValDeser are clearly different from a simple memcpy once deep structure nesting is taken into account. See https://github.com/stellar/rs-soroban-env/issues/1102

Re: cost type consolidation

WasmMemAlloc can be removed now (use HostMemAlloc instead), since we have moved away from the memory fuel concept and all memory allocation is now done on the host side via ResourceLimiter.

Just had a conversation with @MonsieurNicolas. He expressed concerns about calibration numbers not being accurate and reproducible due to advanced instruction sets (e.g. AVX, AVX2). While the first-principles model for mem copy works, AVX might be messing with other calibration numbers on x86_64. So in order to have more confidence in the calibration numbers and improve reproducibility, he has suggested:

During calibration, compile to a generic x86 target via march=x86-64 (more info can be found here). This will still include some extensions such as MMX and SSE, but hopefully they 1. don't mess with calibration results too much (i.e. preserve the correct linear characteristics) and 2. are ubiquitous enough that every node should have them.

I will give it a try.

hmm. avx2 is 10 years old, there's nothing in the field that doesn't speak avx2. I am not sure this is really related to the constant-factor-ness of our measurements on those machines -- if we really want to correct that fact I think we should figure out why it's happening rather than just fiddling with codegen options (which none of our users will fiddle with anyways)

stellar / rs-soroban-env

Run cpu instruction calibration on a variety of hardware #1020