stellar / rs-soroban-env

Rust environment for Soroban contracts.
Apache License 2.0

Run cpu instruction calibration on a variety of hardware #1020

Closed jayz22 closed 1 year ago

jayz22 commented 1 year ago

What

Calibrate the CPU instruction costs on a variety of the hardware that validators run on.

Why

The metering model is deterministic across all nodes. The model is currently calibrated on a single machine (M1), which may differ from the actual hardware that validators use. As a result, the actual compute time can vary for the same number of CPU instructions, which could affect ledger close time. The network resource limits need to be set conservatively w.r.t. the worst case. We need to calibrate them on various hardware architectures in order to figure out the correct bounds.

graydon commented 1 year ago

I talked to @anupsdf about this and we concluded two points:

  1. The key question isn't how many "model CPU instructions" a contract takes, it's how much time, and so we really will want to set our network limits by reasoning backwards from the observed virtual-instructions-per-unit-real-time value we see in the network, empirically, on the nodes we're running (which, as you mention, are likely to be different enough from our workstations to warrant empirical observation). This is fine, but it means that the actual instruction count values are mostly irrelevant. They're just a term in an equation we divide out to get the number we set the limit to. E.g. if the network says it's processing 20 virtual instructions per nanosecond and we want to limit contracts to 1ms (1,000,000 ns), then we set the instruction limit to 20 million instructions. But if it says 5 instructions per ns, we set the limit to 5 million instructions. The actual "instructions number" doesn't matter to setting a "time target".
  2. That said, it's a little confusing to think about and might be misleading to users to see "virtual instruction counts" that are much higher than the (unknown but plausibly estimable) true instruction count for the machine they're on. So for the sake of not confusing people, it'd be good to calibrate the model instruction counts to their values as measured on x86-64 machines, because we expect most validators to be on that arch.
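
The arithmetic in point 1 can be sketched as follows. This is only an illustration: the function name `instruction_limit` and the constant `ONE_MS_IN_NS` are made up for this example and are not part of rs-soroban-env; the rates (20 and 5 instructions per ns) are the ones quoted above.

```rust
/// Hypothetical helper (not an rs-soroban-env API): derive an instruction
/// limit from an observed rate of virtual instructions per nanosecond and
/// a wall-clock budget in nanoseconds. Returns `None` if the product
/// overflows `u64`.
fn instruction_limit(instructions_per_ns: u64, time_budget_ns: u64) -> Option<u64> {
    instructions_per_ns.checked_mul(time_budget_ns)
}

fn main() {
    // 1 ms expressed in nanoseconds.
    const ONE_MS_IN_NS: u64 = 1_000_000;

    // The two observed rates used in the example above.
    for rate in [20u64, 5] {
        let limit = instruction_limit(rate, ONE_MS_IN_NS).expect("limit overflowed u64");
        println!("{rate} instr/ns with a 1 ms budget -> limit of {limit} instructions");
    }
}
```

The limit scales linearly with the observed rate, which is why the absolute instruction counts drop out once a time target is fixed.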

So .. I'm going to take this and just run calibration on the x86-64 machine I have here. Doesn't matter what its clock frequency is, we're only talking instruction counts of the cost centers.

graydon commented 1 year ago

Some investigation and results here (I meant to discuss this with @jayz22 but I'll make a note here for future reference too):

jayz22 commented 1 year ago

Posting my calibration results on m1 and x86 (- m1, + x86, full outputs attached below):

-                cost_type     cpu_model_const_param     cpu_model_lin_param     mem_model_const_param     mem_model_lin_param
-             HostMemAlloc                      1123                       1                        16     128
-               HostMemCpy                        32                      24                         0     0
-               HostMemCmp                        24                      64                         0     0
-     DispatchHostFunction                       262                       0                         0     0
-              VisitObject                       158                       0                         0     0
-                   ValSer                       646                      66                        18     384
-                 ValDeser                      1127                      34                        16     128
-        ComputeSha256Hash                      2877                    4125                        40     0
-     ComputeEd25519PubKey                     25640                       0                         0     0
-                 MapEntry                        84                       0                         0     0
-                 VecEntry                        35                       0                         0     0
-         VerifyEd25519Sig                    400983                    2685                         0     0
-                VmMemRead                       182                      24                         0     0
-               VmMemWrite                       178                      25                         0     0
-          VmInstantiation                    916377                   68226                    129471     5080
-         InvokeVmFunction                      1128                       0                        14     0
-     ComputeKeccak256Hash                      2882                    3561                        40     0
- ComputeEcdsaSecp256k1Key                     37899                       0                         0     0
- ComputeEcdsaSecp256k1Sig                       224                       0                         0     0
- RecoverEcdsaSecp256k1Key                   1667731                       0                       201     0
-             Int256AddSub                      1714                       0                       119     0
-                Int256Mul                      2226                       0                       119     0
-                Int256Div                      2332                       0                       119     0
-                Int256Pow                      5223                       0                       119     0
-              Int256Shift                       415                       0                       119     0
-        ChaCha20DrawBytes                      4857                    2461                         0     0

+                cost_type     cpu_model_const_param     cpu_model_lin_param     mem_model_const_param     mem_model_lin_param
+             HostMemAlloc                       310                       0                        16     128
+               HostMemCpy                        52                       0                         0     0
+               HostMemCmp                        55                      36                         0     0
+     DispatchHostFunction                       239                       0                         0     0
+              VisitObject                        34                       0                         0     0
+                   ValSer                       564                       0                        18     384
+                 ValDeser                      1104                       0                        16     128
+        ComputeSha256Hash                      3943                    6812                        40     0
+     ComputeEd25519PubKey                     40356                       0                         0     0
+                 MapEntry                        55                       0                         0     0
+                 VecEntry                         0                       0                         0     0
+         VerifyEd25519Sig                    654651                    4288                         0     0
+                VmMemRead                       210                       0                         0     0
+               VmMemWrite                       209                       0                         0     0
+          VmInstantiation                    459816                   49469                    129471     5080
+         InvokeVmFunction                      1189                       0                        14     0
+     ComputeKeccak256Hash                      4076                    5962                        40     0
+ ComputeEcdsaSecp256k1Key                     58314                       0                         0     0
+ ComputeEcdsaSecp256k1Sig                       249                       0                         0     0
+ RecoverEcdsaSecp256k1Key                   2323402                       0                       181     0
+             Int256AddSub                      1620                       0                        99     0
+                Int256Mul                      2209                       0                        99     0
+                Int256Div                      2150                       0                        99     0
+                Int256Pow                      3925                       0                        99     0
+              Int256Shift                       379                       0                        99     0
+        ChaCha20DrawBytes                      2155                    1051                         0     0

The main difference is, as @graydon pointed out, that the memory-related operations appear to have constant costs (with a larger constant factor) on x86. I believe this is what you are talking about?
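
To make the difference concrete, here is a rough sketch that evaluates the HostMemCpy CPU parameters from the two tables for a few input sizes. It assumes the plain form `cost = const_param + lin_param * input` with no scaling of the linear term; the real host implementation may scale or round differently. The `LinearCost` type and its `eval` method are hypothetical names for this example only.

```rust
/// Hypothetical container for one row of the tables above:
/// the CPU constant and linear parameters of a cost type.
#[derive(Clone, Copy)]
struct LinearCost {
    const_param: u64,
    lin_param: u64,
}

impl LinearCost {
    /// ASSUMPTION: cost = const_param + lin_param * input, unscaled.
    /// Saturating arithmetic keeps the sketch free of overflow panics.
    fn eval(&self, input: u64) -> u64 {
        self.const_param
            .saturating_add(self.lin_param.saturating_mul(input))
    }
}

// HostMemCpy CPU parameters copied from the m1 (-) and x86 (+) tables.
const HOST_MEM_CPY_M1: LinearCost = LinearCost { const_param: 32, lin_param: 24 };
const HOST_MEM_CPY_X86: LinearCost = LinearCost { const_param: 52, lin_param: 0 };

fn main() {
    for n in [0u64, 64, 1024] {
        println!(
            "HostMemCpy input={n}: m1={} x86={}",
            HOST_MEM_CPY_M1.eval(n),
            HOST_MEM_CPY_X86.eval(n)
        );
    }
}
```

Under that assumption the m1 cost grows with the input size while the x86 cost stays at its constant term, which is the pattern described above.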

I think the analytical approach makes sense. I've noticed that some of the memory-related calibration results are pretty sensitive to the sample size (e.g. VecEntry #1051 ) and I haven't found a good way to get around that.

Re: cost type consolidation, I think it makes sense to consolidate some of those types, especially the {host, vm} mem-cmp/cpy/read/write ones. I will look into it further.

output_m1.txt output_x86.txt

(A bit of extra information: my x86 CPU is an Intel 2012Q2 model, with the AVX (not AVX2) extension.)

jayz22 commented 1 year ago

Re: cost type consolidation and using analytical model

This is a very crude analysis and it stretches my low-level knowledge a bit. @graydon let me know what you think.

jayz22 commented 1 year ago

ValSer and ValDeser are clearly different from a simple memcpy once deep structure nesting is taken into account. See https://github.com/stellar/rs-soroban-env/issues/1102

jayz22 commented 1 year ago

Re: cost type consolidation

WasmMemAlloc can be removed now (use HostMemAlloc instead), since we have moved away from the memory fuel concept and all memory allocation is now done on the host side via ResourceLimiter.

jayz22 commented 1 year ago

Just had a conversation with @MonsieurNicolas. He expressed concerns about the calibration numbers not being accurate and reproducible due to advanced instruction sets (e.g. AVX, AVX2). While the first-principle models for mem copy work, AVX might be messing with other calibration numbers on x86_64. So in order to have more confidence in the calibration numbers and improve reproducibility, he has suggested:

I will give it a try.

graydon commented 1 year ago

hmm. avx2 is 10 years old, there's nothing in the field that doesn't speak avx2. I am not sure this is really related to the constant-factor-ness of our measurements on those machines -- if we really want to correct that fact I think we should figure out why it's happening rather than just fiddling with codegen options (which none of our users will fiddle with anyways)