Closed: jayz22 closed this 1 year ago
I talked to @anupsdf about this and we concluded two points:
So .. I'm going to take this and just run calibration on the x86-64 machine I have here. Doesn't matter what its clock frequency is, we're only talking instruction counts of the cost centers.
Some investigation and results here (I meant to discuss this with @jayz22 but I'll make a note here for future reference too):
Posting my calibration results on m1 and x86 (- m1, + x86, full outputs attached below):
- cost_type cpu_model_const_param cpu_model_lin_param mem_model_const_param mem_model_lin_param
- HostMemAlloc 1123 1 16 128
- HostMemCpy 32 24 0 0
- HostMemCmp 24 64 0 0
- DispatchHostFunction 262 0 0 0
- VisitObject 158 0 0 0
- ValSer 646 66 18 384
- ValDeser 1127 34 16 128
- ComputeSha256Hash 2877 4125 40 0
- ComputeEd25519PubKey 25640 0 0 0
- MapEntry 84 0 0 0
- VecEntry 35 0 0 0
- VerifyEd25519Sig 400983 2685 0 0
- VmMemRead 182 24 0 0
- VmMemWrite 178 25 0 0
- VmInstantiation 916377 68226 129471 5080
- InvokeVmFunction 1128 0 14 0
- ComputeKeccak256Hash 2882 3561 40 0
- ComputeEcdsaSecp256k1Key 37899 0 0 0
- ComputeEcdsaSecp256k1Sig 224 0 0 0
- RecoverEcdsaSecp256k1Key 1667731 0 201 0
- Int256AddSub 1714 0 119 0
- Int256Mul 2226 0 119 0
- Int256Div 2332 0 119 0
- Int256Pow 5223 0 119 0
- Int256Shift 415 0 119 0
- ChaCha20DrawBytes 4857 2461 0 0
+ cost_type cpu_model_const_param cpu_model_lin_param mem_model_const_param mem_model_lin_param
+ HostMemAlloc 310 0 16 128
+ HostMemCpy 52 0 0 0
+ HostMemCmp 55 36 0 0
+ DispatchHostFunction 239 0 0 0
+ VisitObject 34 0 0 0
+ ValSer 564 0 18 384
+ ValDeser 1104 0 16 128
+ ComputeSha256Hash 3943 6812 40 0
+ ComputeEd25519PubKey 40356 0 0 0
+ MapEntry 55 0 0 0
+ VecEntry 0 0 0 0
+ VerifyEd25519Sig 654651 4288 0 0
+ VmMemRead 210 0 0 0
+ VmMemWrite 209 0 0 0
+ VmInstantiation 459816 49469 129471 5080
+ InvokeVmFunction 1189 0 14 0
+ ComputeKeccak256Hash 4076 5962 40 0
+ ComputeEcdsaSecp256k1Key 58314 0 0 0
+ ComputeEcdsaSecp256k1Sig 249 0 0 0
+ RecoverEcdsaSecp256k1Key 2323402 0 181 0
+ Int256AddSub 1620 0 99 0
+ Int256Mul 2209 0 99 0
+ Int256Div 2150 0 99 0
+ Int256Pow 3925 0 99 0
+ Int256Shift 379 0 99 0
+ ChaCha20DrawBytes 2155 1051 0 0
The main difference, as @graydon pointed out, is that the memory-related operations appear to have constant cost (with a larger constant factor) on x86. I believe this is what you were talking about?
I think the analytical approach makes sense. I've noticed some of those memory-related calibration results are pretty sensitive to the size of the sample (e.g. VecEntry, #1051) and haven't found a good way to get around that.
Re: cost type consolidation, I think it makes sense to consolidate some of those types, especially the {host, vm} mem-cmp/cpy/read/write ones. I will look into it further.
(A bit of extra information: my x86 CPU is an Intel 2012Q2 model, with the AVX (not AVX2) extension.)
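To make the observation about vanishing linear terms concrete, here is a minimal sketch that lists the cost types whose linear CPU coefficient is nonzero on M1 but zero on x86. The (const, lin) pairs are copied from the calibration output above; the helper name `lost_linear_term` is made up for illustration and is not part of the calibration tooling.

```python
# (cpu_model_const_param, cpu_model_lin_param) copied from the tables above.
# Only the memory-related cost types are included here.
M1 = {
    "HostMemAlloc": (1123, 1),
    "HostMemCpy": (32, 24),
    "HostMemCmp": (24, 64),
    "ValSer": (646, 66),
    "ValDeser": (1127, 34),
    "VmMemRead": (182, 24),
    "VmMemWrite": (178, 25),
}
X86 = {
    "HostMemAlloc": (310, 0),
    "HostMemCpy": (52, 0),
    "HostMemCmp": (55, 36),
    "ValSer": (564, 0),
    "ValDeser": (1104, 0),
    "VmMemRead": (210, 0),
    "VmMemWrite": (209, 0),
}


def lost_linear_term(a, b):
    """Hypothetical helper: cost types whose linear coefficient is
    nonzero in `a` but zero in `b`, sorted by name."""
    return sorted(name for name in a if a[name][1] != 0 and b[name][1] == 0)


if __name__ == "__main__":
    for name in lost_linear_term(M1, X86):
        print(name, "m1:", M1[name], "x86:", X86[name])
```

Of the memory-related types listed, only HostMemCmp keeps a nonzero linear coefficient on both machines.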
Re: cost type consolidation and using analytical model
- HostMemCpy vs HostMemCmp: from what I understand (also from calibrated results), memcmp requires loading values from two memory locations and comparing them (2 MOV + 1 CMP). memcpy is logically just 1 MOV. So they should probably be two different analytical models, with the linear coefficient of memcmp being 3x larger (which also somewhat matches calibration results)?
- VmMemRead and VmMemWrite: I think these can be consolidated into HostMemCpy, since underneath it is just doing copy_from_slice (plus some small overhead of resolving the memory entity).
- VecEntry and MapEntry: these two are just memory accesses and can probably be consolidated into HostMemCpy. Although I'm not sure the coefficients should be the same, since there is a bit of extra container logic like index bounds checking. (Calibration numbers do not provide good guidance here. See https://github.com/stellar/rs-soroban-env/issues/1051)
- ValSer and ValDeser: these are the ones I'm least sure about. Logically they are also just doing memory copying. However, there can be a fair amount of overhead due to XDR structuring and recursion? Looking at the M1 results, at least the linear coefficients are comparable to HostMemCpy/Cmp.

This is a very crude analysis and it stretches my low-level knowledge a bit. @graydon let me know what you think.
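The memcpy vs memcmp reasoning above can be sketched as a pair of linear cost models. This is only an illustration: the `LinearCost` class, the zero constant terms, and the choice of "one word" as the input unit are assumptions made here, and the calibrated parameters in the tables may use a different unit or scaling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearCost:
    """Hypothetical linear cost model: cost(n) = const + lin * n."""
    const: int  # fixed overhead per call, in instructions
    lin: int    # instructions per unit of input (assumed unit: one word)

    def evaluate(self, n: int) -> int:
        return self.const + self.lin * n


# First-principles coefficients taken from the reasoning above.
# The constant terms are set to zero purely for illustration.
MEMCPY = LinearCost(const=0, lin=1)  # 1 MOV per word
MEMCMP = LinearCost(const=0, lin=3)  # 2 MOV + 1 CMP per word

if __name__ == "__main__":
    for n in (1, 64, 4096):
        print(n, MEMCPY.evaluate(n), MEMCMP.evaluate(n))
```

With nonzero constant terms the ratio between the two models only approaches 3 as the input grows, which is one reason small-sample calibration may not show it cleanly.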
ValSer, ValDeser are clearly different from simple memcpy after taking into account deep structure nesting. See https://github.com/stellar/rs-soroban-env/issues/1102
Re: cost type consolidation
WasmMemAlloc can be removed now (use HostMemAlloc instead), since we have moved away from the memory fuel concept and all memory allocation is now done on the host side via ResourceLimiter.
Just had a conversation with @MonsieurNicolas. He expressed concerns about calibration numbers not being accurate and reproducible due to advanced instruction set extensions (e.g. AVX, AVX2). While the first-principle models for mem copy work, AVX might be messing with other calibration numbers on x86_64. So in order to have more confidence in the calibration numbers and improve reproducibility, he has suggested:
march=x86-64, more info can be found here. This will still include some extensions such as MMX and SSE, but hopefully (1) they don't mess with calibration results too much (i.e. they preserve the correct linear characteristics) and (2) they are ubiquitous enough that every node should have them.
I will give it a try.
hmm. avx2 is 10 years old, there's nothing in the field that doesn't speak avx2. I am not sure this is really related to the constant-factor-ness of our measurements on those machines -- if we really want to correct that fact I think we should figure out why it's happening rather than just fiddling with codegen options (which none of our users will fiddle with anyways)
What
Calibrate the CPU instructions on a variety of hardware that the validators run on.
Why
The metering model is deterministic across all nodes. The model is currently calibrated on a single machine (M1), which may differ from the actual hardware that validators use. This can make the actual compute time vary for the same number of CPU instructions, which could affect ledger close time. The network resource limits need to be set conservatively w.r.t. the worst case. We need to calibrate on various hardware architectures in order to figure out the correct bounds.
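One possible reading of "conservative w.r.t. the worst case" is to take, for each cost type, the largest calibrated parameter seen on any machine. The sketch below does that for a few (const, lin) CPU pairs copied from the M1 and x86 results in this thread. The `worst_case` helper is hypothetical, and the issue does not say that bounds are actually derived this way.

```python
# (cpu_model_const_param, cpu_model_lin_param) copied from the results above.
M1 = {
    "ComputeSha256Hash": (2877, 4125),
    "VmInstantiation": (916377, 68226),
    "ChaCha20DrawBytes": (4857, 2461),
}
X86 = {
    "ComputeSha256Hash": (3943, 6812),
    "VmInstantiation": (459816, 49469),
    "ChaCha20DrawBytes": (2155, 1051),
}


def worst_case(*machines):
    """Hypothetical helper: element-wise maximum of (const, lin)
    across all machines that report a given cost type."""
    names = set().union(*machines)
    return {
        name: tuple(
            max(m[name][i] for m in machines if name in m) for i in range(2)
        )
        for name in sorted(names)
    }


if __name__ == "__main__":
    for name, params in worst_case(M1, X86).items():
        print(name, params)
```

Note that neither machine dominates: x86 gives the larger numbers for the hash, while M1 gives the larger numbers for VmInstantiation and ChaCha20DrawBytes.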