near / nearcore

Reference client for NEAR Protocol

Perf idea: better runtime support for contracts that use less memory #11033

Closed: akhi3030 closed this issue 4 months ago

akhi3030 commented 5 months ago

Our runtime today starts contracts with a memory of 64 MiB regardless of what the contract declares. The contract can then grow that number up to 128 MiB. We suspect that many contracts actually use much less memory than this 64 MiB. If someone were to write a contract that uses as little memory as possible (by specifying a small initial memory and calling memory.grow as appropriate), then maybe we could optimise our runtime to execute such contracts more efficiently.
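
For illustration (a hedged sketch of the contract side, not taken from this issue; the helper name is made up), a Rust contract can keep its declared memory small and grow it on demand through the memory.grow instruction, which Rust exposes as the core::arch::wasm32::memory_grow intrinsic:

    // Hypothetical helper: grow wasm linear memory only when needed,
    // instead of relying on a large up-front allocation.
    #[cfg(target_arch = "wasm32")]
    fn ensure_capacity(needed_bytes: usize) -> Result<(), ()> {
        use core::arch::wasm32::{memory_grow, memory_size};
        const WASM_PAGE: usize = 64 * 1024;

        // memory_size/memory_grow operate on memory index 0, in 64 KiB pages.
        let current_bytes = memory_size::<0>() * WASM_PAGE;
        if needed_bytes > current_bytes {
            let extra_pages = (needed_bytes - current_bytes).div_ceil(WASM_PAGE);
            // memory_grow returns usize::MAX when the grow request fails.
            if memory_grow::<0>(extra_pages) == usize::MAX {
                return Err(());
            }
        }
        Ok(())
    }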

See also #10851 for a preliminary investigation on the matter.

Ekleog-NEAR commented 5 months ago

What we could do is cap the initial memory size at 64 MiB, and use the contract-provided initial memory size when it is below that. The good thing is that, so long as the contract is properly compiled, this should not break any contract. The main drawback is implementation complexity (we currently basically never read the value we would be replacing, meaning we can’t hack around it to measure performance first). Also, due to the huge slowdown of zeroing out large memories, we would need some way to unmap/remap the memory pages when the memory actually used is too large, instead of zeroing it out, or we’d be vulnerable to huge undercharging attacks.

So, to keep in mind if/when we come back to this:

Ekleog-NEAR commented 5 months ago

Current plan:

  1. Dump the larger contracts, verifying what initial memory size they declare
  2. Crunch some numbers to check whether that size would make sense to zero out manually
  3. If yes, actually implement the full feature

Ekleog-NEAR commented 5 months ago

It looks like all the contracts from our top 10 most gas-using accounts request exactly 17 pages of initial wasm memory. My guess is that this is the current minimum implied by rustc, assuming the contracts are all written in Rust; 17 pages is 1088 KiB, which would match rustc’s default 1 MiB wasm stack plus a page of static data.
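
For reference, the declared size can be read straight out of a contract’s memory section; a minimal sketch, assuming a recent wasmparser release (nearcore pins an older version whose types differ slightly, and prepared contracts carry the memory as an import instead, as the patch below handles):

    // Returns the initial memory size, in wasm pages, declared by the first
    // memory in the module, if any.
    fn declared_initial_pages(
        wasm: &[u8],
    ) -> Result<Option<u64>, wasmparser::BinaryReaderError> {
        for payload in wasmparser::Parser::new(0).parse_all(wasm) {
            if let wasmparser::Payload::MemorySection(section) = payload? {
                for mem in section {
                    return Ok(Some(mem?.initial));
                }
            }
        }
        Ok(None)
    }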

So I reran the experiment from #10851, except with a 20-page hardcoded initial memory at the 3 places where the #10851 protocol-change patch had a hardcoded 1. This is still very much a hack, but it should be enough to get at least a rough answer as to whether there are actually gains to be hoped for by implementing the full feature.

The results are as follows:

So there do seem to be some pretty significant gains to be had here. However, we will need a very good plan for testing the changes, because the results from shards 2 and 3 prove that there are lots of risks if the change is ill-implemented (like the hackish change used to test the potential gains).

Current plan:

  1. Actually implement the full feature
  2. Check that it actually makes things faster
  3. Somehow determine the contract memory size at which zeroing out the memory actually becomes worse than re-mmap’ing it (see the sketch after this list)
  4. Update the estimator so that it gives numbers for contracts that actually use the full 64MiB memory, to not underestimate the worst-case changes
  5. (Should be tracked in a future issue:) Add a separate fee for initializing contract memory, proportional to the contract-requested initial memory, to incentivize contracts to lower that?
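
For step 3, a micro-benchmark along the following lines could help find the crossover point (a rough sketch, assuming Linux and the libc crate; sizes and iteration counts are arbitrary):

    // Compare resetting a mapping by manually zeroing it against dropping it
    // and taking a fresh mmap, where the kernel hands back already-zeroed
    // pages. Both variants dirty the pages first, mimicking a contract run.
    use std::time::Instant;

    fn bench(size: usize, iters: u32) {
        unsafe {
            // Strategy A: keep the mapping, write zeroes over it each time.
            let p = libc::mmap(
                std::ptr::null_mut(),
                size,
                libc::PROT_READ | libc::PROT_WRITE,
                libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
                -1,
                0,
            );
            assert_ne!(p, libc::MAP_FAILED);
            let t = Instant::now();
            for _ in 0..iters {
                std::ptr::write_bytes(p as *mut u8, 1, size); // dirty the pages
                std::ptr::write_bytes(p as *mut u8, 0, size); // manual reset
            }
            println!("zeroing {size:>9} B: {:?}/iter", t.elapsed() / iters);
            libc::munmap(p, size);

            // Strategy B: unmap and re-mmap each time; the reset is implicit.
            let t = Instant::now();
            for _ in 0..iters {
                let p = libc::mmap(
                    std::ptr::null_mut(),
                    size,
                    libc::PROT_READ | libc::PROT_WRITE,
                    libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
                    -1,
                    0,
                );
                assert_ne!(p, libc::MAP_FAILED);
                std::ptr::write_bytes(p as *mut u8, 1, size); // dirty the pages
                libc::munmap(p, size);
            }
            println!("remap   {size:>9} B: {:?}/iter", t.elapsed() / iters);
        }
    }

    fn main() {
        for pages in [17usize, 64, 256, 1024] {
            bench(pages * 64 * 1024, 100);
        }
    }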

Ekleog-NEAR commented 4 months ago

TL;DR

Benchmark results of the full implementation on our current workload

I finished implementing the full change (minus the "protocol version bump" part of it) and benchmarked what the speedup would be if it were fully activated. The results are unfortunately what I initially feared: the time spent in mmap is actually mostly spent zeroing out memory, and thus we cannot optimize it away.

Actually, this change results in a 20-30% slowdown even in the best-case scenario where we know ahead of time what memory our contracts request. The hoped-for speedup suggested by the hackish experiment was actually an artifact of its hackishness: it likely did not properly reset the whole memory, or simply caused contract crashes.

Further work that could happen on this topic

We could maybe hope for some improvements if contracts did active work to reduce their initial memory requirements, but they currently all request ~1 MiB, due to rustc defaults. We could ask contract authors to compile with -C link-arg=-zstack-size=16 or similar to reduce the initial stack size further.
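
If we went down that route, the flag could for instance be set in the contract crate’s .cargo/config.toml (illustrative values only; picking the right stack size is exactly the open question):

    # Illustrative: shrink the wasm stack below rustc’s 1 MiB default.
    [target.wasm32-unknown-unknown]
    rustflags = ["-C", "link-arg=-zstack-size=16384"]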

Then we could maybe hope for the ~10% performance win that could be seen. Or maybe we would still see the 20-30% slowdown that we saw with 17 initial pages. To test this, we would need a reproducible workload on a contract that we control and can compile with varying numbers of memory pages.

To properly implement this, we would also need some incentive for contract authors to care about this rustc flag, and thus a new gas cost that depends on the size of the initially-requested memory. This would, again, be quite a lot of additional effort.

I suggested dropping this idea before we even started working on it, because the Linux kernel is generally well-implemented and has access to more information than we could have: it knows which pages have actually been touched. So I’m once more suggesting that we drop the idea, because mmap should have minimal overhead besides the memory zeroing itself, and we cannot simply skip the memory zeroing.

If we ever have such a reproducible benchmark ready, then it might make sense to at least check whether the patch brings any benefits. But until then, I don’t think it makes sense to spend more time on this optimization idea.

Detailed results

Control group

The control group, neard-base in the benchmark results, is based on nearcore at commit c3ad36fa73524db4a3a536a49e1096b2873f00c7. This commit is itself the master branch at commit a9ddf9c63bee66fc571aac3ad294fbfc31cbaabe, plus a prerequisite PR that should have no runtime (and thus no performance) impact.

Patch

commit d91e1d9bd71fc3f544e2d3b3c2dcb0a65542fc24
Author: Léo Gaspard <leo@near.org>
Date:   Wed Apr 24 10:34:23 2024 +0000

    wip

diff --git a/Cargo.lock b/Cargo.lock
index 9b7f3bf9d..674bcf206 100644
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -4929,11 +4929,13 @@ dependencies = [
  "borsh 1.0.0",
  "bytesize",
  "cov-mark",
+ "crossbeam",
  "ed25519-dalek",
  "enum-map",
  "expect-test",
  "finite-wasm",
  "hex",
+ "lazy_static",
  "lru 0.12.3",
  "memoffset 0.8.0",
  "near-crypto",
diff --git a/core/parameters/src/parameter_table.rs b/core/parameters/src/parameter_table.rs
index b4d68efb2..0b9a8fb5c 100644
--- a/core/parameters/src/parameter_table.rs
+++ b/core/parameters/src/parameter_table.rs
@@ -326,6 +326,7 @@ impl TryFrom<&ParameterTable> for RuntimeConfig {
                 function_call_weight: params.get(Parameter::FunctionCallWeight)?,
                 eth_implicit_accounts: params.get(Parameter::EthImplicitAccounts)?,
                 yield_resume_host_functions: params.get(Parameter::YieldResume)?,
+                lower_initial_contract_memory: true, // TODO: protocol change
             },
             account_creation_config: AccountCreationConfig {
                 min_allowed_top_level_account_length: params
diff --git a/core/parameters/src/view.rs b/core/parameters/src/view.rs
index 39487c249..06a6d656c 100644
--- a/core/parameters/src/view.rs
+++ b/core/parameters/src/view.rs
@@ -227,6 +227,8 @@ pub struct VMConfigView {
     pub eth_implicit_accounts: bool,
     /// See [`VMConfig::yield_resume_host_functions`].
     pub yield_resume_host_functions: bool,
+    /// See [`VMConfig::lower_initial_contract_memory`].
+    pub lower_initial_contract_memory: bool,

     /// Describes limits for VM and Runtime.
     ///
@@ -253,6 +255,7 @@ impl From<crate::vm::Config> for VMConfigView {
             vm_kind: config.vm_kind,
             eth_implicit_accounts: config.eth_implicit_accounts,
             yield_resume_host_functions: config.yield_resume_host_functions,
+            lower_initial_contract_memory: config.lower_initial_contract_memory,
         }
     }
 }
@@ -275,6 +278,7 @@ impl From<VMConfigView> for crate::vm::Config {
             vm_kind: view.vm_kind,
             eth_implicit_accounts: view.eth_implicit_accounts,
             yield_resume_host_functions: view.yield_resume_host_functions,
+            lower_initial_contract_memory: view.lower_initial_contract_memory,
         }
     }
 }
diff --git a/core/parameters/src/vm.rs b/core/parameters/src/vm.rs
index f2f7a46af..7c721e60f 100644
--- a/core/parameters/src/vm.rs
+++ b/core/parameters/src/vm.rs
@@ -192,6 +192,9 @@ pub struct Config {
     /// Enable the `promise_yield_create` and `promise_yield_resume` host functions.
     pub yield_resume_host_functions: bool,

+    /// Enable the `LowerInitialContractMemory` protocol feature.
+    pub lower_initial_contract_memory: bool,
+
     /// Describes limits for VM and Runtime.
     pub limit_config: LimitConfig,
 }
@@ -224,6 +227,7 @@ impl Config {
         self.ed25519_verify = true;
         self.math_extension = true;
         self.implicit_account_creation = true;
+        self.lower_initial_contract_memory = true;
     }
 }

diff --git a/runtime/near-vm-runner/Cargo.toml b/runtime/near-vm-runner/Cargo.toml
index 042bd0bab..2e5910369 100644
--- a/runtime/near-vm-runner/Cargo.toml
+++ b/runtime/near-vm-runner/Cargo.toml
@@ -17,9 +17,11 @@ anyhow = { workspace = true, optional = true }
 base64.workspace = true
 bn.workspace = true
 borsh.workspace = true
+crossbeam.workspace = true
 ed25519-dalek.workspace = true
 enum-map.workspace = true
 finite-wasm = { workspace = true, features = ["instrument"], optional = true }
+lazy_static.workspace = true
 lru = "0.12.3"
 memoffset.workspace = true
 num-rational.workspace = true
diff --git a/runtime/near-vm-runner/src/cache.rs b/runtime/near-vm-runner/src/cache.rs
index 11b238a66..ae4c19c25 100644
--- a/runtime/near-vm-runner/src/cache.rs
+++ b/runtime/near-vm-runner/src/cache.rs
@@ -80,6 +80,7 @@ impl CompiledContract {
 #[derive(Debug, Clone, PartialEq, BorshDeserialize, BorshSerialize)]
 pub struct CompiledContractInfo {
     pub wasm_bytes: u64,
+    pub initial_memory_pages: u32,
     pub compiled: CompiledContract,
 }

@@ -314,6 +315,7 @@ impl ContractRuntimeCache for FilesystemContractRuntimeCache {
             }
         }
         temp_file.write_all(&value.wasm_bytes.to_le_bytes())?;
+        temp_file.write_all(&value.initial_memory_pages.to_le_bytes())?;
         let temp_filename = temp_file.into_temp_path();
         // This is atomic, so there wouldn't be instances where getters see an intermediate state.
         rustix::fs::renameat(&self.state.dir, &*temp_filename, &self.state.dir, final_filename)?;
@@ -351,15 +353,17 @@ impl ContractRuntimeCache for FilesystemContractRuntimeCache {
             // The file turns out to be empty/truncated? Treat as if there's no cached file.
             return Ok(None);
         }
-        let wasm_bytes = u64::from_le_bytes(buffer[buffer.len() - 8..].try_into().unwrap());
-        let tag = buffer[buffer.len() - 9];
-        buffer.truncate(buffer.len() - 9);
+        let initial_memory_pages = u32::from_le_bytes(buffer[buffer.len() - 4..].try_into().unwrap());
+        let wasm_bytes = u64::from_le_bytes(buffer[buffer.len() - 12..buffer.len() - 4].try_into().unwrap());
+        let tag = buffer[buffer.len() - 13];
+        buffer.truncate(buffer.len() - 13);
         Ok(match tag {
             CODE_TAG => {
-                Some(CompiledContractInfo { wasm_bytes, compiled: CompiledContract::Code(buffer) })
+                Some(CompiledContractInfo { wasm_bytes, initial_memory_pages, compiled: CompiledContract::Code(buffer) })
             }
             ERROR_TAG => Some(CompiledContractInfo {
                 wasm_bytes,
+                initial_memory_pages,
                 compiled: CompiledContract::CompileModuleError(borsh::from_slice(&buffer)?),
             }),
             // File is malformed? For this code, since we're talking about a cache lets just treat
diff --git a/runtime/near-vm-runner/src/near_vm_runner/memory.rs b/runtime/near-vm-runner/src/near_vm_runner/memory.rs
index 7d8b7f570..37432de63 100644
--- a/runtime/near-vm-runner/src/near_vm_runner/memory.rs
+++ b/runtime/near-vm-runner/src/near_vm_runner/memory.rs
@@ -26,7 +26,6 @@ impl NearVmMemory {
         )?)))
     }

-    #[cfg(unused)] // TODO: this will be used once we reuse the memories
     pub fn into_preallocated(self) -> Result<PreallocatedMemory, String> {
         Ok(PreallocatedMemory(
             Arc::into_inner(self.0)
diff --git a/runtime/near-vm-runner/src/near_vm_runner/mod.rs b/runtime/near-vm-runner/src/near_vm_runner/mod.rs
index f2c7b48b5..331d2e407 100644
--- a/runtime/near-vm-runner/src/near_vm_runner/mod.rs
+++ b/runtime/near-vm-runner/src/near_vm_runner/mod.rs
@@ -38,7 +38,7 @@ enum NearVmCompiler {
 //  major version << 6
 //  minor version
 const VM_CONFIG: NearVmConfig = NearVmConfig {
-    seed: (2 << 29) | (2 << 6) | 2,
+    seed: (2 << 29) | (2 << 6) | 15,
     engine: NearVmEngine::Universal,
     compiler: NearVmCompiler::Singlepass,
 };
diff --git a/runtime/near-vm-runner/src/near_vm_runner/runner.rs b/runtime/near-vm-runner/src/near_vm_runner/runner.rs
index eb0cdf12d..2e6f89e35 100644
--- a/runtime/near-vm-runner/src/near_vm_runner/runner.rs
+++ b/runtime/near-vm-runner/src/near_vm_runner/runner.rs
@@ -8,6 +8,7 @@ use crate::logic::errors::{
 use crate::logic::gas_counter::FastGasCounter;
 use crate::logic::types::PromiseResult;
 use crate::logic::{Config, External, VMContext, VMLogic, VMOutcome};
+use crate::near_vm_runner::memory::PreallocatedMemory;
 use crate::near_vm_runner::{NearVmCompiler, NearVmEngine};
 use crate::runner::VMResult;
 use crate::{
@@ -160,7 +161,7 @@ impl NearVM {
     pub(crate) fn compile_uncached(
         &self,
         code: &ContractCode,
-    ) -> Result<UniversalExecutable, CompilationError> {
+    ) -> Result<(u32, UniversalExecutable), CompilationError> {
         let _span = tracing::debug_span!(target: "vm", "NearVM::compile_uncached").entered();
         let prepared_code = prepare::prepare_contract(code.code(), &self.config, VMKind::NearVm)
             .map_err(CompilationError::PrepareError)?;
@@ -169,6 +170,7 @@ impl NearVM {
             matches!(self.engine.validate(&prepared_code), Ok(_)),
             "near_vm failed to validate the prepared code"
         );
+        let initial_memory_pages = self.initial_memory_pages_for(&prepared_code)?;
         let executable = self
             .engine
             .compile_universal(&prepared_code, &self)
@@ -176,19 +178,20 @@ impl NearVM {
                 tracing::error!(?err, "near_vm failed to compile the prepared code (this is defense-in-depth, the error was recovered from but should be reported to pagoda)");
                 CompilationError::WasmerCompileError { msg: err.to_string() }
             })?;
-        Ok(executable)
+        Ok((initial_memory_pages, executable))
     }

     fn compile_and_cache(
         &self,
         code: &ContractCode,
         cache: &dyn ContractRuntimeCache,
-    ) -> Result<Result<UniversalExecutable, CompilationError>, CacheError> {
+    ) -> Result<Result<(u32, UniversalExecutable), CompilationError>, CacheError> {
         let executable_or_error = self.compile_uncached(code);
         let key = get_contract_cache_key(*code.hash(), &self.config);
         let record = CompiledContractInfo {
             wasm_bytes: code.code().len() as u64,
-            compiled: match &executable_or_error {
+            initial_memory_pages: executable_or_error.as_ref().map(|r| r.0).unwrap_or(self.config.limit_config.initial_memory_pages),
+            compiled: match executable_or_error.as_ref().map(|r| &r.1) {
                 Ok(executable) => {
                     let code = executable
                         .serialize()
@@ -220,10 +223,10 @@ impl NearVM {
         method_name: &str,
         closure: impl FnOnce(VMMemory, VMLogic<'_>, &VMArtifact) -> Result<VMOutcome, VMRunnerError>,
     ) -> VMResult<VMOutcome> {
-        // (wasm code size, compilation result)
-        type MemoryCacheType = (u64, Result<VMArtifact, CompilationError>);
+        // (wasm code size, initial memory pages, compilation result)
+        type MemoryCacheType = (u64, u32, Result<VMArtifact, CompilationError>);
         let to_any = |v: MemoryCacheType| -> Box<dyn std::any::Any + Send> { Box::new(v) };
-        let (wasm_bytes, artifact_result) = cache.memory_cache().try_lookup(
+        let (wasm_bytes, initial_memory_pages, artifact_result) = cache.memory_cache().try_lookup(
             code_hash,
             || match code {
                 None => {
@@ -243,9 +246,9 @@ impl NearVM {
                     };

                     match &code.compiled {
-                        CompiledContract::CompileModuleError(err) => {
-                            Ok::<_, VMRunnerError>(to_any((code.wasm_bytes, Err(err.clone()))))
-                        }
+                        CompiledContract::CompileModuleError(err) => Ok::<_, VMRunnerError>(
+                            to_any((code.wasm_bytes, code.initial_memory_pages, Err(err.clone()))),
+                        ),
                         CompiledContract::Code(serialized_module) => {
                             let _span =
                                 tracing::debug_span!(target: "vm", "NearVM::load_from_fs_cache")
@@ -269,7 +272,11 @@ impl NearVM {
                                     .load_universal_executable_ref(&executable)
                                     .map(Arc::new)
                                     .map_err(|err| VMRunnerError::LoadingError(err.to_string()))?;
-                                Ok(to_any((code.wasm_bytes, Ok(artifact))))
+                                Ok(to_any((
+                                    code.wasm_bytes,
+                                    code.initial_memory_pages,
+                                    Ok(artifact),
+                                )))
                             }
                         }
                     }
@@ -277,34 +284,39 @@ impl NearVM {
                 Some(code) => {
                     let _span =
                         tracing::debug_span!(target: "vm", "NearVM::build_from_source").entered();
-                    Ok(to_any((
-                        code.code().len() as u64,
+                    let (initial_memory_pages, compiled) =
                         match self.compile_and_cache(code, cache)? {
-                            Ok(executable) => Ok(self
-                                .engine
-                                .load_universal_executable(&executable)
-                                .map(Arc::new)
-                                .map_err(|err| VMRunnerError::LoadingError(err.to_string()))?),
-                            Err(err) => Err(err),
-                        },
-                    )))
+                            Ok((initial_memory_pages, executable)) => (
+                                initial_memory_pages,
+                                Ok(self
+                                    .engine
+                                    .load_universal_executable(&executable)
+                                    .map(Arc::new)
+                                    .map_err(|err| VMRunnerError::LoadingError(err.to_string()))?),
+                            ),
+                            Err(err) => (0, Err(err)),
+                        };
+                    Ok(to_any((code.code().len() as u64, initial_memory_pages, compiled)))
                 }
             },
             move |value| {
                 let _span =
                     tracing::debug_span!(target: "vm", "NearVM::load_from_mem_cache").entered();
-                let &(wasm_bytes, ref downcast) = value
+                let &(wasm_bytes, initial_memory_pages, ref downcast) = value
                     .downcast_ref::<MemoryCacheType>()
                     .expect("downcast should always succeed");

-                (wasm_bytes, downcast.clone())
+                (wasm_bytes, initial_memory_pages, downcast.clone())
             },
         )?;

+        lazy_static::lazy_static! {
+            static ref MEMORIES: crossbeam::queue::ArrayQueue<PreallocatedMemory> = crossbeam::queue::ArrayQueue::new(16);
+        }
         let mut memory = NearVmMemory::new(
-            self.config.limit_config.initial_memory_pages,
+            initial_memory_pages,
             self.config.limit_config.max_memory_pages,
-            None, // TODO: this should actually reuse the memories
+            MEMORIES.pop(),
         )
         .expect("Cannot create memory for a contract call");
         // FIXME: this mostly duplicates the `run_module` method.
@@ -323,12 +335,38 @@ impl NearVM {
                 if let Err(e) = result {
                     return Ok(VMOutcome::abort(logic, e));
                 }
-                closure(vmmemory, logic, &artifact)
+                let res = closure(vmmemory, logic, &artifact);
+                if let Ok(mmap) = memory.into_preallocated() {
+                    tracing::info!("Reusing a memory");
+                    let _ = MEMORIES.push(mmap);
+                } else {
+                    tracing::error!("Not reusing a memory")
+                }
+                res
             }
             Err(e) => Ok(VMOutcome::abort(logic, FunctionCallError::CompilationError(e))),
         }
     }

+    fn initial_memory_pages_for(&self, code: &[u8]) -> Result<u32, CompilationError> {
+        let parser = wasmparser::Parser::new(0);
+        for payload in parser.parse_all(code) {
+            if let Ok(wasmparser::Payload::ImportSection(reader)) = payload {
+                for mem in reader {
+                    if let Ok(mem) = mem {
+                        if let wasmparser::ImportSectionEntryType::Memory(
+                            wasmparser::MemoryType::M32 { limits, .. },
+                        ) = mem.ty
+                        {
+                            return Ok(limits.initial);
+                        }
+                    }
+                }
+            }
+        }
+        panic!("Tried running a contract that was not prepared with a memory import");
+    }
+
     fn run_method(
         &self,
         artifact: &VMArtifact,
diff --git a/runtime/near-vm-runner/src/prepare/prepare_v2.rs b/runtime/near-vm-runner/src/prepare/prepare_v2.rs
index 894dea6fd..ada52e31b 100644
--- a/runtime/near-vm-runner/src/prepare/prepare_v2.rs
+++ b/runtime/near-vm-runner/src/prepare/prepare_v2.rs
@@ -240,8 +240,37 @@ impl<'a> PrepareContext<'a> {
     }

     fn memory_import(&self) -> wasm_encoder::EntityType {
+        // First, figure out the requested initial memory
+        // This parsing should be fast enough, as it can skip over whole sections
+        // And considering the memory section is after the import section, we will not
+        // have parsed it yet in the "regular" parse when calling this function.
+        let mut requested_initial_memory = None;
+        let parser = wp::Parser::new(0);
+        'outer: for payload in parser.parse_all(self.code) {
+            if let Ok(wp::Payload::MemorySection(reader)) = payload {
+                for mem in reader {
+                    if let Ok(mem) = mem {
+                        if requested_initial_memory.is_some() {
+                            requested_initial_memory = None;
+                            break 'outer;
+                        }
+                        requested_initial_memory = Some(mem.initial);
+                    }
+                }
+            }
+        }
+
+        // Then, generate a memory import, that has at most the limit-configured initial memory,
+        // but tries to get that number down by using the contract-provided data.
+        let max_initial_memory = requested_initial_memory.unwrap_or(u64::MAX);
+        let config_initial_memory = u64::from(self.config.limit_config.initial_memory_pages);
+        let initial_memory = if self.config.lower_initial_contract_memory {
+            std::cmp::min(max_initial_memory, config_initial_memory)
+        } else {
+            config_initial_memory
+        };
         wasm_encoder::EntityType::Memory(wasm_encoder::MemoryType {
-            minimum: u64::from(self.config.limit_config.initial_memory_pages),
+            minimum: initial_memory,
             maximum: Some(u64::from(self.config.limit_config.max_memory_pages)),
             memory64: false,
             shared: false,
diff --git a/runtime/near-vm-runner/src/wasmer2_runner.rs b/runtime/near-vm-runner/src/wasmer2_runner.rs
index d79e6b1cd..c7f952f56 100644
--- a/runtime/near-vm-runner/src/wasmer2_runner.rs
+++ b/runtime/near-vm-runner/src/wasmer2_runner.rs
@@ -300,6 +300,7 @@ impl Wasmer2VM {
         if let Some(cache) = cache {
             let record = CompiledContractInfo {
                 wasm_bytes: code.code().len() as u64,
+                initial_memory_pages: self.config.limit_config.initial_memory_pages,
                 compiled: match &executable_or_error {
                     Ok(executable) => {
                         let code = executable
diff --git a/runtime/near-vm-runner/src/wasmer_runner.rs b/runtime/near-vm-runner/src/wasmer_runner.rs
index 59bf3bf81..f2ea9f06d 100644
--- a/runtime/near-vm-runner/src/wasmer_runner.rs
+++ b/runtime/near-vm-runner/src/wasmer_runner.rs
@@ -268,6 +268,7 @@ impl Wasmer0VM {
         if let Some(cache) = cache {
             let record = CompiledContractInfo {
                 wasm_bytes: code.code().len() as u64,
+                initial_memory_pages: self.config.limit_config.initial_memory_pages,
                 compiled: match &module_or_error {
                     Ok(module) => {
                         let code = module
diff --git a/runtime/near-vm/vm/src/memory/linear_memory.rs b/runtime/near-vm/vm/src/memory/linear_memory.rs
index 6e786e00c..43f84b8bc 100644
--- a/runtime/near-vm/vm/src/memory/linear_memory.rs
+++ b/runtime/near-vm/vm/src/memory/linear_memory.rs
@@ -141,7 +141,7 @@ impl LinearMemory {
         let mapped_pages = memory.minimum;
         let mapped_bytes = mapped_pages.bytes();

-        let alloc = if let Some(alloc) = from_mmap {
+        let alloc = if let Some(mut alloc) = from_mmap {
             // For now we always request the same size, because our prepare step hardcodes a maximum size
             // of 64 MiB. This could change in the future, at which point this assert will start triggering
             // and we’ll need to think of a better way to handle things.
@@ -150,6 +150,7 @@ impl LinearMemory {
                 request_bytes,
                 "Multiple data memory mmap's had different maximal lengths"
             );
+            alloc.make_accessible(mapped_bytes.0).map_err(MemoryError::Region)?;
             alloc
         } else {
             Mmap::accessible_reserved(mapped_bytes.0, request_bytes).map_err(MemoryError::Region)?
diff --git a/runtime/near-vm/vm/src/mmap.rs b/runtime/near-vm/vm/src/mmap.rs
index 9df6e1e65..b5985454b 100644
--- a/runtime/near-vm/vm/src/mmap.rs
+++ b/runtime/near-vm/vm/src/mmap.rs
@@ -227,6 +227,9 @@ impl Mmap {
     pub fn reset(&mut self) -> Result<(), String> {
         unsafe {
             if self.accessible_len > 0 {
+                if self.accessible_len > 18 * 64 * 1024 {
+                    return Err(String::from("too big memories are not worth resetting"));
+                }
                 self.as_mut_ptr().write_bytes(0, self.accessible_len);
                 region::protect(self.as_ptr(), self.accessible_len, region::Protection::NONE)
                     .map_err(|e| e.to_string())?;

Testing

Running the patched neard manually confirmed, via the new tracing logs, that memories are properly reused most of the time. In the benchmark results, 17% of the time is also spent in the write_bytes call that manually resets the memories to 0, validating the hypothesis that resetting the memories takes a long time.

Running with the if self.accessible_len > 18 * 64 * 1024 check (18 WASM pages) replaced by if self.accessible_len > 17 * 64 * 1024 (17 WASM pages) also results in memories not being reused. With this change, the same benchmarks show basically no difference from the control group. This confirms that the additional wasmparser runs do not incur any significant performance penalty.

Given this testing, the benchmark results represent exactly the performance change caused by doing the memory zeroing ourselves instead of relying on the Linux kernel’s zeroing in mmap.

Benchmark results

Shard 0

Shard 0 failed to reproduce blocks, which prevented a proper benchmark comparison. This is likely due to the missing protocol version bump, and it does indicate that finishing the change properly would likely cause real-world contract breakage somewhere on the shard.

Shard 1

On shard 1, performance is basically the same with and without the patch:

Benchmark 1: ./neard-base view-state --readwrite apply-range --shard-id 1 --use-flat-storage benchmarking
  Time (mean ± σ):      2.187 s ±  0.027 s    [User: 4.209 s, System: 0.849 s]
  Range (min … max):    2.158 s …  2.212 s    3 runs

Benchmark 2: ./neard-reuse view-state --readwrite apply-range --shard-id 1 --use-flat-storage benchmarking
  Time (mean ± σ):      2.151 s ±  0.018 s    [User: 4.159 s, System: 0.829 s]
  Range (min … max):    2.135 s …  2.171 s    3 runs

Summary
  ./neard-reuse view-state --readwrite apply-range --shard-id 1 --use-flat-storage benchmarking ran
    1.02 ± 0.02 times faster than ./neard-base view-state --readwrite apply-range --shard-id 1 --use-flat-storage benchmarking

Shard 2

Shard 2 sees a 20% slowdown with the patch:

Benchmark 1: ./neard-base view-state --readwrite apply-range --shard-id 2 --use-flat-storage benchmarking
  Time (mean ± σ):      5.229 s ±  0.124 s    [User: 9.260 s, System: 3.722 s]
  Range (min … max):    5.098 s …  5.344 s    3 runs

Benchmark 2: ./neard-reuse view-state --readwrite apply-range --shard-id 2 --use-flat-storage benchmarking
  Time (mean ± σ):      6.269 s ±  0.012 s    [User: 10.259 s, System: 4.591 s]
  Range (min … max):    6.259 s …  6.283 s    3 runs

Summary
  ./neard-base view-state --readwrite apply-range --shard-id 2 --use-flat-storage benchmarking ran
    1.20 ± 0.03 times faster than ./neard-reuse view-state --readwrite apply-range --shard-id 2 --use-flat-storage benchmarking

Shard 3

Shard 3 sees the same 20-25% slowdown with the patch:

Benchmark 1: ./neard-base view-state --readwrite apply-range --shard-id 3 --use-flat-storage benchmarking
  Time (mean ± σ):      9.158 s ±  0.150 s    [User: 27.688 s, System: 5.420 s]
  Range (min … max):    8.985 s …  9.252 s    3 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs.

Benchmark 2: ./neard-reuse view-state --readwrite apply-range --shard-id 3 --use-flat-storage benchmarking
  Time (mean ± σ):     11.190 s ±  0.089 s    [User: 34.099 s, System: 7.273 s]
  Range (min … max):   11.118 s … 11.290 s    3 runs

Summary
  ./neard-base view-state --readwrite apply-range --shard-id 3 --use-flat-storage benchmarking ran
    1.22 ± 0.02 times faster than ./neard-reuse view-state --readwrite apply-range --shard-id 3 --use-flat-storage benchmarking

Shard 4

Shard 4 was like shard 0, with no results due to outcome changes that break the benchmarking setup.

Shard 5

Shard 5 even saw the patched binary being 35% slower than the baseline:

Benchmark 1: ./neard-base view-state --readwrite apply-range --shard-id 5 --use-flat-storage benchmarking
  Time (mean ± σ):      4.229 s ±  0.212 s    [User: 5.642 s, System: 2.227 s]
  Range (min … max):    4.037 s …  4.456 s    3 runs

Benchmark 2: ./neard-reuse view-state --readwrite apply-range --shard-id 5 --use-flat-storage benchmarking
  Time (mean ± σ):      5.765 s ±  0.663 s    [User: 15.108 s, System: 3.978 s]
  Range (min … max):    5.309 s …  6.526 s    3 runs

Summary
  ./neard-base view-state --readwrite apply-range --shard-id 5 --use-flat-storage benchmarking ran
    1.36 ± 0.17 times faster than ./neard-reuse view-state --readwrite apply-range --shard-id 5 --use-flat-storage benchmarking

Ekleog-NEAR commented 4 months ago

Closing as per the above comment.