rust-vmm / vm-memory

Virtual machine's guest memory crate
Apache License 2.0

mmap: try THP via madvise #113

Open nmanthey opened 4 years ago

nmanthey commented 4 years ago

Huge pages bring performance benefits for memory-intensive applications. They can be used either by statically pre-reserving huge pages or by relying on transparent huge pages (THP); the latter is the simpler approach and is the one taken here.

While some distributions enable transparent huge pages by default, others choose to allow this feature only when it is requested via the madvise system call.
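The active system-wide policy can be inspected via sysfs. A minimal sketch in Rust (the sysfs path is the standard Linux location; `thp_policy` is a hypothetical helper, not part of this crate):

```rust
use std::fs;

/// Read the system-wide THP policy from sysfs, if present.
/// Typical content: "always [madvise] never" (brackets mark the active mode).
fn thp_policy() -> Option<String> {
    fs::read_to_string("/sys/kernel/mm/transparent_hugepage/enabled")
        .ok()
        .map(|s| s.trim().to_string())
}

fn main() {
    match thp_policy() {
        Some(p) => println!("THP policy: {}", p),
        None => println!("THP not available on this kernel"),
    }
}
```

On a distribution configured for madvise-only THP, the bracketed value is `madvise`, which is exactly the case this change addresses.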

This change adds an madvise system call to the memory allocation path. On Unix systems, the mmap invocation is followed by an madvise call that asks the kernel to back the memory with transparent huge pages, if possible.
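The mmap-then-madvise sequence can be sketched as follows. This is a self-contained illustration, not the PR's actual code: the FFI declarations and constants are hand-written Linux x86_64 values (real code would use the `libc` crate), and `mmap_with_thp` is a hypothetical helper name:

```rust
use std::os::raw::{c_int, c_void};

// Linux x86_64 constants, declared by hand to keep the sketch dependency-free.
const PROT_READ: c_int = 0x1;
const PROT_WRITE: c_int = 0x2;
const MAP_PRIVATE: c_int = 0x02;
const MAP_ANONYMOUS: c_int = 0x20;
const MADV_HUGEPAGE: c_int = 14;

extern "C" {
    fn mmap(addr: *mut c_void, len: usize, prot: c_int, flags: c_int,
            fd: c_int, off: i64) -> *mut c_void;
    fn madvise(addr: *mut c_void, len: usize, advice: c_int) -> c_int;
    fn munmap(addr: *mut c_void, len: usize) -> c_int;
}

/// Map `len` bytes anonymously, then advise the kernel to back the
/// range with transparent huge pages.
fn mmap_with_thp(len: usize) -> Option<*mut c_void> {
    let addr = unsafe {
        mmap(std::ptr::null_mut(), len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0)
    };
    if addr as isize == -1 {
        return None; // MAP_FAILED
    }
    // Best effort: madvise fails with EINVAL on kernels built without
    // CONFIG_TRANSPARENT_HUGEPAGE, so the result is deliberately ignored.
    unsafe { madvise(addr, len, MADV_HUGEPAGE) };
    Some(addr)
}

fn main() {
    let len = 2 * 1024 * 1024; // one 2 MiB huge page on x86_64
    let ptr = mmap_with_thp(len).expect("mmap failed");
    println!("mapped {} bytes", len);
    unsafe { munmap(ptr, len) };
}
```

Note that MADV_HUGEPAGE is advisory only: the kernel collapses the pages opportunistically, so the mapping still works even where THP is unavailable.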

Note: this is a prototype implementation of huge page support. No performance testing has been performed yet; we expect results similar to those reported in https://arxiv.org/abs/2004.14378. Once this data is available, a configuration layer should be added so that this behavior can be enabled or disabled.

Signed-off-by: Norbert Manthey nmanthey@amazon.de

Testing Done

Currently, I have only compiled Firecracker with the change; I will run its test suite next and add the results here, and also look into performance testing.

The test suite mostly passes, except for some tests that fail due to timeouts, which might be because I am testing on a mobile CPU.

Update dependencies

firecracker$ cargo update
    Updating crates.io index
    Updating git repository `https://github.com/firecracker-microvm/kvm-bindings`
    Updating git repository `https://github.com/firecracker-microvm/kvm-ioctls`
...
    Updating unicode-xid v0.2.0 -> v0.2.1
      Adding vm-memory v0.2.2 ($HOME/firecracker/local-crates/vm-memory)
    Removing vm-memory v0.2.2
    Updating winapi v0.3.8 -> v0.3.9

Actually Build

firecracker$ tools/devtool build |& tee build.log
[Firecracker devtool] Starting build (debug, musl) ...
 Downloading crates ...
...
   Compiling vmm-sys-util v0.6.1
   Compiling vm-memory v0.2.2 (/firecracker/local-crates/vm-memory)
   Compiling timerfd v1.1.1
...
    Finished dev [unoptimized + debuginfo] target(s) in 38.89s
[Firecracker devtool] Build successful.
[Firecracker devtool] Binaries placed under $HOME/firecracker/build/cargo_target/x86_64-unknown-linux-musl/debug

Required Changes in Firecracker (to pick up this change)

Place this vm-memory repository at `local-crates/vm-memory` inside the Firecracker repository.

In Firecracker, apply the following changes on top of the current master commit:

firecracker$ git diff HEAD^^
diff --git a/Cargo.toml b/Cargo.toml
index 799712f7..88da28f0 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -1,6 +1,9 @@
 [workspace]
 members = ["src/firecracker", "src/jailer"]

+[patch.crates-io]
+vm-memory = { path = 'local-crates/vm-memory' }
+
 [profile.dev]
 panic = "abort"

diff --git a/tools/devtool b/tools/devtool
index 2c2a9551..d870cef5 100755
--- a/tools/devtool
+++ b/tools/devtool
@@ -367,6 +367,7 @@ run_devctr() {
     # Use 'z' on the --volume parameter for docker to automatically relabel the
     # content and allow sharing between containers.
     docker run "${docker_args[@]}" \
+        --network=host \
         --rm \
         --volume /dev:/dev \
         --volume "$FC_ROOT_DIR:$CTR_FC_ROOT_DIR:z" \
jiangliu commented 4 years ago

We have done the same optimization outside of the vm-memory crate. It's really tricky to deal with preallocation and transparent huge pages. How about leaving this part to the VMM implementation?

nmanthey commented 4 years ago

> outside of the vm-memory crate

Can you point to your implementation of this? I'd argue that offering this in a configurable way right in the crate might make it easier for other consumers to adopt.

Can you also share the benchmarks you ran to measure the effect of this change?

jiangliu commented 4 years ago

> Can you point to your implementation of this? I'd argue that offering this in a configurable way right in the crate might make it easier for other consumers to adopt.
>
> Can you also share the benchmarks you ran to measure the effect of this change?

We have done some benchmarks with internal workloads, and the improvement depends on several factors. In most cases it achieves a 10-20% latency improvement.

alexandruag commented 4 years ago

Thanks for the details! It seems there are some trade-offs to consider when using THP, so it shouldn't always be enabled; does that sound right? In that case, since the madvise call happens separately from creating the mapping, the VMM invoking madvise out-of-band, as Gerry suggested, is definitely an option. Another is to add a method to GuestRegionMmap (and potentially a top-level one to GuestMemoryMmap too) which invokes madvise after the regions have been created. It would be interesting to hear more thoughts here as well.
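The second option could look roughly like the sketch below. `Region` is a hypothetical stand-in for GuestRegionMmap (holding only an address and a size), and the madvise declaration and constant are hand-written Linux x86_64 values to keep the sketch self-contained:

```rust
use std::os::raw::{c_int, c_void};

const MADV_HUGEPAGE: c_int = 14; // Linux value

extern "C" {
    fn madvise(addr: *mut c_void, len: usize, advice: c_int) -> c_int;
}

/// Hypothetical stand-in for GuestRegionMmap.
struct Region {
    addr: *mut u8,
    size: usize,
}

impl Region {
    /// Opt this region into THP after it has been created; returns false
    /// if the kernel rejects the advice (e.g. THP compiled out).
    fn advise_hugepages(&self) -> bool {
        unsafe { madvise(self.addr as *mut c_void, self.size, MADV_HUGEPAGE) == 0 }
    }
}

fn main() {
    // Demonstrate on a page-aligned heap allocation standing in for an
    // mmap-ed guest region (madvise requires page-aligned addresses).
    let layout = std::alloc::Layout::from_size_align(4096, 4096).unwrap();
    let p = unsafe { std::alloc::alloc(layout) };
    assert!(!p.is_null());
    let region = Region { addr: p, size: 4096 };
    // The advice may legitimately be rejected, so print rather than assert.
    println!("madvise accepted: {}", region.advise_hugepages());
    unsafe { std::alloc::dealloc(p, layout) };
}
```

Exposing such a method keeps the policy decision with the VMM while sparing each consumer its own unsafe FFI code.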

nmanthey commented 3 years ago

I added a cherry-picked commit from my current development tree, which is based on v0.3.0 of this package. To avoid moving the PR around, I decided to just add the change on top.

The change introduces options in a minimal way, without properly forwarding them to the actual user of the created maps. Do you suggest continuing the effort and lifting the options struct into the signatures of all relevant functions, so that a caller can always choose how to set them? That seems to result in a lot of code (and interface) changes.

Alternatives I can see would be a singleton for these properties, or some other global mechanism to control the behavior, such as environment variables.

I currently prefer a singleton; any comments?
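A write-once singleton could be sketched with the standard library's `std::sync::OnceLock`; the `MmapOptions` struct and function names below are illustrative, not the crate's API:

```rust
use std::sync::OnceLock;

/// Hypothetical process-wide mmap options (names are illustrative).
#[derive(Debug, Clone, Copy)]
pub struct MmapOptions {
    pub thp: bool,
}

static MMAP_OPTIONS: OnceLock<MmapOptions> = OnceLock::new();

/// Set the options once, before any region is created.
/// Returns false if the options were already set.
pub fn set_mmap_options(opts: MmapOptions) -> bool {
    MMAP_OPTIONS.set(opts).is_ok()
}

/// Read the options, defaulting to THP disabled.
pub fn mmap_options() -> MmapOptions {
    *MMAP_OPTIONS.get_or_init(|| MmapOptions { thp: false })
}

fn main() {
    assert!(set_mmap_options(MmapOptions { thp: true }));
    assert!(mmap_options().thp);
    // A second set is rejected: the singleton is write-once.
    assert!(!set_mmap_options(MmapOptions { thp: false }));
    println!("thp = {}", mmap_options().thp);
}
```

The write-once semantics avoid the race of options changing between region creations, at the cost of making the setting process-global rather than per-region.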

Wrt #120, can we see the VMM change that uses huge pages? I guess that change does not support using transparent huge pages, does it?

alexandruag commented 3 years ago

Hmm, it seems like there are several proposed changes in flight that would benefit from a better interface for building GuestMemoryMmap objects (i.e. this PR, #120, #125, not to mention stuff like adding more flexibility when creating file-backed mappings). However, I don't think it's super clear at the moment what the right solution is. How about we provide the THP-related functionality as part of a method that can be called from the outside for now, and later add an option when revamping the higher-level interfaces?

From what I understand, #120 is sort of orthogonal to the transparent huge pages aspect (i.e. regions with THP are not considered to be backed by huge pages w.r.t. the semantics in there), but I might be wrong, and hopefully someone will correct me if that's the case.