virtio_balloon: add support for MMIO devices

This allows the virtio memory balloon driver to work with AWS Firecracker, which instantiates MMIO devices.

Thank you for working on this!

How did you test it? I can't seem to get it to do much (though it doesn't error anymore).

I don't expect the behaviour to be the same as Linux's, but from previous experience with the ballooning device, it behaved like this:

Initially set the size of the balloon to 0
Run the Firecracker microVM
Let it allocate a bunch of memory
Inflate the balloon to reclaim memory on the host

If there's nothing preventing freeing the pages inside the microvm, then that's all there is to it. If it can't free all the requested pages, then the kernel starts logging balloon-related information. I believe it applies pressure for these pages to be freed and future allocations by the kernel are severely affected (read: impossible and everything slows down to a crawl).

Testing this branch locally, I'm unable to get any memory freed from the host's perspective. Even inflating the balloon to the total memory allowed in the firecracker doesn't appear to do much.

I'm running a Deno app and I suspect a lot of memory should be free-able. It's a little hard to tell because I'm unable to gather RSS from the host:

let mem = Deno.memoryUsage();
console.log(`rss: ${mem.rss}, heap total: ${mem.heapTotal}, heap used: ${mem.heapUsed}, external: ${mem.external}`);

rss: 0, heap total: 13991936, heap used: 12160248, external: 3003513

I'm also using the sysinfo crate in my VM and it shows quite a lot of memory used (it does memory total - memory available):

=> system:
total memory: 533139456 bytes
used memory : 413736960 bytes

These numbers are with the balloon inflated to 512MB, which appears to have no effect. Inflating it to other sizes also didn't appear to change anything.

Firecracker process from the host side:

❯ cat /proc/42517/status
Name:   firecracker-v1.
Umask:  0022
State:  S (sleeping)
Tgid:   42517
Ngid:   0
Pid:    42517
PPid:   42516
TracerPid:      0
Uid:    0       0       0       0
Gid:    0       0       0       0
FDSize: 524288
Groups: 0 
NStgid: 42517
NSpid:  42517
NSpgid: 42517
NSsid:  42516
Kthread:        0
VmPeak:   531548 kB
VmSize:   531528 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:    246528 kB
VmRSS:    246528 kB
RssAnon:          244736 kB
RssFile:            1792 kB
RssShmem:              0 kB
VmData:   529132 kB
VmStk:       152 kB
VmExe:      1836 kB
VmLib:         8 kB
VmPTE:       668 kB
VmSwap:        0 kB

Somehow the memory inside the VM is reported as more than outside which feels odd to me, but I suspect there's some weirdness regarding how memory is reported from inside the guest.

As far as I understand, Deno.memoryUsage() only reports memory stats related to the running application process, and not system-wide stats, so it can't be used to detect how much memory is in the balloon. With the sysinfo crate, you can see how the memory balloon is inflated and deflated via the used_memory() function, which reports the amount of memory used system-wide (the memory in the balloon is accounted for in this value). For example, using the following Rust program:

// SystemExt provides the memory methods in the sysinfo version used here.
use sysinfo::{System, SystemExt};

fn main() {
    let mut t = System::new_all();
    // Print system-wide memory figures once per second.
    loop {
        t.refresh_all();
        println!("total memory: {} KB", t.total_memory() / 1_000);
        println!("used memory : {} KB", t.used_memory() / 1_000);
        println!("Done.");
        std::thread::sleep(std::time::Duration::from_millis(1000));
    }
}

in a Firecracker VM with 512 MB of RAM and the balloon initially empty, at the beginning I see figures such as below:

total memory: 533180 KB
used memory : 63901 KB

The value for used memory above is what is needed to run the application process. If I execute a PATCH request to set "amount_mib" to 512, I see figures such as:

total memory: 533180 KB
used memory : 464007 KB

The amount of used memory is now much higher, because most of it is taken by the balloon (and therefore cannot be used by the guest). The kernel tries to keep a "safe" amount of free memory (64 MB in this setup) so as not to risk running out of memory for its internal operations; that's why the balloon is not inflated to take all the available RAM. If I change "amount_mib" back to 0, the amount of used memory is similar to what it was initially:

total memory: 533180 KB
used memory : 67645 KB

which means that the balloon has been deflated and the memory it was holding is now available again for the guest to use.
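For anyone reproducing the inflate/deflate steps above, here is a minimal Rust sketch of the PATCH request, using only the standard library. It assumes the API socket lives at /tmp/firecracker.socket (adjust to your setup) and that the balloon is updated via PATCH /balloon with an "amount_mib" field, as I understand the Firecracker ballooning docs. The function names are made up for illustration.

```rust
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;

/// Builds the raw HTTP/1.1 request that sets the balloon target size in MiB.
/// The /balloon path and the amount_mib field are assumed from the
/// Firecracker ballooning documentation.
fn balloon_patch_request(amount_mib: u32) -> String {
    let body = format!("{{\"amount_mib\": {}}}", amount_mib);
    format!(
        "PATCH /balloon HTTP/1.1\r\nHost: localhost\r\nAccept: application/json\r\nContent-Type: application/json\r\nContent-Length: {}\r\n\r\n{}",
        body.len(),
        body
    )
}

/// Sends the request over the API Unix socket and returns the first chunk
/// of the raw response (status line and headers).
fn set_balloon(socket_path: &str, amount_mib: u32) -> std::io::Result<String> {
    let mut stream = UnixStream::connect(socket_path)?;
    stream.write_all(balloon_patch_request(amount_mib).as_bytes())?;
    let mut buf = [0u8; 4096];
    let n = stream.read(&mut buf)?;
    Ok(String::from_utf8_lossy(&buf[..n]).into_owned())
}

fn main() {
    // Hypothetical socket path; use whatever --api-sock you started Firecracker with.
    match set_balloon("/tmp/firecracker.socket", 512) {
        Ok(response) => println!("{}", response),
        Err(e) => eprintln!("could not reach the Firecracker API socket: {}", e),
    }
}
```

Calling set_balloon with 0 afterwards asks for the balloon to be deflated again.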

I don't know what Firecracker does with the memory in the balloon (e.g. whether it releases the memory to the host OS, or does something else), so I don't know if we can gather any relevant information from the Firecracker process memory stats.

Ah yes, sorry, I was just including the Deno output as a data point, but wasn't exactly relying on it. I found it interesting it couldn't report on RSS at all. Is that a limitation of nanos? Might just be an incompatible implementation in Deno.

I'll have to test again on Linux, but I don't recall the balloon showing as memory usage. My own memory could be wrong though! It would make sense that it'd show as usage.

I don't know what Firecracker does with the memory in the balloon (e.g. whether it releases the memory to the host OS, or does something else), so I don't know if we can gather any relevant information from the Firecracker process memory stats.

The main purpose, in my opinion, of the balloon is that it would reclaim memory on the host. Firecracker does not reclaim allocated memory. If the guest bursts to 300 / 512MB and then goes back down to 100MB, then the host will show the process as using 300MB RSS. Inflating the balloon reclaims the now-unused memory from the host and lowers the RSS of the firecracker process.
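To watch the host-side effect described above, one option is to poll the VmRSS line of the Firecracker process while inflating and deflating the balloon. A minimal sketch, assuming a Linux host; the helper name is hypothetical and the PID is passed on the command line:

```rust
use std::fs;

/// Extracts the VmRSS value (in kB) from the text of /proc/<pid>/status.
/// Returns None if the line is missing or cannot be parsed.
fn parse_vm_rss_kb(status: &str) -> Option<u64> {
    status
        .lines()
        .find_map(|line| line.strip_prefix("VmRSS:"))
        .and_then(|rest| rest.split_whitespace().next())
        .and_then(|value| value.parse().ok())
}

fn main() {
    // First argument: PID of the firecracker process (falls back to "self").
    let pid = std::env::args().nth(1).unwrap_or_else(|| "self".to_string());
    match fs::read_to_string(format!("/proc/{}/status", pid)) {
        Ok(text) => match parse_vm_rss_kb(&text) {
            Some(kb) => println!("VmRSS: {} kB", kb),
            None => eprintln!("no VmRSS line found"),
        },
        Err(e) => eprintln!("cannot read status for pid {}: {}", pid, e),
    }
}
```

Running it before and after each balloon change gives the RSS numbers to compare against the guest-side figures.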

I started my firecracker with amount_mib: 0 and I saw this:

=> system:
total memory: 533139456 bytes
used memory : 239341568 bytes

Then I inflated the balloon to 300 MiB and saw:

=> system:
total memory: 533139456 bytes
used memory : 433418240 bytes

Then I deflated it back to 0 and saw:

=> system:
total memory: 533139456 bytes
used memory : 282423296 bytes

The host's RSS for firecracker did not move (or it did by maybe 1-3MB) through these various operations.

I short-circuited my program to allocate 200MB of random bytes of memory (allocating zeroes did nothing for some reason, maybe as an optimization in nanos or in Rust?) and I got the following:

=> system:
total memory: 533139456 bytes
used memory : 91029504 bytes

allocated 209715200 bytes

=> system:
total memory: 533139456 bytes
used memory : 300630016 bytes

Looking from the host, it did use ~238MB of RSS. I don't know why that figure would be smaller than what's reported inside the guest, but I can ignore that for now :)

After 30 seconds, I deallocated the Vec<u8> and saw this:

deallocated

=> system:
total memory: 533139456 bytes
used memory : 90910720 bytes

So far so good!

As expected, by my own mental model of Firecracker, the RSS for the process on the host remained stable at 238MB.

I then inflated the balloon to 400MiB and saw:

=> system:
total memory: 533139456 bytes
used memory : 456646656 bytes

Which would fit what you've said in your last comment.

The host has reclaimed the memory and is sitting at around 57MB. Again, that figure is a bit surprising to me because it's lower than what the guest reports.

Now, given all of that, it seems like memory ballooning does work as expected! My bad.

This is leading me to think that the base memory used by my own app is high and unreclaimable by the balloon device (and I assume: the kernel). It seems like running the same Deno app on the host uses roughly 128MB (RSS). So maybe the 90MB base I got in my program + 128MB sounds about right. I was hoping my own program would be closer to 128MB and have practically zero overhead.

My program is 103MB and deno compiled with the same flags is 123MB. In theory I could be using less memory inside the guest, but in practice I am not and there's overhead unaccounted for (likely on my end).

Edit: Running my app on my Linux host directly uses only about 128MB of RSS memory as well, in line with Deno. At this point I'm fairly sure it's an interaction with the nanos kernel. I can open a separate issue for that. The only difference I can think of is the 3MB of lib/ files needed to run my app, but I figure running it on the host would also load those in memory and 3MB is a small number anyway.

My app image built with ops is 123MB.

Do you have any idea why RSS reported from the host for the firecracker process could be lower than the memory reported inside the guest? In our own experience, that's never the case.

Ah yes, sorry, I was just including the Deno output as a data point, but wasn't exactly relying on it. I found it interesting it couldn't report on RSS at all. Is that a limitation of nanos? Might just be an incompatible implementation in Deno.

Deno retrieves the RSS from the /proc/self/statm file, which doesn't exist in Nanos, that's why the reported RSS value is 0.
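To illustrate why a missing file ends up as "rss: 0", here is a small Rust sketch of reading RSS from statm. This is not Deno's actual code. The function names, the fallback to 0, and the 4096-byte page size are assumptions for illustration. The only fact relied on is the Linux statm layout, where the second field is the resident set size in pages.

```rust
/// Returns the resident set size in pages from the contents of
/// /proc/<pid>/statm (second whitespace-separated field).
fn parse_statm_resident_pages(statm: &str) -> Option<u64> {
    statm.split_whitespace().nth(1)?.parse().ok()
}

/// Reads this process's RSS in bytes, falling back to 0 when the file
/// does not exist or cannot be parsed (assumed fallback behaviour).
fn rss_bytes_or_zero(page_size: u64) -> u64 {
    std::fs::read_to_string("/proc/self/statm")
        .ok()
        .and_then(|contents| parse_statm_resident_pages(&contents))
        .map(|pages| pages * page_size)
        .unwrap_or(0)
}

fn main() {
    // 4096 is an assumed page size; query the real one in production code.
    println!("rss: {}", rss_bytes_or_zero(4096));
}
```

On a system without /proc/self/statm the read fails and the sketch prints 0, which matches the output seen in the guest.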

I short-circuited my program to allocate 200MB of random bytes of memory (allocating zeroes did nothing for some reason, maybe as an optimization in nanos or in Rust?)

Usually when a program allocates a large amount of memory, what happens under the hood is that an mmap() syscall is invoked, which allocates the requested amount of virtual address space but does not map it directly to physical RAM, so RAM space is not allocated right away; the actual allocation (which shows up in the used memory) happens only when the program accesses the memory (in your case, when the memory is filled with random values). Allocating zeros does not require accessing the memory (because mmap()ed memory is always initialized with zeros), that's why it doesn't show up in the used memory.
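A small Rust sketch of the difference, assuming 4 KiB pages. As far as I know, vec![0u8; n] typically goes through a zeroed allocation, so a large buffer can be backed by fresh untouched pages until something writes to them. The helper name is made up for illustration.

```rust
/// Assumed page size; real code should query it from the OS.
const PAGE_SIZE: usize = 4096;

/// Writes one non-zero byte at the start of every page-sized chunk so that
/// each page is actually accessed. Returns the number of chunks touched.
fn touch_pages(buf: &mut [u8]) -> usize {
    let mut touched = 0;
    for chunk in buf.chunks_mut(PAGE_SIZE) {
        chunk[0] = 1;
        touched += 1;
    }
    touched
}

fn main() {
    let size = 200 * 1024 * 1024;
    // Zeroed allocation: address space is reserved, but the pages are
    // usually not backed by physical memory yet.
    let mut buf = vec![0u8; size];
    println!("allocated {} bytes (not touched yet)", buf.len());

    // Writing to every page forces the memory to be backed, which is the
    // point at which it shows up as used memory.
    let pages = touch_pages(&mut buf);
    println!("touched {} pages", pages);
}
```

Checking the used-memory figure between the two println! calls should show the jump only after the second step.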

Do you have any idea why RSS reported from the host for the firecracker process could be lower than the memory reported inside the guest?

Nanos internally allocates guest physical memory for many different purposes, but not all of that memory may be used immediately, and only the memory that has actually been used (i.e. read from or written to) shows up in Firecracker's RSS. For example, Nanos has internal heaps that use different caches for different allocation sizes; if e.g. one 32-byte allocation is made inside the kernel, the relevant cache allocates a 2-MB region of physical memory, but only a fraction of that memory is going to be used immediately: in this scenario, Nanos reports the entire 2 MB as in-use memory (because it has been allocated in the guest physical memory and is therefore unavailable for other uses), while only the fraction that's actually used (which could be as low as 4 KB) shows up in Firecracker's RSS. Another example is the set of buffers pre-allocated by the network interface driver for reception of network packets: all the memory occupied by the buffers is considered in-use memory by Nanos, but only the buffers that have actually been used for receiving incoming network packets show up in the RSS.
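As a toy model of the cache example above (not Nanos code), the two views can be sketched by rounding the same allocation up to different granularities: the guest accounts for whole 2 MB regions, while the host only sees the touched 4 KB pages. The constants come from the example; the function names are hypothetical, and the model assumes all allocated bytes are contiguous and touched.

```rust
const PAGE_SIZE: u64 = 4 * 1024; // host page granularity
const REGION_SIZE: u64 = 2 * 1024 * 1024; // cache region granularity from the example

/// Rounds `bytes` up to a multiple of `unit`.
fn round_up(bytes: u64, unit: u64) -> u64 {
    (bytes + unit - 1) / unit * unit
}

/// Guest view: whole regions are reserved, so they all count as in-use.
fn guest_used(bytes: u64) -> u64 {
    round_up(bytes, REGION_SIZE)
}

/// Host view: only pages that were actually touched show up in RSS.
fn host_resident(bytes: u64) -> u64 {
    round_up(bytes, PAGE_SIZE)
}

fn main() {
    let allocation = 32; // one 32-byte kernel allocation
    println!("guest reports {} bytes in use", guest_used(allocation));
    println!("host RSS grows by {} bytes", host_resident(allocation));
}
```

For the 32-byte case this gives 2 MB on the guest side against 4 KB on the host side, which is the gap described above.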

I'm not sure I fully understood the issue you mentioned when comparing the RSS of your app when run directly on the host with the RSS of Firecracker when running your app under Nanos, but it seems to me you are considering the used memory reported by the guest (e.g. 90 MB) as separate from Firecracker's RSS, and you are adding the two figures. In reality, the two figures refer to the same memory (if we ignore the additional memory needed by the VMM itself to create and run the guest VM), and their differences can be explained with what I wrote above. (That's when the memory balloon is empty; if you inflate the balloon, its memory is accounted for in the guest used memory but not in Firecracker's RSS, of course.) There would be an issue if the RSS when running your app under Nanos was substantially higher than the RSS of your app when run directly on the host, but as long as that doesn't happen I think we are seeing the expected behavior.

Thanks for the explanations. This is insightful.

I'm not sure I fully understood the issue you mentioned when comparing the RSS of your app when run directly on the host with the RSS of Firecracker when running your app under Nanos, but it seems to me you are considering the used memory reported by the guest (e.g. 90 MB) as separate from Firecracker's RSS, and you are adding the two figures.

I'm essentially trying to measure the overhead of nanos. Firecracker should be using a few MBs (6?) and my app uses ~120MB RSS, so I'm trying to account for the ~100MB extra I'm seeing from running my app with the nanos kernel.

I haven't yet had a chance to test this using Linux, but I suspect there's overhead there as well. I just don't expect it to be that much.

By using the balloon my hope was to reclaim as much as possible. It seems like there's more unreclaimable memory when using Deno vs. just allocating 200MB of random bytes on the heap in a Rust program. I'm saying this because of my test comparing the RSS of firecracker before and after allocating and before and after inflating the balloon.

I would expect to be able to reclaim a lot more memory on the host for an app that uses ~120MB of RSS.

From your explanation, it sounds like there's some form of allocation that's not being deallocated when the balloon expands.

The reason I am specifically testing for low memory footprint is that I want to fit as many of these as possible on a host. The difference between 128MB and 200MB+ of memory allocated for a firecracker process is big, almost halving the number that can fit on a host.

Another example is the set of buffers pre-allocated by the network interface driver for reception of network packets: all the memory occupied by the buffers is considered in-use memory by Nanos, but only the buffers that have actually been used for receiving incoming network packets show up in the RSS.

My app should only have 1 TCP listener and I'm running Deno with --cache-only to prevent downloading packages at runtime.

I did notice that making a single request will increase the memory usage by a little bit.

Is there a way to inspect what "kinds" of allocations (and how much) nanos is doing, at runtime?

My confusion comes from the fact that a test program's memory, allocating 200MB of random bytes, can be fully reclaimed when using the balloon device. But Deno's or V8's memory can't be reclaimed by the host in the same way. So I imagine it has something to do with the "kinds" of allocations it is doing.

On Linux, I suspect the balloon creates memory pressure and the kernel drops buffers and caches more aggressively. I'm wondering if nanos is not aggressive enough in this case. If I am applying memory pressure, then I expect to reclaim as much as possible, even at the cost of performance, if only temporarily.

There would be an issue if the RSS when running your app under Nanos was substantially higher than the RSS of your app when run directly on the host, but as long as that doesn't happen I think we are seeing the expected behavior.

How do I measure that? The RSS stat is not available from inside.

As far as I can tell, there's roughly 2x overhead in memory by running my app in a Firecracker with the nanos kernel. The RSS of the firecracker process is what I care about because it's what the host sees and uses to determine used / available memory. If the firecracker RSS is artificially too high, then a host can run into an OOM situation (though I wouldn't let it go that far since I'd be tracking RSS).

Is there a way to inspect what "kinds" of allocations (and how much) nanos is doing, at runtime?

You can get memory-related info from the guest via the virtio-balloon statistics (see https://github.com/firecracker-microvm/firecracker/blob/main/docs/ballooning.md#virtio-balloon-statistics). To get more detailed info on kernel internals, we have the in-kernel management interface that implements a telnet server that can be queried to retrieve various data, including memory heap statistics; but this interface is not enabled in production kernels, needs to be enabled explicitly by rebuilding the kernel with MANAGEMENT=telnet, and in order to make sense of the info you get from it you would need advanced knowledge of kernel internals.

How do I measure that? The RSS stat is not available from inside.

I was referring to the RSS of the firecracker process, not the RSS of the deno process inside the guest.

So it seems you are seeing ~100MB extra RSS in firecracker compared to running your deno app directly on the host; and it seems you are unable to reclaim that memory via the virtio-balloon device. This sounds like something that could be improved. If I run the simple app in main.ts at https://github.com/nanovms/nanos/issues/2080#issue-2677006591, I see roughly 25 MB of difference between the two RSS values (52 MB when running the app in the host, vs 77 MB when running it with Firecracker), and that seems pretty reasonable, given that we have to account for the memory needed by Nanos to operate the guest VM, as well as the memory needed by the Firecracker process itself. Do you have sample code that shows the ~100 MB unreclaimable overhead when run with Nanos and Firecracker? If we can replicate this behavior, we can take a look and see what can be done about it.

If I run the simple app in main.ts at #2080 (comment), I see roughly 25 MB of difference between the two RSS values (52 MB when running the app in the host, vs 77 MB when running it with Firecracker), and that seems pretty reasonable, given that we have to account for the memory needed by Nanos to operate the guest VM, as well as the memory needed by the Firecracker process itself. Do you have sample code that shows the ~100 MB unreclaimable overhead when run with Nanos and Firecracker? If we can replicate this behavior, we can take a look and see what can be done about it.

That does seem reasonable, yes.

I'll try and reproduce the memory bloat with my simple app.

The one that does appear to be problematic is a fresh boilerplate app. Basically:

deno run -A -r https://fresh.deno.dev

Then, if you're using deno compile (btw, I am using Deno 2.x for this), you can point it at the main.ts file as an entrypoint.

Edit: I couldn't reproduce high memory usage with the simple JS app which leads me to think there's something about how memory is allocated with the more complex app.

nanovms / nanos

virtio_balloon: add support for MMIO devices #2081