leptos-rs / leptos

Build fast web applications with Rust.
https://leptos.dev

possible memory leak in server integrations, origins unclear #590

Closed: Zk2u closed this issue 1 year ago

Zk2u commented 1 year ago

Context: #581

I removed all API use in valeralabs/web, hardcoded a user profile into users.rs, and ran a 120-second stress test:

Concurrency Level:      10
Time taken for tests:   120.001 seconds
Complete requests:      904656
Failed requests:        0
Total transferred:      4358651880 bytes
HTML transferred:       4260043940 bytes
Requests per second:    7538.75 [#/sec] (mean)
Time per request:       1.326 [ms] (mean)
Time per request:       0.133 [ms] (mean, across all concurrent requests)
Transfer rate:          35470.55 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.2      0       2
Processing:     0    1   0.5      1      23
Waiting:        0    1   0.5      1      23
Total:          0    1   0.5      1      23

Percentage of the requests served within a certain time (ms)
  50%      1
  66%      1
  75%      2
  80%      2
  90%      2
  95%      2
  98%      2
  99%      3
 100%     23 (longest request)

The memory usage of the web server increased from ~5MB to 357MB and did not decrease after the benchmark. This may be related to #529, but I'm not sure where the issue is coming from yet.

Zk2u commented 1 year ago

Tested on the hackernews example with Actix using regular serde. 3190 requests increased memory usage from 5MB to 17.5MB (and, again, it did not decrease afterwards).

gbj commented 1 year ago

Do you have a reason to believe this is coming from within Leptos in particular? In other words, if you run the same stress test in Actix with the same number of requests but serve some static or generated HTML instead of running it through Leptos, what are the results?
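
A minimal Leptos-free comparison handler might look something like this (a sketch, not code from the issue; the route, port, and HTML body are placeholders):

use actix_web::{get, App, HttpResponse, HttpServer, Responder};

// Serve a fixed HTML body so the measurement excludes Leptos entirely.
#[get("/")]
async fn baseline() -> impl Responder {
    HttpResponse::Ok()
        .content_type("text/html; charset=utf-8")
        .body("<html><body><p>static baseline</p></body></html>")
}

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    HttpServer::new(|| App::new().service(baseline))
        .bind(("127.0.0.1", 3000))?
        .run()
        .await
}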

gbj commented 1 year ago

This conversation may also provide a helpful frame of reference.

See also this GitHub issue and discussion.

Zk2u commented 1 year ago

The first test calls another Actix server, our API, on every request (1:1 with leptos). The API instance remained at 4.8MB throughout the test, but the Leptos server grew to 977MB (check the latest reply on #581).

gbj commented 1 year ago

I'm happy to take a look at this locally to try to figure it out, if you can give me the steps you used for the stress testing. This is an area I'm pretty unfamiliar with.

Zk2u commented 1 year ago

I'm using ApacheBench. If you're on a Mac (I think...), it should come preinstalled. The command I'm using is:

ab -n 15000 -c 10 http://127.0.0.1:3000/
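
For longer, time-boxed runs, ab can also take a time limit instead of a request count; note that -t implies a 50,000-request cap unless -n is raised as well:

ab -t 600 -n 100000000 -c 10 http://127.0.0.1:3000/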

Zipped test app attached. Run it with cargo leptos watch -r

gbj commented 1 year ago

Thanks. I ran this against the todo apps to avoid spamming the 3rd-party hackernews API I use, and was able to reproduce it on a smaller scale. Interestingly, Axum seems to do a better job of shrinking its memory footprint back down between bursts of requests, but I agree there's probably a Leptos leak somewhere.

I checked and can confirm that the Runtime structs are being dropped for each request, and that the RUNTIMES slotmap capacity is staying stable, both of which are very good (although it would have been an easy fix).
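
As a general technique (a generic sketch, not the actual Leptos internals), one way to confirm a type really is dropped once per request is a live-instance counter in its Drop impl:

use std::sync::atomic::{AtomicUsize, Ordering};

// Count of live Runtime values; should return to 0 after a burst of requests.
static LIVE: AtomicUsize = AtomicUsize::new(0);

struct Runtime; // stand-in for the per-request runtime type

impl Runtime {
    fn new() -> Self {
        LIVE.fetch_add(1, Ordering::Relaxed);
        Runtime
    }
}

impl Drop for Runtime {
    fn drop(&mut self) {
        LIVE.fetch_sub(1, Ordering::Relaxed);
    }
}

fn main() {
    for _ in 0..1_000 {
        let _rt = Runtime::new(); // dropped at the end of each iteration
    }
    assert_eq!(LIVE.load(Ordering::Relaxed), 0);
}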

I think the same behavior appears with very simple Leptos examples, based on some limited testing. I'm not very familiar with tools to trace this kind of memory leak, so I'd appreciate any help.

gbj commented 1 year ago

Okay, I tested this against a route that just uses render_to_string and found completely flat memory use, which is great. So it's something inside the render_to_stream implementation, which makes perfect sense to me.
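
A loop along these lines reproduces the flat-memory case (a sketch mirroring the streaming example below; the render_to_string signature is assumed from the same Leptos version):

use leptos::{ssr::render_to_string, view, IntoView};

fn main() {
    // Render the same view synchronously in a tight loop; memory use stays flat.
    loop {
        let html = render_to_string(|cx| view! { cx,
            <div>
                <p>"Paragraph 1."</p>
                <p>"Paragraph 2."</p>
                <p>"Paragraph 3."</p>
            </div>
        }.into_view(cx));
        println!("{html}");
    }
}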

Zk2u commented 1 year ago

A similar thing may have occurred in #103 (adding for potential extra context).

gbj commented 1 year ago

> A similar thing may have occurred in #103 (adding for potential extra context).

Thanks. That particular issue was from before I implemented proper management for runtimes, which I've confirmed are now being disposed of properly, so I think it's a red herring. But having narrowed this down to the streaming implementation is very helpful, and I can dig in a little more later.

gbj commented 1 year ago

Edited the title, if you don't mind. I was able to reproduce with this very simple main.rs, which makes it much easier for me to test.

use leptos::{ssr::render_to_stream, view, IntoView};
use futures::StreamExt;

#[tokio::main]
async fn main() {
    // Render the same small view to an HTML stream over and over;
    // if nothing leaks, memory use should stay flat.
    loop {
        let mut stream = Box::pin(render_to_stream(|cx| view! { cx,
            <div>
                <p>"Paragraph 1."</p>
                <p>"Paragraph 2."</p>
                <p>"Paragraph 3."</p>
            </div>
        }.into_view(cx)));

        // Drain the stream chunk by chunk, as a server integration would.
        while let Some(chunk) = stream.next().await {
            println!("{chunk}");
        }
    }
}

gbj commented 1 year ago

Oh no, I've been a very naughty boy.

It turns out if you write a function called run_scope_undisposed and you write in the docs, "If you do not dispose of the scope on your own, memory will leak" then... if you do not dispose of the scope on your own, memory will leak. To be honest I thought those would be cleaned up when disposing of the runtime, but apparently I thought wrong.
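
The pattern the docs warn about looks roughly like this (a hedged sketch against the leptos_reactive API of that era; the exact names and return types are approximations):

use leptos_reactive::{create_runtime, run_scope_undisposed};

fn main() {
    let runtime = create_runtime();

    // run_scope_undisposed hands back a disposer instead of cleaning up for you.
    let (value, _scope_id, disposer) = run_scope_undisposed(runtime, |cx| {
        // ...create signals, resources, etc. on `cx`...
        42
    });
    println!("{value}");

    // Skipping this call is exactly the "memory will leak" case from the docs.
    disposer.dispose();
}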

So, I've managed to fix the memory leak I created in my example above. I'll need to do a little more testing to make sure I've got it figured out in the Actix integration case as well. I'm out of time this morning but should be able to finish this tonight, I'd think.

Zk2u commented 1 year ago

That made me laugh. Nice job. Let me know when you have, and I’ll rerun the initial stress test to confirm.

gbj commented 1 year ago

I was actually able to reproduce the memory leak that I think is causing most of this problem.

It has to do with the interaction between spawning a large number of Tokio tasks and memory allocators not returning memory to the system.

I could toggle the memory leak on and off by rendering an app with a single create_resource, and then toggling a tokio::task::spawn_local on and off within leptos::spawn_local, which is used to load async resources.
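
A stripped-down way to observe that interaction, independent of Leptos (a hypothetical sketch, not the actual code), is to spawn a large burst of short-lived tasks on a Tokio LocalSet and compare the process's resident memory before and after:

use tokio::task::{self, LocalSet};

#[tokio::main]
async fn main() {
    let local = LocalSet::new();
    local
        .run_until(async {
            // A burst of many small !Send tasks, roughly what one
            // spawn_local per resource load amounts to under heavy traffic.
            for i in 0..1_000_000u64 {
                let _ = task::spawn_local(async move {
                    // trivial work standing in for resolving a resource
                    let _ = i.wrapping_mul(2);
                });
                if i % 10_000 == 0 {
                    task::yield_now().await; // let spawned tasks run and finish
                }
            }
        })
        .await;
    // Compare the process's resident memory here with the value before the
    // burst (via ps, top, or container metrics); with the default allocator
    // it often stays elevated even after all the tasks have completed.
}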

It's entirely possible I'm using Tokio wrong here and there's an easy solution. But see also discussion in several Tokio issues:

https://github.com/tokio-rs/tokio/issues/4406

https://github.com/tokio-rs/tokio/issues/4321#issuecomment-1000237820

Zk2u commented 1 year ago

So this isn't actually a “leak”, but rather the allocator holding onto the memory after it has finished handling requests, because reusing it is faster?

What happens when we run it in a VM with a limited amount of memory?

gbj commented 1 year ago

Nope, I was wrong: the longer I ran it, the clearer it was that this is a real leak. I have managed to create a streaming, non-leaking version, so I think I should be able to get those changes integrated into the Actix and Axum pieces.

gbj commented 1 year ago

...Okay, sorry for all the back and forth here. I've managed to create several different memory leaks during my experiments, usually by doing something in a relatively stupid way. (For example, my "it was clearly a real leak" above was due to a test that looped over the synchronous render_to_string, spawning a basically infinite number of async resources outside any context in which they could be handled properly.)

Importantly: when I isolate the largest possible chunks of Leptos code from the Axum or Actix integrations (literally everything that generates the streaming response) and run it in a very tight loop, logging the streams of HTML to the console with no Actix or Axum server involved, I am unable to produce any kind of memory leak. Memory stays stable and fairly low (about 5MB in my Actix test).

I'll merge my fix to render_to_stream, which closed a real memory leak, although I don't think that code path was actually being used directly in the integrations. I'd recommend trying your test again once I've merged it, just in case. Importantly, I'd recommend letting it run for a long time and seeing whether memory ever plateaus or whether it's continuous, linear growth.

And maybe see whether using jemalloc, which several places recommend for returning memory to the OS more aggressively, does anything to the memory use.

To use jemalloc, I think you can just add it to Cargo.toml

[target.'cfg(not(target_env = "msvc"))'.dependencies]
jemallocator = "0.5"

and then in main.rs

#[cfg(not(target_env = "msvc"))]
use jemallocator::Jemalloc;

// Swap the global allocator for jemalloc on non-MSVC targets.
#[cfg(not(target_env = "msvc"))]
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

Beyond that, I'm not really sure what to say. I've spent a lot of time testing all the different combinations and I'm unable to produce a memory leak from within Leptos's code... If you're able to produce a reproduction of a leak that's using Leptos in a plain binary, independent of Actix or Axum, I'd be really happy to take a look, but I don't think there's anything else I can do at this point.

EDIT: Oops managed to click the wrong button and close and reopen this. Thanks, GitHub.

Zk2u commented 1 year ago

I'm going to have a go at containerising the server, limiting its memory, and then stressing it to see whether it runs out of memory or not. I'll report back if this turns out to be an actual issue with Leptos.

I am curious as to why it keeps increasing, however, since I'd expect it to plateau at some point and start reusing allocated memory. The RPS doesn't change in any major way throughout the bench.
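
Limiting the container's memory is a single flag with Docker; something along these lines (the image name and port are hypothetical):

docker run --rm -m 32m -p 3000:3000 valeralabs-web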

Zk2u commented 1 year ago

I managed to containerise the server and... boom. That confirms it. There isn't a memory leak, but the allocator is not returning the memory to the system.

Again, the same continuous benchmark as before, this time for 10 minutes. The web server's container was limited to 32 MB of memory, and the server had no issues staying within that, so I'm marking this as done.

Edit: I'm interested in what those drops are and what controls them...

[Screenshots: the container's memory usage graphs during the 10-minute benchmark]

gbj commented 1 year ago

Beautiful! And thanks for raising it... Having worked through it all, I feel much more confident than I did before that we're not doing anything wrong.