rust-lang / regex

An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.
https://docs.rs/regex
Apache License 2.0

calling `captures_iter` in an `async` function may cause memory leak #1214

Open ziyoung opened 1 month ago

ziyoung commented 1 month ago

What version of regex are you using?

Latest 1.10.5

Describe the bug at a high level.

Calling captures_iter on a static Regex instance in an async function may cause a memory leak.

What are the steps to reproduce the behavior?

// complete code: https://github.com/ziyoung/memory-leak-example/blob/master/src/main.rs

use std::time::Duration;

use once_cell::sync::Lazy;
use regex::Regex;

static REG: Lazy<Regex> = Lazy::new(|| Regex::new(r#"(Object\.\w+)"#).unwrap());

async fn memory_leak_demo() -> Vec<u8> {
    let text = reqwest::get("https://cdnjs.cloudflare.com/ajax/libs/vue/3.4.34/vue.cjs.js")
        .await
        .unwrap()
        .text()
        .await
        .unwrap();
    let mut buf = Vec::from(text.as_bytes());
    for (i, caps) in REG.captures_iter(&text).enumerate() {
        println!("process text: {}", i + 1);
        tokio::time::sleep(Duration::from_millis(50)).await;
        let m = caps.get(0).unwrap();
        let start = m.start();
        let end = m.end();
        let text = &text[start..end];
        buf.extend(text.as_bytes());
    }
    buf
}

async fn task() {
    for _ in 0..50 {
        let buf = memory_leak_demo().await;
        let s = String::from_utf8(buf).unwrap_or_default();
        println!("string length: {}", s.len());
    }
}

#[tokio::main]
async fn main() {
    tokio::spawn(async {
        for i in 0..5 {
            task().await;
        }
    })
    .await
    .unwrap();
    println!("done");
}

What is the actual behavior?

The memory allocated during the execution of the captures_iter method was not released. I dumped a memory profile while the program was running: https://github.com/ziyoung/memory-leak-example?tab=readme-ov-file#regex-memory-issue

If I change my code as follows, there is no memory leak. https://github.com/ziyoung/memory-leak-example/pull/1/files#diff-42cb6807ad74b3e201c5a7ca98b911c5fa08380e942be6e4ac5807f8377f87fc

+ fn get_tag_positions(src: &str) -> Vec<(usize, usize)> {
+     REG
+         .clone()
+         .captures_iter(src)
+         .map(|caps| {
+             let m = caps.get(0).unwrap();
+             (m.start(), m.end())
+         })
+         .collect::<Vec<_>>()
+ }
// ...
- for (i, caps) in REG.captures_iter(&text).enumerate() {
+ for (i, (start, end)) in get_tag_positions(&text).into_iter().enumerate() {

What is the expected behavior?

There's no memory leak.

BurntSushi commented 1 month ago

The memory allocated during the execution of the captures_iter method was not released.

This doesn't imply a leak. When you run a regex, it has to allocate space to run the search. This space is tied to the Regex and it's cached in a thread-safe manner. The amount of memory allocated scales proportionally to the maximum number of simultaneous uses of the Regex. Not all memory allocated is kept, but some is.

When you clone the Regex, you're creating a new pool of memory that gets dropped at the end of the function.

The expectation that all memory allocated during a search gets released is just wrong. A memory leak is something that happens when memory used grows without bound given an otherwise fixed set of resources.
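
To make that concrete, here is a minimal sketch of the behavior described above (illustrative only; the pattern, haystack, and names are assumptions, not from the issue): sequential searches reuse the same pooled scratch memory, while simultaneous searches each need their own, so peak memory tracks concurrency rather than total work.

use once_cell::sync::Lazy;
use regex::Regex;

static REG: Lazy<Regex> = Lazy::new(|| Regex::new(r"(Object\.\w+)").unwrap());

fn main() {
    let haystack = "Object.keys Object.values Object.entries ".repeat(1_000);

    // Sequential searches: each pass borrows scratch memory from the Regex's
    // internal pool and returns it when the iterator is dropped, so memory
    // stays flat no matter how many times this loop runs.
    for _ in 0..10_000 {
        assert!(REG.find_iter(&haystack).count() > 0);
    }

    // Simultaneous searches: each thread needs its own scratch memory while
    // its search is running, so peak memory scales with the number of
    // concurrent searches, not with the total number of searches ever run.
    std::thread::scope(|s| {
        for _ in 0..8 {
            s.spawn(|| {
                let _ = REG.find_iter(&haystack).count();
            });
        }
    });
}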

ziyoung commented 1 month ago

Thanks. Cloning the original regex object does indeed avoid the growth of memory usage.

-    for (i, caps) in REG.captures_iter(&text).enumerate() {
+    for (i, caps) in REG.clone().captures_iter(&text).enumerate() {
// Total: 1.3M
for (i, caps) in REG.clone().captures_iter(&text).enumerate() { // <--- 0.8M
  // ...
}

According to the regex documentation, we should indeed avoid using a regex across threads. So this is not a bug. Closing this issue.

BurntSushi commented 1 month ago

Using a regex across threads is fine. And cloning a regex for every search is almost certainly wrong and will lead to much much slower searches. Your peak memory usage should generally be the same in either case.

BurntSushi commented 1 month ago

And that section in the regex docs is specifically about contention on shared mutable resources leading to slower searches. Not memory usage.
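
For reference, here is a minimal sketch of one way to sidestep that contention (names are illustrative, and it assumes, as the comments above imply, that cloning a Regex shares the compiled program and only sets up a fresh scratch pool): clone once per thread, not once per search.

use once_cell::sync::Lazy;
use regex::Regex;

static REG: Lazy<Regex> = Lazy::new(|| Regex::new(r"(Object\.\w+)").unwrap());

thread_local! {
    // One clone per thread, made once. Each clone keeps its own scratch pool,
    // so threads never contend on shared mutable state, and the cost of the
    // clone is paid once per thread instead of once per search.
    static LOCAL_REG: Regex = REG.clone();
}

fn count_matches(haystack: &str) -> usize {
    LOCAL_REG.with(|re| re.find_iter(haystack).count())
}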

ziyoung commented 1 month ago

Using a regex across threads is fine.

If I execute memory_leak_demo more than 2000 times, I find that the memory usage keeps growing. Suppose I am writing an API server: for each request, I need to call REG.captures_iter, and the memory usage keeps growing. This is unacceptable.

Times | Total | Regex memory usage
------|-------|-------------------
500   | 7.5M  | 7M
1000  | 18.1M | 17.6M
1500  | 31.7M | 31.2M
2000  | 41.3M | 40.8M
2250  | 51.8M | 51.3M
2500  | 60.4M | 59.9M

Is there an upper limit to the memory occupied by regex?

ziyoung commented 1 month ago

After testing, memory usage does not keep growing when executing the code snippet below (no regex cloning here). Can we say: if you use captures_iter in an async function, wrap it in a synchronous function with no await points?

+ fn get_tag_positions(src: &str) -> Vec<(usize, usize)> {
+     REG
+         .captures_iter(src)
+         .map(|caps| {
+             let m = caps.get(0).unwrap();
+             (m.start(), m.end())
+         })
+         .collect::<Vec<_>>()
+ }
// ...
- for (i, caps) in REG.captures_iter(&text).enumerate() {
+ for (i, (start, end)) in get_tag_positions(&text).into_iter().enumerate() {

BurntSushi commented 1 month ago

I finally had a chance to take a quick peek at this, and your MRE is over-complicated. So I trimmed it down to this:

use std::time::Duration;

use once_cell::sync::Lazy;
use regex::Regex;

static REG: Lazy<Regex> = Lazy::new(|| Regex::new(r#"(Object\.\w+)"#).unwrap());

async fn memory_leak_demo() -> Vec<u8> {
    let text = tokio::fs::read_to_string("vue.cjs.js").await.unwrap();
    let mut buf = Vec::from(text.as_bytes());
    for (i, caps) in REG.captures_iter(&text).enumerate() {
        println!("process text: {}", i + 1);
        tokio::time::sleep(Duration::from_millis(50)).await;
        let m = caps.get(0).unwrap();
        let start = m.start();
        let end = m.end();
        let text = &text[start..end];
        buf.extend(text.as_bytes());
    }
    buf
}

async fn task() {
    for _ in 0..200_000 {
        let buf = memory_leak_demo().await;
        let s = String::from_utf8(buf).unwrap_or_default();
        println!("string length: {}", s.len());
    }
}

#[tokio::main]
async fn main() {
    tokio::spawn(async {
        task().await;
    })
    .await
    .unwrap();
    println!("done");
}

With this Cargo.toml:

[package]
name = "memory-leak-example"
version = "0.1.0"
edition = "2021"

[dependencies]
tokio = { version = "1", features = ["full", "tracing"] }
regex = "1.10.5"
once_cell = "1"
anyhow = "1"

Notice that task now has a loop that runs 200_000 times. I ran this with cargo run -r. I watched RES memory usage in htop and it remains fixed. No increases. That means there are no leaks.

Your measurement process is likely flawed. Diagnosing a memory leak is hard. But the fact that you're seeing differences between async and sync should be a tip-off that perhaps there is something about async that is causing higher peak memory usage for you. For example, if you have a bunch of tasks running simultaneously, perhaps with some paused because of your await points, then the other running tasks cannot access or use the memory used by regex in the paused tasks. So those other tasks will have to create their own memory. In some cases, that memory is dropped, and in others it is put back into the pool for reuse. But the invariant remains: memory usage scales proportionally with the number of simultaneous uses of the Regex. So if you have a lot of simultaneous uses, it could look like a leak, but it isn't.

If you want to demonstrate a leak, then please provide a simple program that I can run where the resident memory usage, as reported by a process monitor like htop, grows without bound while otherwise using fixed resources. Running a memory profile is a great technique for debugging leaks, but a profile by itself isn't dispositive of a leak.

ziyoung commented 1 month ago

I watched RES memory usage in htop and it remains fixed. No increases. That means there are no leaks.

I used htop to monitor the memory usage and it showed no significant increase. The memory profile file here misled me.

But the invariant remains: memory usage scales proportionally with the number of simultaneous uses of the Regex.

I wrote the following code to simulate a large number of regex matches in a short period of time. Putting the regex matching in a synchronous block reduces the number of simultaneous matches and reduces memory usage. I think the problem that came up in my project has something to do with this. Of course, this is not a memory leak; it is just a matter of how the regex is used.

use std::{env, time::Duration};

use once_cell::sync::Lazy;
use regex::Regex;
use tokio::{fs, task::JoinSet};

static REG: Lazy<Regex> = Lazy::new(|| Regex::new(r#"(Object\.\w+)"#).unwrap());

async fn memory_leak_demo() -> Vec<u8> {
    let file = fs::read("./vue.js").await.unwrap();
    let mut buf = file.clone();
    let text = String::from_utf8(file).unwrap();

    if env::var("SYNC_REG_MATCH").is_ok() {
        let positions = REG
            .captures_iter(&text)
            .map(|caps| {
                let m = caps.get(0).unwrap();
                (m.start(), m.end())
            })
            .collect::<Vec<_>>();
        for (start, end) in positions {
            tokio::time::sleep(Duration::from_millis(5)).await;
            buf.extend(text[start..end].as_bytes());
        }
    } else {
        for caps in REG.captures_iter(&text) {
            tokio::time::sleep(Duration::from_millis(5)).await;
            let m = caps.get(0).unwrap();
            let start = m.start();
            let end = m.end();
            let text = &text[start..end];
            buf.extend(text.as_bytes());
        }
    }
    buf
}

#[tokio::main]
async fn main() {
    {
        tokio::spawn(async {
            for i in 0..20 {
                let mut set = JoinSet::new();
                for _ in 0..10_000 {
                    set.spawn(memory_leak_demo());
                }
                while let Some(_res) = set.join_next().await {}
                println!("task {} done", i);
                tokio::time::sleep(Duration::from_millis(10)).await;
            }
        })
        .await
        .unwrap();
    }
    println!("use CTRL + C to exit");
    tokio::signal::ctrl_c().await.unwrap();
}

./memory-leak-example: memory usage: about 1099M

SYNC_REG_MATCH=1 ./memory-leak-example: memory usage: about 90M

BurntSushi commented 1 month ago

Yes, as I said, the likely reason here is that in your "synchronous" version, there are no await points while you're collecting the matches in your regex. So your async task cannot be stopped while you're consuming captures_iter. But in your asynchronous task, you have await points while traversing captures_iter. If a task is stopped during iteration and another starts in its place, then the second task can't use the same memory that the first task was using because it's still being borrowed by captures_iter. Once both tasks complete, the memory is either dropped or put back in the pool.

So I think the lesson here is to be careful of introducing await points while consuming a regex match iterator, because a regex match iterator has memory associated with it that can't be shared with another iterator so long as it's live.
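
Distilled into a sketch (essentially the same pattern as get_tag_positions above, with illustrative names): keep the iterator's whole lifetime free of await points, so its scratch memory is back in the pool before the task can be suspended.

use std::time::Duration;
use regex::Regex;

async fn process(re: &Regex, text: &str) -> Vec<u8> {
    // The match iterator lives only inside this expression; by the time the
    // first await below runs, its scratch memory has gone back to the pool.
    let spans: Vec<(usize, usize)> =
        re.find_iter(text).map(|m| (m.start(), m.end())).collect();

    let mut buf = Vec::new();
    for (start, end) in spans {
        tokio::time::sleep(Duration::from_millis(5)).await;
        buf.extend(text[start..end].as_bytes());
    }
    buf
}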

mindreader commented 1 month ago

I've been debugging a memory "leak" that I finally tracked down to this allocation of memory.

The application I maintain (the Rust port of the Matomo device detector) has tens of thousands of regexes, and they must be run concurrently. Locking every individual regex is not great, cloning before every match is not great, periodically reinitializing each regex through Lazy might somehow be doable (maybe), and limiting the concurrency of queries to the system is really not great. But I'm not sure what else to do, as the memory gradually creeps up over the span of weeks.

Maybe the statistics on how much memory is cached could be made accessible through the regex somehow, and the cache could be cleared occasionally?

BurntSushi commented 1 month ago

@mindreader It might be worth pursuing a compaction strategy: https://github.com/rust-lang/regex/blob/ab88aa5c6824ebe7c4b4c72fe5191681783b3a68/regex-automata/src/util/pool.rs#L318-L320

Or alternatively, we could be more aggressive about throwing away scratch memory here: https://github.com/rust-lang/regex/blob/ab88aa5c6824ebe7c4b4c72fe5191681783b3a68/regex-automata/src/util/pool.rs#L607-L625

Today, we only throw memory away when there is too much contention on the shared resource. But maybe there should be other cases where we throw it away.

It seems hard to pick a correct heuristic for this. And adding a method to "clear cached memory because it might get too big over time" seems pretty unfortunate.

I'm going to re-open this and think about it a bit more.

mindreader commented 1 month ago

Thanks for pointing out the relevant code. I will definitely take a closer look.

BurntSushi commented 1 month ago

Another point to consider is that the memory increase should only occur if you're iterating over matches and there are await points within the handling of each element. If there aren't any await points, then there isn't any opportunity for the scratch memory to be held by a task and not sent back to the pool.

This happens because memory is retrieved at the start of iteration and not put back until the iterator is dropped. This is done to avoid needing to retrieve this memory for each search. But perhaps this is the wrong trade-off to make when each iteration has an await point.
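
If collecting matches up front isn't an option, here is a rough sketch of the other side of that trade-off (illustrative names, not an official recommendation): drive the search one call at a time with find_at, so each call takes scratch memory from the pool and returns it before the await, at the cost of re-fetching that memory for every match.

use std::time::Duration;
use regex::Regex;

// Assumes the pattern can never match the empty string (true for something
// like `(Object\.\w+)`); the high-level iterators handle empty matches for you.
async fn process(re: &Regex, text: &str) -> Vec<u8> {
    let mut buf = Vec::new();
    let mut at = 0;
    while let Some(m) = re.find_at(text, at) {
        // No iterator is alive across this await, so no pool memory is pinned here.
        tokio::time::sleep(Duration::from_millis(5)).await;
        buf.extend(text[m.start()..m.end()].as_bytes());
        at = m.end();
    }
    buf
}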

mindreader commented 1 month ago

Everything you say tracks: it shouldn't lose memory at all, but it seems to in my case. But I need to find a truly minimal test case before I can even begin to understand why.

I had no idea how much effort went into this. While I intend to experiment with this, I have little hope that I will discover something you haven't, so feel free to close this, and if by some miracle I do find something workable I promise I will respond or open a PR.

BurntSushi commented 1 month ago

Do you have await points in your captures_iter loop?

mindreader commented 1 month ago

My case isn't exactly like the OP's: I'm not capturing over await points per se, but rather operating on a single very large collection of Regexes shared via an Arc across separate threads. There are actually no awaits in the matching code, just in the outer loop, and no data is shared between loop iterations. That means every regex should basically have allocated its own scratch memory, and that scratch should be reclaimed except for the "owner" thread's.

But the results from the tests I'm running don't really make sense to me. I have far too many variables in my code to consider this a worthwhile bug report; please, just let me get some spare time to try and remove everything that could be complicating this before taking anything I say seriously.

BurntSushi commented 1 month ago

Aye, understood. Thank you for investigating!

mindreader commented 1 month ago

I had some time this weekend to run a lot of tests and now I understand how it all works. It isn't a loss of memory; it is just that over time, if your application uses a lot of different threads, you will accrue 8 full cache lines on each regex, and they will never go away.

It isn't a matter of lots of concurrency; merely accessing the regexes from many threads over time will eventually fill up the cache, even if each one is accessed by one thread at a time. That's why this happens to me over a span of weeks.

The following change was enough to reduce my memory usage in real-world scenarios by 50%+ while increasing CPU usage by only 15%, a tradeoff I would happily accept, were it not for the difficulty of exposing that level of configurability without forking.

diff --git a/regex-automata/src/util/pool.rs b/regex-automata/src/util/pool.rs
index d90d4ec..be5142e 100644
--- a/regex-automata/src/util/pool.rs
+++ b/regex-automata/src/util/pool.rs
@@ -328,7 +328,7 @@ mod inner {
     ///
     /// See this issue for more context and discussion:
     /// https://github.com/rust-lang/regex/issues/934
-    const MAX_POOL_STACKS: usize = 8;
+    const MAX_POOL_STACKS: usize = 1;

     thread_local!(
         /// A thread local used to assign an ID to a thread.
@@ -628,7 +628,7 @@ mod inner {
         /// Create a guard that represents the special owned T.
         #[inline]
         fn guard_owned(&self, caller: usize) -> PoolGuard<'_, T, F> {
-            PoolGuard { pool: self, value: Err(caller), discard: false }
+            PoolGuard { pool: self, value: Err(caller), discard: true }
         }

So we get rid of the owner thread cache because it is basically used once and never again in this scenario. And we remove all but the first cache since it will almost definitely never need more than one at a time, and I'm willing to allocate in the case that we do.

Since memory can only be reclaimed upon use of the regex, that greatly limits the ways in which you could optimize this. I thought of maybe a cache where each cache line was thread-independent, but it would take some real time to validate that theory, and I have a feeling it would definitely hurt some benchmarks.

BurntSushi commented 3 weeks ago

I had some time this weekend to run a lot of tests and now I understand how it all works. It isn't a loss of memory; it is just that over time, if your application uses a lot of different threads, you will accrue 8 full cache lines on each regex, and they will never go away.

Ah I see. That's interesting. I do still wonder if the number of threads has a role to play here though, since each cache line is a stack of reusable allocations. As long as there isn't any contention, these stacks can continue to grow, I think. But I suppose growth is only possible when there is actual concurrency happening.

Some thoughts:

  1. I am generally okay with exposing some kind of knob for this in the public API. For regex, I would prefer not to, or if so, something simplified. For regex-automata, I'm generally more open to exposing more knobs.
  2. I am generally open to the idea of some kind of automatic compaction. I very specifically did not build compaction into the pool because I don't really have a target to optimize for. If someone invested the time to make an MRE of their use of regexes, I wouldn't mind investigating how compaction might help in that case.
  3. While you may know this, it's important to note that this pool isn't just about reusing allocations. That's a part of it, but the bigger part of it is actually avoiding the recalculation of work already done. Specifically, for the lazy DFA, the transition table is populated at search time. So when you reuse that allocation, you avoid needing to re-populate transitions that have already been created.
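
On points 1 and 3, here is a rough sketch of what caller-owned scratch memory can look like at the regex-automata level, assuming the create_cache/search_with API as documented for regex-automata 0.4's meta::Regex (the names and loop structure here are illustrative):

use regex_automata::{meta::Regex, Input};

// Sketch assuming `create_cache`/`search_with` as documented for regex-automata
// 0.4's meta::Regex. The caller owns the Cache: it is created once per worker,
// reused for every search (so lazily built DFA transitions are kept), and freed
// exactly when the worker drops it. No shared pool, no per-thread accumulation.
fn worker(re: &Regex, haystacks: &[String]) -> usize {
    let mut cache = re.create_cache();
    let mut total = 0;
    for hay in haystacks {
        let mut at = 0;
        // Assumes the pattern cannot match the empty string, so `at` always advances.
        while let Some(m) = re.search_with(&mut cache, &Input::new(hay).span(at..hay.len())) {
            total += 1;
            at = m.end();
        }
    }
    total
}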

mindreader commented 3 weeks ago

It is an interesting problem. I will give it a shot if I can get enough spare time. I can't honestly give super good odds of success.