rust-lang / regex

An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.
https://docs.rs/regex
Apache License 2.0

parser uses a surprising amount of memory for simple but long patterns #1090

Closed · lilydjwg closed 1 year ago

lilydjwg commented 1 year ago

What version of regex are you using?

1.9.5

Describe the bug at a high level.

I compiled a regex with a lot of literals as alternatives (~700KiB) and it consumed way too much memory (256MiB).

What are the steps to reproduce the behavior?

Prepare a text file with content like abc|def|text|... but about 700KiB long and with ~64K |s (it could be generated from a dict file; see the sketch at the end of these steps).

fn main() {
  let s = std::fs::read_to_string("file").unwrap();
  let r = regex::Regex::new(&s);
  std::thread::sleep(std::time::Duration::from_secs(100));
  drop(r);
}

And then look at the process's memory usage.
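
As a concrete way to generate such a file, here is a small sketch of mine (not from the original report) that assumes a word list exists at /usr/share/dict/words:

fn main() {
    // Join every word in a system dictionary with `|` to produce a large
    // alternation pattern roughly like the one described above.
    let words = std::fs::read_to_string("/usr/share/dict/words").unwrap();
    let pattern = words.lines().collect::<Vec<_>>().join("|");
    std::fs::write("file", pattern).unwrap();
}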

What is the actual behavior?

It took up about 256MiB of memory. Python only used 31MiB.

What is the expected behavior?

I expect it to use less memory than Python, not a lot more.

BurntSushi commented 1 year ago

I need a reproduction please. The code you've provided is not runnable by me. You can upload data files to GitHub issues.

lilydjwg commented 1 year ago

I generated one. This is not the same as the one I've been using, but it has the same issue (gunzip it first).

file.gz

BurntSushi commented 1 year ago

Okay, there are some knots to untangle here. First, you aren't checking whether regex compilation succeeds or not. In the case you've given me, it does not, because it exceeds the default size limit. The regex is compiled eagerly, and if during the course of compilation it exceeds the limit, compilation is stopped and an error is returned. That means that if you increase the limit and let the regex actually compile, it's likely to use more memory than what you're observing. We should also actually use the regex to check that it works.
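
As a minimal illustration (my own sketch, not part of the original report), checking the result makes this visible immediately; with the default size limit, building this pattern returns an error rather than a regex that quietly uses less memory:

fn main() {
    let s = std::fs::read_to_string("file").unwrap();
    // With the default size limit, compilation of this pattern fails, so any
    // memory measured here is not the memory of a working regex.
    match regex::Regex::new(&s) {
        Ok(_) => println!("regex compiled within the default size limit"),
        Err(err) => eprintln!("regex failed to compile: {}", err),
    }
}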

So let's do that first. I'll use my shell's built-in time command to measure peak memory usage. I've also inserted some print and timing statements to get a sense of where the program is spending time:

use std::time::{Duration, Instant};

fn main() {
    env_logger::init();
    let (s, elapsed) = timeit(|| std::fs::read_to_string("file").unwrap());
    println!("pattern read from file: {:?}", elapsed);

    let (r, elapsed) = timeit(|| {
        regex::RegexBuilder::new(&s)
            .size_limit(100 * (1 << 20))
            .build()
            .unwrap()
    });
    println!("regex compiled: {:?}", elapsed);

    let (m, elapsed) = timeit(|| {
        r.find("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!zygote")
            .expect("a regex match")
    });
    println!("search finished: {:?}", elapsed);
    assert_eq!(45, m.start());
}

fn timeit<T>(mut f: impl FnMut() -> T) -> (T, Duration) {
    let start = Instant::now();
    let t = f();
    let elapsed = Instant::now().duration_since(start);
    (t, elapsed)
}

And to compile and run it:

$ cargo b --release
    Finished release [optimized] target(s) in 0.00s

$ time ./target/release/i1090
pattern read from file: 342.791µs
regex compiled: 220.353665ms
search finished: 96.698486ms

real    0.326
user    0.249
sys     0.076
maxmem  383 MB
faults  0

All right, now we need a baseline to compare it to. You made a claim about the memory usage being greater than Python's, but you didn't provide an equivalent Python program. Thankfully I know Python and how to run Python programs, so I'll do that part for you:

import re
import time

def main():
    s, elapsed = timeit(lambda: open('file', 'r').read())
    print(f"pattern read from file: {elapsed}s")

    r, elapsed = timeit(lambda: re.compile(s))
    print(f"regex compiled: {elapsed}s")

    m, elapsed = timeit(
        lambda: r.search("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!zygote")
    )
    print(f"search finished: {elapsed}s")
    assert m.start() == 45

def timeit(f):
    start = time.time()
    t = f()
    elapsed = time.time() - start
    return t, elapsed

if __name__ == '__main__':
    main()

And to run it:

$ time python main.py
pattern read from file: 0.0006263256072998047s
regex compiled: 0.7324204444885254s
search finished: 0.00032448768615722656s

real    0.744
user    0.734
sys     0.010
maxmem  131 MB
faults  0

So the Python program uses less memory than the Rust program, but the difference is not as great as you mentioned.

Generally speaking, this is not unexpected. Python uses a backtracking regex engine, and so I would expect it to use less memory overall. The regex crate uses finite automata to guarantee linear time, and this generally leads to greater memory usage.

So, actually looking at your input, you do not have an alternation of literals. Most of your alternates are literals, but some are regexes due to the presence of the . meta-character. So let's put each literal on its own line, read them one line at a time, escape them and then join them back together with |. I'll also note that you had a trailing | in your file, which adds an empty alternate that matches the empty string. I would guess that was unintended, so I removed it. Here are the updated programs and timings. For Rust:

use std::time::{Duration, Instant};

fn main() {
    let (s, elapsed) = timeit(|| {
        std::fs::read_to_string("file")
            .unwrap()
            .lines()
            .map(regex::escape)
            .collect::<Vec<String>>()
            .join("|")
    });
    println!("pattern read from file: {:?}", elapsed);

    let (r, elapsed) = timeit(|| {
        regex::RegexBuilder::new(&s)
            .size_limit(100 * (1 << 20))
            .build()
            .unwrap()
    });
    println!("regex compiled: {:?}", elapsed);

    let (m, elapsed) = timeit(|| {
        r.find("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!zygote")
            .expect("a regex match")
    });
    println!("search finished: {:?}", elapsed);
    assert_eq!(45, m.start());
}

fn timeit<T>(mut f: impl FnMut() -> T) -> (T, Duration) {
    let start = Instant::now();
    let t = f();
    let elapsed = Instant::now().duration_since(start);
    (t, elapsed)
}

And Python:

import re
import time

def main():
    s, elapsed = timeit(
        lambda: '|'.join(map(re.escape, open('file', 'r').read().splitlines()))
    )
    print(f"pattern read from file: {elapsed}s")

    r, elapsed = timeit(lambda: re.compile(s))
    print(f"regex compiled: {elapsed}s")

    m, elapsed = timeit(
        lambda: r.search("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!zygote")
    )
    print(f"search finished: {elapsed}s")
    assert m.start() == 45

def timeit(f):
    start = time.time()
    t = f()
    elapsed = time.time() - start
    return t, elapsed

if __name__ == '__main__':
    main()

And now to run the programs:

$ cargo b --release

$ time ./target/release/i1090
pattern read from file: 7.74721ms
regex compiled: 203.093996ms
search finished: 3.709µs

real    0.222
user    0.152
sys     0.069
maxmem  322 MB
faults  0

$ time python main.py
pattern read from file: 0.0295107364654541s
regex compiled: 0.7328767776489258s
search finished: 0.0003745555877685547s

real    0.774
user    0.720
sys     0.053
maxmem  131 MB
faults  0

Memory usage for the Rust program dropped a little bit, but not by much. However, the search time dropped quite a bit. This is because the pattern is now an alternation of plain literals, so the regex crate can build an Aho-Corasick automaton instead of compiling a general regex, and it is likely using Aho-Corasick for the search as well.

To check whether memory usage is as I would expect, I ran the Rust program above under heaptrack. It turns out that nearly all of the memory is being used... by the regex parser. Yikes. That's not so good. This means memory usage has nothing to do with the actual compiled regex object itself, but rather with its Ast.

I'll leave this bug open for now to see if I can investigate it, but if you really only need to look for an alternation of literals, you can just use the aho-corasick crate directly:

use std::time::{Duration, Instant};

fn main() {
    env_logger::init();
    let (patterns, elapsed) = timeit(|| {
        std::fs::read_to_string("file")
            .unwrap()
            .lines()
            .map(|line| line.to_string())
            .collect::<Vec<String>>()
    });
    println!("pattern read from file: {:?}", elapsed);

    let (r, elapsed) = timeit(|| {
        aho_corasick::AhoCorasick::builder()
            .match_kind(aho_corasick::MatchKind::LeftmostFirst)
            .build(&patterns)
            .unwrap()
    });
    println!("aho-corasick compiled: {:?}", elapsed);

    let haystack = format!("{}zygote", "!".repeat(10_000));
    let (m, elapsed) = timeit(|| r.find(&haystack).expect("a regex match"));
    println!("search finished: {:?}", elapsed);
    assert_eq!(10_000, m.start());
}

fn timeit<T>(mut f: impl FnMut() -> T) -> (T, Duration) {
    let start = Instant::now();
    let t = f();
    let elapsed = Instant::now().duration_since(start);
    (t, elapsed)
}

And now to compile and run it. We can see memory usage is way lower, even lower than Python's:

$ cargo b --release
   Compiling i1090 v0.1.0 (/home/andrew/tmp/issues/regex/i1090)
    Finished release [optimized + debuginfo] target(s) in 0.65s

$ time ./target/release/i1090
pattern read from file: 4.236067ms
aho-corasick compiled: 19.070214ms
search finished: 12.952µs

real    0.025
user    0.025
sys     0.000
maxmem  12 MB
faults  0
lilydjwg commented 1 year ago

I'm sorry I left some issues in the example program, but they weren't present in my actual program (in which I escaped meta-characters, got the size_limit error, raised it, and it worked correctly).

Btw, I didn't measure max memory. Instead, I looked at the resident memory during a long run.

I've switched to aho-corasick. It took me a few attempts before it worked as expected, and yes, memory usage is low now.

Glad to hear about the heaptrack tool. It seems very useful. Thank you.

BurntSushi commented 1 year ago

Btw, I didn't measure max memory. Instead, I looked at the resident memory during a long run.

Aye. This is the sort of thing that should have been in the original issue. You need to provide the full context not only for how you're producing the problem but also for how you're measuring it. Notice how I did that in my comment, and it enabled you to see a discrepancy between our measurement methods!

BurntSushi commented 1 year ago

Sadly, this looks a little tricky to fix, even with a band-aid. One possible band-aid is to introduce some boxing inside the Ast types. Namely, an Ast is over 200 bytes on my x86-64 target. Since every single codepoint in the pattern gets its own Ast, it follows that every codepoint is using over 200 bytes of memory. (The pattern is about 1,000,000 codepoints. If each codepoint uses about 200 bytes, then this definitely explains where the vast majority of the memory being used comes from.) This could be reduced substantially with a few well-placed Boxes, but unfortunately, the Ast type is completely exposed in the public API of regex-syntax. Doing a breaking change of regex-syntax is not a huge deal, but I just did one and don't have the energy to do another just yet. So at some point, I should be able to mostly mitigate this in the regex-syntax 0.8.0 release, whenever that is.

It's not clear to me if more fixes will be needed beyond that. If so, it might require deeper changes to the parser or perhaps even a "fast" path. But if we can get the Ast down to 24 bytes or so via boxing, then that would result in approximately an order of magnitude improvement. That's probably good enough.
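
To make the effect of boxing concrete, here is a toy sketch (the types below are made up for illustration and are not the actual regex-syntax Ast): an enum is as large as its largest variant, so putting a big payload behind a Box collapses the whole enum to roughly a tag plus a pointer.

use std::mem::size_of;

// Stand-in for the span/position data a large Ast variant might carry inline.
#[allow(dead_code)]
struct BigPayload {
    data: [u64; 24],
}

// Payload stored inline: the whole enum is as big as BigPayload.
#[allow(dead_code)]
enum Inline {
    Literal(char),
    Big(BigPayload),
}

// Payload behind a Box: the enum shrinks to a tag plus a pointer.
#[allow(dead_code)]
enum Boxed {
    Literal(char),
    Big(Box<BigPayload>),
}

fn main() {
    // On a typical 64-bit target this prints roughly 200 and 16 bytes.
    println!("inline: {} bytes", size_of::<Inline>());
    println!("boxed:  {} bytes", size_of::<Boxed>());
}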

lilydjwg commented 1 year ago

You need to provide the full context not only for how you're producing the problem but also for how you're measuring it.

I'll do so next time. (I thought it was clear that I meant looking at some *top / process-monitor program. I probably shouldn't have written the issue right after hurrying through coding and debugging.)

I have another question: why doesn't the parser free the memory after compilation is done? Such a large memory peak is still an issue on a router, though not so much on a desktop or server.

BurntSushi commented 1 year ago

It does. The issue is that looking at memory usage via top (i.e., RSS as reported by the OS) is a flawed measurement. Just because this library frees the memory doesn't mean the allocator frees it back to the OS. The allocator might keep it around for other uses.

You should be able to confirm this by inserting a shim via a global allocator to report total memory retained. (I have such a shim, and I will share it next time I have my hands on a keyboard. I'm on mobile at the moment.)

BurntSushi commented 1 year ago

Also, if someone is using this library in a constrained environment such as a router, I would be happy to hear more detailed reports of issues that one runs into.

lilydjwg commented 1 year ago

Just because this library frees the memory doesn't mean the allocator frees it back to the OS.

I see. It's a pity, though. This may be a good reason to have swap space, so this kind of never-going-to-be-used-again memory can be swapped out to make room for other uses.

if someone is using this library in a constrained environment such as a router.

Sorry, I don't have my own router, and I don't feel comfortable messing with the ISP's one. People have asked about using my program on their routers, though, and that's one of the reasons I switched from Python to Rust.

BurntSushi commented 1 year ago

I haven't put swap on any of my machines for over a decade.

And yes, Rust isn't magic. Python's regex engine is written in C, and it's a backtracker, so its memory usage is, in principle, not going to be obviously worse than that of any other regex engine written in Rust or C.

BurntSushi commented 1 year ago

Here is a program with the allocator shim:

use std::time::{Duration, Instant};

fn main() {
    let (s, elapsed) = timeit(|| {
        std::fs::read_to_string("file")
            .unwrap()
            .lines()
            .map(regex::escape)
            .collect::<Vec<String>>()
            .join("|")
    });
    println!("pattern read from file: {:?}", elapsed);

    allocator::reset();
    let (r, elapsed) = timeit(|| {
        regex::RegexBuilder::new(&s)
            .size_limit(100 * (1 << 20))
            .build()
            .unwrap()
    });
    allocator::tell();
    println!("regex compiled: {:?}", elapsed);

    let haystack = format!("{}zygote", "!".repeat(10_000));
    let (m, elapsed) = timeit(|| r.find(&haystack).expect("a regex match"));
    println!("search finished: {:?}", elapsed);
    assert_eq!(10_000, m.start());
}

fn timeit<T>(mut f: impl FnMut() -> T) -> (T, Duration) {
    let start = Instant::now();
    let t = f();
    let elapsed = Instant::now().duration_since(start);
    (t, elapsed)
}

mod allocator {
    use std::alloc::{GlobalAlloc, Layout, System};
    use std::sync::atomic::{AtomicUsize, Ordering};

    static TOTAL_ALLOC: AtomicUsize = AtomicUsize::new(0);
    static TOTAL_DEALLOC: AtomicUsize = AtomicUsize::new(0);

    #[global_allocator]
    static ALLOCATOR: TotalAllocator = TotalAllocator;

    struct TotalAllocator;

    unsafe impl GlobalAlloc for TotalAllocator {
        unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
            TOTAL_ALLOC.fetch_add(layout.size(), Ordering::Relaxed);
            System.alloc(layout)
        }

        unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
            TOTAL_DEALLOC.fetch_add(layout.size(), Ordering::Relaxed);
            System.dealloc(ptr, layout)
        }
    }

    pub(crate) fn reset() {
        TOTAL_ALLOC.store(0, Ordering::Relaxed);
        TOTAL_DEALLOC.store(0, Ordering::Relaxed);
    }

    pub(crate) fn tell() {
        let total_alloc = TOTAL_ALLOC.load(Ordering::Relaxed) as isize;
        let total_dealloc = TOTAL_DEALLOC.load(Ordering::Relaxed) as isize;
        eprintln!(
            "allocated: {}, retained: {}",
            total_alloc,
            total_alloc - total_dealloc,
        );
    }
}

And then compile and run it:

$ cargo b --release
   Compiling i1090 v0.1.0 (/home/andrew/tmp/issues/regex/i1090)
    Finished release [optimized + debuginfo] target(s) in 0.37s

$ time ./target/release/i1090
pattern read from file: 8.313858ms
allocated: 683401785, retained: 1451268
regex compiled: 205.86363ms
search finished: 25.806µs

real    0.221
user    0.147
sys     0.073
maxmem  333 MB
faults  0

So there is a lot being allocated overall, but only 1.45MB is retained.

BurntSushi commented 1 year ago

OK, I've got memory usage down below the Python program. Specifically, this Rust program:

use std::time::{Duration, Instant};

fn main() {
    let (s, elapsed) = timeit(|| {
        std::fs::read_to_string("file")
            .unwrap()
            .lines()
            .map(regex::escape)
            .collect::<Vec<String>>()
            .join("|")
    });
    println!("pattern read from file: {:?}", elapsed);

    let (r, elapsed) = timeit(|| {
        regex::RegexBuilder::new(&s)
            .size_limit(100 * (1 << 20))
            .build()
            .unwrap()
    });
    println!("regex compiled: {:?}", elapsed);

    let haystack = format!("{}zygote", "!".repeat(10_000_000));
    let (m, elapsed) = timeit(|| r.find(&haystack).expect("a regex match"));
    println!("search finished: {:?}", elapsed);
    assert_eq!(10_000_000, m.start());
}

fn timeit<T>(mut f: impl FnMut() -> T) -> (T, Duration) {
    let start = Instant::now();
    let t = f();
    let elapsed = Instant::now().duration_since(start);
    (t, elapsed)
}

And this Python program:

import re
import time

def main():
    s, elapsed = timeit(
        lambda: '|'.join(map(re.escape, open('file', 'r').read().splitlines()))
    )
    print(f"pattern read from file: {elapsed}s")

    r, elapsed = timeit(lambda: re.compile(s))
    print(f"regex compiled: {elapsed}s")

    haystack = ('!' * 10_000_000) + 'zygote'
    m, elapsed = timeit(lambda: r.search(haystack))
    print(f"search finished: {elapsed}s")
    assert m.start() == 10_000_000

def timeit(f):
    start = time.time()
    t = f()
    elapsed = time.time() - start
    return t, elapsed

if __name__ == '__main__':
    main()

with the data given by the OP:

$ cargo b --release

$ time ./target/release/i1090
pattern read from file: 8.23772ms
allocated: 234110697, retained: 1451268
regex compiled: 151.386156ms
search finished: 12.125789ms

real    0.179
user    0.143
sys     0.036
maxmem  122 MB
faults  0

$ time python main.py
pattern read from file: 0.03054189682006836s
regex compiled: 0.7595720291137695s
search finished: 0.01816868782043457s

real    0.823
user    0.772
sys     0.050
maxmem  131 MB
faults  0

Note that I'm not 100% sure when this fix is going to be released since it will require putting out a regex-syntax 0.8 release, which is a breaking change from the current 0.7. I might do it soon depending on how some other experiments go.