rust-lang / regex

An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.
https://docs.rs/regex
Apache License 2.0
3.53k stars 440 forks source link

1.9 memory usage: `globset`-generated RegexSet allocates and retains 48× more memory (600MB) vs regex 1.8 #1059

Closed dtolnay closed 1 year ago

dtolnay commented 1 year ago

What version of regex are you using?

1.9.1

Describe the bug at a high level.

I am investigating a recent large memory usage regression in Buck2 that bisected to a regex update from 1.8.4 to 1.9.0, and persists in 1.9.1.

There are some long-lived GlobSet objects in Buck2 corresponding to filepath patterns that buck has been configured to ignore. The real-world code is here: https://github.com/facebook/buck2/blob/4d300b08ec0c742952f4cdd9abf1c1055c7a75ab/app/buck2_common/src/ignores/ignore_set.rs.

I instrumented the RegexSet created by the globset crate, and was able to create the following repro that shows +600MB being allocated and retained by the RegexSet implementation for a seemingly pretty reasonably sized globset.

What are the steps to reproduce the behavior?

cargo run --release:

// [dependencies]
// regex = "=1.9.1"  # or "=1.8.4"

use regex::bytes::RegexSet;
use std::thread;

const THREADS: usize = 80;

const REGEXES: &[&str] = &[
    "(?-u)^(buck\\-out/.*|buck\\-out)$",
    "(?-u)^(\\.buckd/.*|\\.buckd)$",
    "(?-u)^(\\.git/.*|\\.git)$",
    "(?-u)^(\\.hg/.*|\\.hg)$",
    "(?-u)^(\\.hgmonitor\\.status/.*|\\.hgmonitor\\.status)$",
    "(?-u)^(\\.idea/.*|\\.idea)$",
    "(?-u)^(\\.focus\\-android\\.buckd/.*|\\.focus\\-android\\.buckd)$",
    "(?-u)^(\\.focus\\-android\\-buck\\-out/.*|\\.focus\\-android\\-buck\\-out)$",
    "(?-u)^(\\.lsp\\.buckd/.*|\\.lsp\\.buckd)$",
    "(?-u)^\\.cpplsp[^/]*\\-buck\\-out$",
    "(?-u)^(\\.lsp\\-buck\\-out/.*|\\.lsp\\-buck\\-out)$",
    "(?-u)^(\\.ovrsource\\-rest/.*|\\.ovrsource\\-rest)$",
    "(?-u)^cpplsp[^/]*\\-buck\\-out$",
    "(?-u)^(lsp\\-buck\\-out/.*|lsp\\-buck\\-out)$",
    "(?-u)^(lsp\\.buckd/.*|lsp\\.buckd)$",
    "(?-u)^(buck\\-out/.*|buck\\-out)$",
    "(?-u)^(arvr/tools/build_defs/config/.*|arvr/tools/build_defs/config)$",
    "(?-u)^(arvr\\-legacy/.*|arvr\\-legacy)$",
    "(?-u)^(fbandroid/buck\\-out/.*|fbandroid/buck\\-out)$",
    "(?-u)^(fbandroid/\\.lsp\\.buckd/.*|fbandroid/\\.lsp\\.buckd)$",
    "(?-u)^fbandroid/\\.cpplsp[^/]*\\-buck\\-out$",
    "(?-u)^(fbandroid/\\.lsp\\-buck\\-out/.*|fbandroid/\\.lsp\\-buck\\-out)$",
    "(?-u)^(fbandroid/java/com/facebook/catalyst/js/node_modules/.*|fbandroid/java/com/facebook/catalyst/js/node_modules)$",
    "(?-u)^(fbandroid/libraries/fresco/fbcore/.*|fbandroid/libraries/fresco/fbcore)$",
    "(?-u)^(fbandroid/out/lint/.*|fbandroid/out/lint)$",
    "(?-u)^(fbcode/buck\\-out/.*|fbcode/buck\\-out)$",
    "(?-u)^(fbcode/\\.lsp\\.buckd/.*|fbcode/\\.lsp\\.buckd)$",
    "(?-u)^fbcode/\\.cpplsp[^/]*\\-buck\\-out$",
    "(?-u)^(fbcode/\\.lsp\\-buck\\-out/.*|fbcode/\\.lsp\\-buck\\-out)$",
    "(?-u)^(fbobjc/buck\\-out/.*|fbobjc/buck\\-out)$",
    "(?-u)^(fbobjc/\\.lsp\\.buckd/.*|fbobjc/\\.lsp\\.buckd)$",
    "(?-u)^fbobjc/\\.cpplsp[^/]*\\-buck\\-out$",
    "(?-u)^(fbobjc/\\.lsp\\-buck\\-out/.*|fbobjc/\\.lsp\\-buck\\-out)$",
    "(?-u)^(node_modules/.*|node_modules)$",
    "(?-u)^(ovrsource\\-legacy/.*|ovrsource\\-legacy)$",
    "(?-u)^(arvr/projects/rocktenn/libraries/ARCHIVE/.*|arvr/projects/rocktenn/libraries/ARCHIVE)$",
    "(?-u)^(arvr/projects/rocktenn/projects/ARCHIVE/.*|arvr/projects/rocktenn/projects/ARCHIVE)$",
    "(?-u)^(Research/archive/.*|Research/archive)$",
    "(?-u)^(rocket/.*|rocket)$",
    "(?-u)^third\\-party/rpms/systemd/[^/]*/test$",
    "(?-u)^third\\-party/rpms/systemd/[^/]*/units$",
    "(?-u)^(third\\-party/rust/\\.cargo/.*|third\\-party/rust/\\.cargo)$",
    "(?-u)^(xplat/buck\\-out/.*|xplat/buck\\-out)$",
    "(?-u)^(xplat/infinity/app/applications/facebook\\-intern/sonar/templates/.*|xplat/infinity/app/applications/facebook\\-intern/sonar/templates)$",
    "(?-u)^(xplat/js/node_modules/.*|xplat/js/node_modules)$",
    "(?-u)^xplat/js/react\\-native\\-github/Examples(?:/|/.*/)android/.*$",
    "(?-u)^(xplat/sonar/templates/.*|xplat/sonar/templates)$",
    "(?-u)^(xplat/unity\\-sdk/.*|xplat/unity\\-sdk)$",
    "(?-u)^(?:/?|.*/)[^/]*\\.xcodeproj/.*$",
    "(?-u)^(?:/?|.*/)\\(A Document Being Saved By Xcode\\)/.*$",
    "(?-u)^(?:/?|.*/)[^/]*___jb_bak___$",
    "(?-u)^(?:/?|.*/)[^/]*___jb_old___$",
    "(?-u)^(?:/?|.*/)[^/]*___jb_tmp___$",
    "(?-u)^(?:/?|.*/)__pycache__/.*$",
    "(?-u)^(?:/?|.*/)\\.\\#[^/]*$",
    "(?-u)^(?:/?|.*/)[^/]*\\~$",
];

mod allocator {
    use std::alloc::{GlobalAlloc, Layout, System};
    use std::sync::atomic::{AtomicUsize, Ordering};

    static TOTAL_ALLOC: AtomicUsize = AtomicUsize::new(0);
    static TOTAL_DEALLOC: AtomicUsize = AtomicUsize::new(0);

    #[global_allocator]
    static ALLOCATOR: TotalAllocator = TotalAllocator;

    struct TotalAllocator;

    unsafe impl GlobalAlloc for TotalAllocator {
        unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
            TOTAL_ALLOC.fetch_add(layout.size(), Ordering::Relaxed);
            System.alloc(layout)
        }

        unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
            TOTAL_DEALLOC.fetch_add(layout.size(), Ordering::Relaxed);
            System.dealloc(ptr, layout)
        }
    }

    pub(crate) fn reset() {
        TOTAL_ALLOC.store(0, Ordering::Relaxed);
        TOTAL_DEALLOC.store(0, Ordering::Relaxed);
    }

    pub(crate) fn tell() {
        let total_alloc = TOTAL_ALLOC.load(Ordering::Relaxed) as isize;
        let total_dealloc = TOTAL_DEALLOC.load(Ordering::Relaxed) as isize;
        eprintln!(
            "allocated: {}, retained: {}",
            total_alloc,
            total_alloc - total_dealloc,
        );
    }
}

fn main() {
    allocator::reset();

    let regex_set = RegexSet::new(REGEXES).unwrap();

    eprintln!("eager allocations:");
    allocator::tell();
    allocator::reset();

    thread::scope(|scope| {
        for _ in 0..THREADS {
            scope.spawn(|| {
                for _ in 0..10000 {
                    let path = "fbandroid/apps/instagram/fastparse";
                    let _ = regex_set.is_match(path.as_bytes());
                }
            });
        }
    });

    eprintln!("lazy allocations:");
    allocator::tell();
}

What is the actual behavior?

Regex 1.8.4:

eager allocations:
allocated: 3933520, retained: 397004
lazy allocations:
allocated: 64123704, retained: 12515536

Regex 1.9.1:

eager allocations:
allocated: 3493206, retained: 220812
lazy allocations:
allocated: 602136016, retained: 601652424

What is the expected behavior?

I am interested in knowing whether this amount of allocation is intended for the above regex, or if this might be fixable.

Buck2 is not particularly sensitive to the total amount of allocation done, but very sensitive to steady-state retained memory (with widely used RegexSet objects sitting around) and high-water mark.

BurntSushi commented 1 year ago

This is fixed on crates.io in regex 1.9.2 and regex-automata 0.3.5.

dtolnay commented 1 year ago

Thank you greatly. I confirmed that Buck2 memory usage with regex 1.9.2 and globset 0.4.13 is at or slightly below what we had prior to the regex 1.9 update.