rust-lang / regex

An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.
https://docs.rs/regex
Apache License 2.0
3.53k stars 439 forks source link

LRU cache for dynamic regexs #233

Closed alubbe closed 8 years ago

alubbe commented 8 years ago

Most of my regexs not known at rust's compile time, but I would like to avoid the cost of constantly calling Regex::new on the same regexs. Ideally, I'd like to use an LRU cache, either globally/library-wide or as part of a struct. static mut CACHE: LruCache<String, Regex> = LruCache::new(10); fails because mutable statics are not allowed to have destructors and

  // part of a struct initialization
  let mut cache: LruCache<&str, &Regex> = LruCache::new(10);

  // in a method call, given a re_str
  let re = Regex::new(&re_str).unwrap().to_owned();
  cache.insert(&re_str, &re);

fails because re does not live long enough. Is there any way to build an LRU cache for dynamic regexs?

BurntSushi commented 8 years ago

Rust doesn't have a concept of life-before-main, so trying to run arbitrary code for the initialization of a static is a no-go.

With that said, the lazy_static! macro as part of the lazy_static crate is typically a good way to work around that, which will transparently initialize your state on first use and is properly synchronized for simultaneous access from multiple threads. Is there any particular reason why lazy_static! didn't work for you? There is an example use in the regex crate documentation. It seems plausible to initialize your LruCache using lazy_static!.

alubbe commented 8 years ago

If I understand it correctly, I cannot declare a static mut using lazy_static, so I've whipped up the following attempt using mutex. But just as with re before, the borrow checker prints that borrowed value does not live long enough - in this case referring to &asd2. Where am I going wrong?

#[macro_use] extern crate lazy_static;
extern crate lru_cache;
extern crate regex;
use lru_cache::LruCache;
use regex::Regex;
use std::sync::Mutex;

lazy_static! {
  static ref CACHE: Mutex<LruCache<String, Regex>> = Mutex::new(LruCache::new(10));
}

fn foo() {
  let bar = "bar";
  let re = Regex::new(&bar).unwrap();
  let asd = String::from("asd");
  CACHE.lock().unwrap().insert(asd, re);

  let asd2 = String::from("asd");
  let cached_re = CACHE.lock().unwrap().get_mut(&asd2).unwrap();
}
BurntSushi commented 8 years ago

Instead of

let cached_re = CACHE.lock().unwrap().get_mut(&asd2).unwrap();

try

let cache = CACHE.lock().unwrap();
let cached_re = cache.get_mut(&asd2).unwrap();

The problem with your current approach is that CACHE.load().unwrap() won't live long enough otherwise.

In the future, it would be better to post these kinds of questions at our user forum. That's not to say they are particularly unwelcome here, but you'll have the benefit of a lot more eyes on your problem in the forum then here. :-)

alubbe commented 8 years ago

Oh of course. Thanks!

alubbe commented 8 years ago

Quick follow-up question - do you have any estimate of how much memory each of those compiled regex instances takes up, or how to measure it?

BurntSushi commented 8 years ago

It varies wildly depending on the regex. Using Unicode character classes is a quick way to make them very large. The total size could be any where from low kilobytes to low megabytes.

You have some control over how much space they're allowed to use: http://doc.rust-lang.org/regex/regex/struct.RegexBuilder.html