rust-lang / regex

An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.
https://docs.rs/regex
Apache License 2.0

execute a regex on text streams #425

Open cessen opened 6 years ago

cessen commented 6 years ago

This is more-or-less a continuation of issue #25 (most of which is actually here).

Preface

I don't personally have an urgent need for this functionality, but I do think it would be useful and would make the regex crate even more powerful and flexible. I also have a motivating use-case that I didn't see mentioned in the previous issue.

More importantly, though, I think I have a reasonable design that would handle all the relevant use-cases for streaming regex--or at least would make the regex crate not the limiting/blocking factor. I don't have the time/energy to work on implementing it myself, so please take this proposal with the appropriate amount of salt. It's more of a thought and a "hey, I think this design might work", than anything else.

And most importantly: thanks so much to everyone who has put time and effort into contributing to the regex crate! It is no coincidence that it has become such a staple of the Rust ecosystem. It's a great piece of software!

My use-case

I occasionally hack on a toy text editor project of mine, and this editor uses ropes as its in-memory text data structure. The relevant implication of this is that text in my editor is split over non-contiguous chunks of memory. Since the regex crate only works on contiguous strings, that means I can't use it to perform searches on text in my editor. (Unless, I suppose, I copy the text wholesale into a contiguous chunk of memory just to perform the search on that copy. But that seems overkill and wouldn't make sense for larger texts.)

Proposal

In the previous issue discussing this topic, the main problem noted was that the regex crate would have to allocate (e.g. a String) to return the contents of matches from an arbitrary stream. My proposed solution essentially amounts to: don't return the content of the match at all, and instead only return the byte offsets. It is then the responsibility of the client code to fetch the actual contents. For example, my editor would use its own rope APIs to fetch the contents (or replace them, or whatever), completely independent of the regex crate.

The current API that returns the contents along with offsets could (and probably should) still be included as a convenience for performing regex on contiguous slices. But the "raw" or "low level" API would only yield byte offsets, allowing for a wider range of use-cases.
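To illustrate the "client fetches the contents" half of this idea, here is a minimal sketch of how client code might turn a pair of byte offsets back into bytes when the text lives in non-contiguous chunks. The name `fetch_range` and the use of a plain slice of `&str` chunks as a stand-in for a rope are assumptions made for illustration; a real rope would expose its own API for this.

```rust
/// Hypothetical client-side helper: copy the bytes in `start..end`
/// (absolute offsets into the whole text) out of non-contiguous chunks.
pub fn fetch_range(chunks: &[&str], start: usize, end: usize) -> Vec<u8> {
    let mut out = Vec::with_capacity(end.saturating_sub(start));
    let mut chunk_start = 0; // absolute offset of the current chunk's first byte
    for chunk in chunks {
        let chunk_end = chunk_start + chunk.len();
        // Intersect the requested range with this chunk's range.
        let lo = start.max(chunk_start);
        let hi = end.min(chunk_end);
        if lo < hi {
            out.extend_from_slice(&chunk.as_bytes()[lo - chunk_start..hi - chunk_start]);
        }
        chunk_start = chunk_end;
    }
    out
}
```

With chunks `["Some th", "at and that"]`, asking for the range `5..9` stitches together the tail of the first chunk and the head of the second. The regex side never needs to know how the text is stored.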

Layered API

I'm imagining there would be three "layers" to the API, of increasing levels of convenience and decreasing levels of flexibility:

1. Feed chunks of bytes manually, handling matches as we go

let re = Regex::new("...").unwrap();
let mut matcher = re.streaming_matcher(); // proposed API, does not exist today

// `m` rather than `match`, since `match` is a reserved keyword in Rust.
for m in matcher.consume("Some bytes from a larger stream of data that") {
    my_data_source.do_something(m.start(), m.end());
    // Note: there is no m.as_str() for this API.
}

for m in matcher.consume(" we don't know how much there might be in total.") {
    // ...
}

2. Give regex an iterator that yields bytes

let re = Regex::new("...").unwrap();

// Proposed API, does not exist today.
for m in re.find_iter_from_bytes_iter("Non-contiguous bytes that can be iterated over.".bytes()) {
    // Again, only m.start() and m.end() for this API.
    // ...
}

3. Give regex a slice, just like the current API

let re = Regex::new("...").unwrap();

for m in re.find_iter("A directly passed slice of data.") {
    m.as_str(); // In this API you can get the str slice of the match.
    // ...
}

I'm of course not suggesting naming schemes here, or even the precise way that these APIs should work. I'm just trying to illustrate the idea. :-)

Note that API 2 above addresses my use-case just fine. But API 1 provides even more flexibility for other use-cases.
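To make the layering concrete, here is a minimal, self-contained sketch of how API 2 could sit on top of API 1. It is not a regex engine: it swaps in a plain literal search (Knuth-Morris-Pratt) so that the example stays small, and the names `StreamingLiteralMatcher` and `find_in_byte_iter` are made up for illustration. What it does show is the part that matters for this proposal: the matcher carries its state across `consume` calls, so a match that straddles a chunk boundary is still reported, and all offsets are absolute positions in the stream.

```rust
/// Hypothetical streaming matcher for one literal byte pattern (not a regex).
pub struct StreamingLiteralMatcher {
    needle: Vec<u8>,
    failure: Vec<usize>, // KMP failure table for `needle`
    matched: usize,      // number of needle bytes currently matched
    consumed: usize,     // total number of bytes fed so far
}

impl StreamingLiteralMatcher {
    pub fn new(needle: &[u8]) -> Self {
        assert!(!needle.is_empty(), "needle must not be empty");
        let mut failure = vec![0; needle.len()];
        let mut k = 0;
        for i in 1..needle.len() {
            while k > 0 && needle[i] != needle[k] {
                k = failure[k - 1];
            }
            if needle[i] == needle[k] {
                k += 1;
            }
            failure[i] = k;
        }
        Self { needle: needle.to_vec(), failure, matched: 0, consumed: 0 }
    }

    /// "API 1": feed the next chunk and get back the (start, end) byte
    /// offsets of every non-overlapping match that ends inside this chunk.
    pub fn consume(&mut self, chunk: &[u8]) -> Vec<(usize, usize)> {
        let mut found = Vec::new();
        for &b in chunk {
            while self.matched > 0 && b != self.needle[self.matched] {
                self.matched = self.failure[self.matched - 1];
            }
            if b == self.needle[self.matched] {
                self.matched += 1;
            }
            self.consumed += 1;
            if self.matched == self.needle.len() {
                found.push((self.consumed - self.needle.len(), self.consumed));
                self.matched = 0; // start over so that matches do not overlap
            }
        }
        found
    }
}

/// "API 2": a convenience layer that drives the streaming matcher from any
/// iterator of bytes, for example the bytes of a rope's chunks.
pub fn find_in_byte_iter<I>(needle: &[u8], bytes: I) -> Vec<(usize, usize)>
where
    I: Iterator<Item = u8>,
{
    let mut matcher = StreamingLiteralMatcher::new(needle);
    let mut out = Vec::new();
    for b in bytes {
        out.extend(matcher.consume(&[b]));
    }
    out
}
```

Feeding `"Some th"` and then `"at and that"` reports the first match only on the second call, at offsets 5..9, even though it began in the first chunk. A real implementation would presumably return a lazy iterator rather than a `Vec`; the `Vec` just keeps the sketch short.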

Things this doesn't address

BurntSushi noted the following in the previous discussion (referencing Go's streaming regex support):

The most interesting bit there is that the regexp functions on readers find precisely one match and return byte indices into that stream. In the Go world, if I wanted to use that and actually retrieve the text, I'd write my own "middleware" reader to save text as it's being read. Then I could use that after the first match to grab the matched text. Then reset the "middleware" reader and start searching again.

The problem with that approach: what if there's never a match? You'd end up piling the whole stream into memory (likely defeating the purpose of using a stream).
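For reference, the "middleware" reader described in the quote could look roughly like this in Rust. This is only a sketch using `std::io::Read`; the type name `RecordingReader` and its `saved`/`reset` methods are hypothetical, not part of any existing crate.

```rust
use std::io::{self, Read};

/// Hypothetical "middleware" reader: passes reads through to `inner`
/// and keeps a copy of every byte that was read since the last reset.
pub struct RecordingReader<R> {
    inner: R,
    saved: Vec<u8>,
}

impl<R: Read> RecordingReader<R> {
    pub fn new(inner: R) -> Self {
        Self { inner, saved: Vec::new() }
    }

    /// Bytes read since the last call to `reset`.
    pub fn saved(&self) -> &[u8] {
        &self.saved
    }

    /// Forget everything recorded so far, e.g. after handling a match.
    pub fn reset(&mut self) {
        self.saved.clear();
    }
}

impl<R: Read> Read for RecordingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        // Record exactly the bytes handed to the caller.
        self.saved.extend_from_slice(&buf[..n]);
        Ok(n)
    }
}
```

The sketch also makes the problem visible: nothing ever shrinks `saved` unless the caller calls `reset`, so a search that never matches keeps every byte of the stream in memory.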

This proposal doesn't solve that problem, but rather side-steps it, making it the responsibility of the client code to decide how to handle it (or not). Practically speaking, this isn't actually an API problem but rather is a fundamental problem with unbounded streaming searches.

IMO, it doesn't make sense to keep this functionality out of the regex crate because of this issue, because the issue is by its nature outside of the regex crate. The important thing is to design the API such that people can implement their own domain-specific solutions in the client code.

As an aside: API 1 above could be enhanced to provide the length of the longest potential match so far. For clarity of what I mean, here is an example of what that might look like and how it could be used:

// ...
let mut matcher = re.streaming_matcher(); // proposed API, does not exist today
let mut buffer = String::new();
let mut offset = 0; // absolute stream offset of the first byte in `buffer`

loop {
    // Get next chunk of streaming data
    let chunk = streaming_data_source.get_data();
    buffer.push_str(&chunk);

    // Handle matches. Only the new chunk is fed to the matcher, since it has
    // already seen the older bytes that are still held in `buffer`.
    for m in matcher.consume(&chunk) {
        let match_contents = &buffer[(m.start() - offset)..(m.end() - offset)];
        // ...
    }

    // Shrink buffer to smallest size that is still guaranteed to contain all data for
    // potential future matches.
    let pot_match_len = matcher.longest_potential_match_len();
    let dropped = buffer.len() - pot_match_len;
    offset += dropped;
    buffer.drain(..dropped);
}

That would allow client code to hold onto only the minimal amount of data. Nevertheless, that only mitigates the problem, since you can still have regexes that match unbounded amounts of data.
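The offset bookkeeping in that loop is easy to get wrong, so here is a small runnable sketch of just that part, using only std types. The helper name `trim_front` is made up for illustration, and the "potential match length" is passed in by hand, since the matcher method that would supply it is only a proposal.

```rust
/// Hypothetical helper: keep only the last `keep` bytes of `buffer`
/// and return how many bytes were dropped from the front.
pub fn trim_front(buffer: &mut Vec<u8>, keep: usize) -> usize {
    let dropped = buffer.len().saturating_sub(keep);
    buffer.drain(..dropped);
    dropped
}

/// Translate an absolute (start, end) match into a slice of the trimmed
/// buffer, where `offset` is the absolute position of `buffer[0]`.
pub fn match_bytes(buffer: &[u8], offset: usize, start: usize, end: usize) -> &[u8] {
    &buffer[start - offset..end - offset]
}
```

For example, after reading `"Some th"` with a potential match of length 2 still pending, `trim_front` drops 5 bytes and the buffer holds `"th"`. Once `"at and that"` is appended, the match at absolute offsets 5..9 maps to `buffer[0..4]`.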

BurntSushi commented 8 months ago

Those are really impressive numbers. Wow. Yes there are a few 10x in there, but not many. I was expecting things to be a lot worse. (I still haven't looked at the code yet though.) Nice work @pascalkuthe.

pascalkuthe commented 6 months ago

Maybe a small update: regex-cursor was merged into helix master a while ago and was included in the new 24.03 release, so it has a pretty large userbase. It seems to hold up well to scrutiny so far.

8573 commented 1 month ago

I recently saw that another Rust implementation of regexes over streams, ergex, was released a few years ago, which I thought to mention here, as it does not seem to have been known to this discussion previously: https://github.com/deadpixi/ergex. Warning: It is licensed incompatibly with regex.