rust-lang / regex

An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.
https://docs.rs/regex
Apache License 2.0

execute a regex on text streams #425

Open cessen opened 6 years ago

cessen commented 6 years ago

This is more-or-less a continuation of issue #25 (most of which is actually here).

Preface

I don't personally have an urgent need for this functionality, but I do think it would be useful and would make the regex crate even more powerful and flexible. I also have a motivating use-case that I didn't see mentioned in the previous issue.

More importantly, though, I think I have a reasonable design that would handle all the relevant use-cases for streaming regex--or at least would make the regex crate not the limiting/blocking factor. I don't have the time/energy to work on implementing it myself, so please take this proposal with the appropriate amount of salt. It's more of a thought and a "hey, I think this design might work", than anything else.

And most importantly: thanks so much to everyone who has put time and effort into contributing to the regex crate! It is no coincidence that it has become such a staple of the Rust ecosystem. It's a great piece of software!

My use-case

I occasionally hack on a toy text editor project of mine, and this editor uses ropes as its in-memory text data structure. The relevant implication of this is that text in my editor is split over non-contiguous chunks of memory. Since the regex crate only works on contiguous strings, that means I can't use it to perform searches on text in my editor. (Unless, I suppose, I copy the text wholesale into a contiguous chunk of memory just to perform the search on that copy. But that seems overkill and wouldn't make sense for larger texts.)

Proposal

In the previous issue discussing this topic, the main problem noted was that the regex crate would have to allocate (e.g. a String) to return the contents of matches from an arbitrary stream. My proposed solution essentially amounts to: don't return the content of the match at all, and instead only return the byte offsets. It is then the responsibility of the client code to fetch the actual contents. For example, my editor would use its own rope APIs to fetch the contents (or replace them, or whatever), completely independent of the regex crate.

The current API that returns the contents along with offsets could (and probably should) still be included as a convenience for performing regex on contiguous slices. But the "raw" or "low level" API would only yield byte offsets, allowing for a wider range of use-cases.

Layered API

I'm imagining there would be three "layers" to the API, of increasing levels of convenience and decreasing levels of flexibility:

1. Feed chunks of bytes manually, handling matches as we go

let re = Regex::new("...").unwrap();
let mut matcher = re.streaming_matcher();

for m in matcher.consume("Some bytes from a larger stream of data that") {
    my_data_source.do_something(m.start(), m.end());
    // Note: there is no as_str() for this API.
}

for m in matcher.consume(" we don't know how much there might be in total.") {
    // ...
}

2. Give regex an iterator that yields bytes

let re = Regex::new("...").unwrap();

for m in re.find_iter_from_bytes_iter("Non-contiguous bytes that can be iterated over.".bytes()) {
    // Again, only m.start() and m.end() for this API.
    // ...
}

3. Give regex a slice, just like the current API

let re = Regex::new("...").unwrap();

for m in re.find_iter("A directly passed slice of data.") {
    m.as_str(); // In this API you can get the str slice of the match.
    // ...
}

I'm of course not suggesting naming schemes here, or even the precise way that these APIs should work. I'm just trying to illustrate the idea. :-)

Note that API 2 above addresses my use-case just fine. But API 1 provides even more flexibility for other use-cases.

Things this doesn't address

BurntSushi noted the following in the previous discussion (referencing Go's streaming regex support):

The most interesting bit there is that the regexp functions on readers find precisely one match and return byte indices into that stream. In the Go world, if I wanted to use that and actually retrieve the text, I'd write my own "middleware" reader to save text as it's being read. Then I could use that after the first match to grab the matched text. Then reset the "middleware" reader and start searching again.

The problem with that approach: what if there's never a match? You'd end up piling the whole stream into memory (likely defeating the purpose of using a stream).
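For concreteness, here is a minimal sketch of what such a "middleware" reader could look like in Rust. The type and method names (RecordingReader, saved, reset) are made up for illustration; nothing like this exists in the regex crate. It also makes the memory problem described above obvious: if no match ever arrives, the saved buffer grows without bound.

use std::io::{self, Read};

// A wrapper that remembers every byte it has handed out, so that byte
// offsets reported by a search can later be turned back into text.
struct RecordingReader<R> {
    inner: R,
    saved: Vec<u8>,
}

impl<R: Read> RecordingReader<R> {
    fn new(inner: R) -> Self {
        RecordingReader { inner, saved: Vec::new() }
    }

    // Bytes seen so far; index with match offsets to recover match text.
    fn saved(&self) -> &[u8] {
        &self.saved
    }

    // Forget the buffered bytes once a match has been handled.
    fn reset(&mut self) {
        self.saved.clear();
    }
}

impl<R: Read> Read for RecordingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        self.saved.extend_from_slice(&buf[..n]);
        Ok(n)
    }
}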

This proposal doesn't solve that problem, but rather side-steps it, making it the responsibility of the client code to decide how to handle it (or not). Practically speaking, this isn't actually an API problem but rather is a fundamental problem with unbounded streaming searches.

IMO, it doesn't make sense to keep this functionality out of the the regex crate because of this issue, because the issue is by its nature outside of the regex crate. The important thing is to design the API such that people can implement their own domain-specific solutions in the client code.

As an aside: API 1 above could be enhanced to provide the length of the longest potential match so far. For clarity of what I mean, here is an example of what that might look like and how it could be used:

// ...
let mut matcher = re.streaming_matcher();
let mut buffer = String::new();
let mut offset = 0;

loop {
    // Get next chunk of streaming data
    buffer.push_str(streaming_data_source.get_data());

    // Handle matches
    for m in matcher.consume(&buffer) {
        let match_contents = &buffer[(m.start() - offset)..(m.end() - offset)];
        // ...
    }

    // Shrink buffer to smallest size that is still guaranteed to contain all data for
    // potential future matches.
    let pot_match_len = matcher.longest_potential_match_len();
    offset += buffer.len() - pot_match_len;
    buffer.truncate_from_front(pot_match_len);
}

That would allow client code to at least only hold onto the minimal amount of data. Nevertheless, that only mitigates the problem, since you can still have regexes that match unbounded amounts of data.

BurntSushi commented 3 years ago

More concretely, a Cursor could be implemented with a rope or even with a std::fs::File. But not with a plain io::Read.

cessen commented 3 years ago

Ah, right. I mean, I don't think a separate API is strictly necessary for streaming. If there's a reliable upper limit on how much the NFA needs to look back, then streaming implementations can just cache a single block back internally to allow for the minimum amount of look-back.

In fact, a streaming implementation over Read could just have a (comparatively) simple Cursor implementation that does that internally for all Read types.

BurntSushi commented 3 years ago

I see. That's a little tricky though since it kind of relies on the specific sequence of Cursor calls used by a regex engine. But I guess as long as the impl can detect unsupported sequences of calls, it could report an error. So that's a good point!

Marwes commented 3 years ago

https://github.com/rust-lang/regex/issues/425#issuecomment-860268653

The clone requirement in combine was actually relaxed slightly and it now uses "checkpoints" instead: https://docs.rs/combine/4.5.2/combine/stream/trait.ResetStream.html . A checkpoint contains just enough information to rewind the input stream and no more, which is sometimes a bit more efficient (and flexible).

Does a regex always know how much of a chunk can be safely dropped when encountering a boundary? In combine I rely on that assumption when decoding from IO sources while still using parsers that only know about and operate on &[u8]. So if a parser reports that it has successfully consumed 100 bytes out of 250 but it could not finish the entire parse, the caller knows it can drop the first 100 bytes and will try to fill the buffer so that parsing can continue. When resuming the parse again with more data the parser will just continue 100 bytes in.

The downside of this is that the parser could have to replay big chunks of the parse if it does not know for certain that it is done with a part that ended in a chunk boundary. The upside is that the parser (or regex engine in this case) only needs to work with plain &[u8] slices (and that these slices may be incomplete, so it needs to report how much of the slice is 100% parsed).

BurntSushi commented 3 years ago

Does a regex always know how much of a chunk can be safely dropped when encountering a boundary?

The Pike VM ("NFA simulation") does. But none of the other regex engines (backtracker, lazy DFA) do. For the Pike VM, whenever it's at offset i, it knows it will never need to read anything before i-4.

ratmice commented 2 years ago

Just reposting this comment from the other bug report which I've now closed:

There are a lot of issues in the regex crate that discuss streaming regex, most prominently this one, but within it there is discussion of a Cursor API (https://github.com/rust-lang/regex/issues/425#issuecomment-860040036). I have been thinking a bit through the case where we have a Rope and have something equivalent to a String, but don't have a single slice to give to find_*.

In the specific Cursor comment, I was thinking about a rather different API than the iteration/cursor based one, where regex just deals with absolute coordinates, treating the data as one logically contiguous space, and retrieves slices through an API such as:

trait DeltaSlices {
    // Returns a (slice, delta)
    // Returns a single unique slice whose range overlaps bounds
    // where slice[bounds.start - delta..slice.len()] is a valid range over the returned slice.

    fn first_slice(&self, bounds: std::ops::Range<usize>) -> Option<(&[u8], usize)>;
    // The cumulative sum of slice lengths accessible from this.
    fn total_len(&self) -> usize;
}

impl DeltaSlices for &[u8] {
    fn first_slice(&self, _: std::ops::Range<usize>) -> Option<(&[u8], usize)> {
        Some((self, 0))
    }
    fn total_len(&self) -> usize {
        self.len()
    }
}

Edit: FWIW I haven't really thought through lifetimes at all on this, or what a lending-style iterator API could provide instead now that GATs have been stabilized.

I'm curious what your thoughts are on this, or on making the API more flexible for things such as ropes, in ways that are perhaps more tractable than a fully general streaming API. Given that the API is somewhat in transition at this point in time, being after the 0.2 but before the 0.3 release, if you have any thoughts on when the timing would be good and which branch would be a good place to start tinkering with such an API, I would be curious to do so.

It may not be a good idea, but it seemed perhaps less branchy than the Cursor API which was discussed, and it also keeps things like 'chunk indexes' from infecting the public API, since conversion between absolute and slice-relative coordinates is entirely local to the trait.

I guess my biggest gripe with the API is that it isn't Iterator-like enough, in the sense that it looks essentially random access and presents little opportunity for cursor advancement (I find this a funny gripe, because perhaps one gripe with the cursor was that it was too iterator-like and not random-access enough, see the quote below).

So overall, this API for describing streams seems insufficient. At every turn we make, the search implementations seem to be begging for random access.

Anyhow.

BurntSushi commented 2 years ago

And also re-posting my reply (from https://github.com/BurntSushi/regex-automata/issues/23):

Speaking to logistics first: my main priority right now is getting regex-automata into a place where regex can replace most of its internals with uses of regex-automata. That doesn't include streaming. I just don't have the bandwidth for it right now. With that said, one of the goals of regex-automata is that you don't have to use the higher level search routines. All of the regex engines (including the NFA engines, which are WIP) expose enough of their internals to write your own search routines. For folks needing the streaming use case, I would really love it if they wrote their own search routines first. Then we can have a "meeting of the minds" to see how we might unify them, if at all. Writing search routines for the DFA based regex engines is not trivial in an absolute sense, but it is trivial when compared to everything that goes into turning a regex pattern string into a finite state machine. The NFA search routines will be a bit more involved to write outside of the crate, but it is nevertheless an intended option that this crate will provide. For the DFA case, writing a search routine is small/simple enough that I even include examples of it in the documentation.

It's also worth saying that since I see regex-automata as an "expert level" crate, I don't see many people using it (at least, not near as much as regex itself). For that reason, I don't prioritize the minimization of churn nearly as much, so long as there are no breaking changes in regex proper. I treat regex-syntax similarly. I have no problems releasing breaking changes to regex-syntax at the API level at whatever cadence is convenient. That's why regex-syntax is specifically not a public dependency of regex (nor will regex-automata be). So don't worry too much about stuff not being in regex-automata 0.3. I won't be nearly as resistant to evolving its APIs as I am with regex.

Also, I kinda think it would be better to keep this discussion in https://github.com/rust-lang/regex/issues/425, since it's all pretty inter-connected. Spreading it over two issues makes it kind of annoying to follow the discussion. Also, more people are likely watching that regex issue, which gives more eyes on things.

As for your specific API proposal... I'm not quite sure that I grok it to be honest. It's not clear to me how it would be used by the regex engines. Also, your doc comments appear to be out of sync with the type signatures, which makes things a bit more confusing.


And yes, I think it is valuable to talk about discontiguous strings as a use case that is distinct from the fully general streaming case. It would be nice to unify them, but it's very plausible that the former case can be dealt with more simply. (I think I talk about this a bit in the regex issue.)


Also, note this comment https://github.com/rust-lang/regex/issues/425#issuecomment-860277747 where I call out that Cursor doesn't solve the fully general stream problem. Rather, in theory, it could be implemented via discontiguous strings or with, e.g., a std::fs::File. But not an arbitrary std::io::Read.

BurntSushi commented 1 year ago

I learned about this project today that attempts to search /proc/pid/mem, and really wants to be able to do searching chunk-by-chunk: https://github.com/eras/memgrep

I just wanted to add it here as another use case to look into when thinking about streaming.

iago-lito commented 1 year ago

Hi there in the future :) Since you seem interested in the various use cases for streaming regexes, I might throw in this situation I recently came across in my new toy project, and why this has led me here again.

I wish to experiment with some sort of efficient lexer featuring live rule-based lexical replacement, to construct a kind of recursive, lexing-level "macro expansion" system. If we're not willing to keep copying strings around in memory, then this makes the lexed input structure rather interesting. For instance, consider the following input:

a b c d

And the following "expansion rules":

b: u v
v: e
v c: f

When the cursor recognizes b, this pattern is expanded into u v, but I can't just erase my input prefix with that, because I need to keep all information to backtrack in case further lexing control flow requires it. So I'd construct the expanded chunk elsewhere:

  ↓
a b c d  # (original input)
↓
u v  # (expansion of token `b`)

When recognizing u in the expanded string, another expansion step can occur:

  ↓
a b c d 
↓
u v
↓
e

Lexing should resume then, but recognizing v c in the remaining input is non-trivial because, although that "remaining input" is clearly e v c d, there is no such string stored in memory at this point. And I think it should not be needed.

So, when extending my expansion rules LHS to full-fledged regexes, I find myself into a situation where streaming regexes would be the best option again. I would feed them with some simple iterator starting from the lowest cursor and "climbing up" that expansion stack to reconstruct the remaining input: e| v| c d. Here, | markers indicate positions where the iterator jumps, but that the regex engine should be insensitive to.

BurntSushi commented 1 year ago

@iago-lito Have you tried using a fully compiled DFA from regex-automata? That's about the lowest level interface you could have to a regex since you control the state transitions directly.

(The main problem with a DFA is its worst case exponential build time and its memory usage. But maybe those aren't issues for your lexing use case.)
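To make that concrete, here is a rough sketch of driving a fully compiled DFA one byte at a time with regex-automata's Automaton trait, in the style of the examples in its documentation. The pattern and the chunking are invented for illustration, and prefilters, quit states and error handling are skipped; since the DFA only ever sees one byte at a time, chunk boundaries don't matter to it.

use regex_automata::{
    dfa::{dense, Automaton},
    HalfMatch, Input,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let dfa = dense::DFA::new(r"[a-z]+[0-9]+")?;
    // Pretend these arrived from a stream; the DFA never needs a contiguous haystack.
    let chunks: &[&[u8]] = &[b"id: foo", b"42 bar"];

    // Start state for the beginning of the stream.
    let mut state = dfa.start_state_forward(&Input::new(b""))?;
    let mut last_match = None;
    let mut offset = 0;
    'search: for chunk in chunks {
        for &byte in chunk.iter() {
            state = dfa.next_state(state, byte);
            if dfa.is_match_state(state) {
                // Match states in these DFAs are delayed by one byte, so the
                // match ends at the offset of the byte just consumed.
                last_match = Some(HalfMatch::new(dfa.match_pattern(state, 0), offset));
            } else if dfa.is_dead_state(state) {
                // No match is possible anymore; a real search routine stops here.
                break 'search;
            }
            offset += 1;
        }
    }
    // Walk the special end-of-input transition to flush a final pending match.
    state = dfa.next_eoi_state(state);
    if dfa.is_match_state(state) {
        last_match = Some(HalfMatch::new(dfa.match_pattern(state, 0), offset));
    }
    // Reports the end offset of the leftmost match, here the end of "foo42".
    println!("{:?}", last_match);
    Ok(())
}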

jongiddy commented 1 year ago

One approach to streaming that might be simpler to implement in the current code would be for the *_at functions to accept a Range for the start parameter.

If we have a pattern that matches exactly 10 bytes, then we can scan streamed chunks as they come in, but keep the last 9 bytes of each chunk to prepend to the next chunk to ensure that we detect a pattern that crosses chunk boundaries. This works fine using the current code (ignoring the complexity of word boundaries for now).

In the more realistic situation where we have a pattern that matches, say, 1-10 bytes, to ensure that we find all patterns, we need to keep 9 bytes to prepend to the next chunk. But this also means that we'll find shorter patterns near the end of a chunk twice, once when we scan the chunk initially, and again when we scan the first part of the next chunk.

With a start parameter that takes a Range we can run find_at(chunk, (..chunk.len() - 9)) on the first chunk to avoid the double scanning.
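As an illustration of the approach described above (the pattern, the chunks, and the 10-byte bound are made up), scanning with the existing regex API while carrying over the last 9 bytes looks roughly like this; anything that falls entirely inside the carried tail is what gets scanned twice and what a Range start parameter would let you skip:

use regex::bytes::Regex;

fn main() {
    // Hypothetical pattern that matches at most 10 bytes.
    let re = Regex::new(r"ab{0,8}c").unwrap();
    let chunks: &[&[u8]] = &[b"xx ab", b"bbc yy"];

    let mut carry: Vec<u8> = Vec::new();
    let mut consumed = 0; // absolute offset of the start of the carry buffer
    for chunk in chunks {
        carry.extend_from_slice(chunk);
        for m in re.find_iter(&carry) {
            // Prints 3..8 for the match "abbbc" spanning the chunk boundary.
            println!("match at {}..{}", consumed + m.start(), consumed + m.end());
        }
        // Keep the last 9 bytes so a match crossing the next boundary is
        // still found; anything kept here may be reported again next round.
        let drop = carry.len().saturating_sub(9);
        carry.drain(..drop);
        consumed += drop;
    }
}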

BurntSushi commented 1 year ago

That doesn't work for the case where there is no upper bound on the pattern. (I'm not sure it even works in the cases where there is. I just woke up.)

regex-automata 0.3 will be out soon. When that happens, I'm going to tell people to go implement their ideas and see if they work. I'm convinced this problem is much much harder.

jongiddy commented 1 year ago

I don't mean this to be a general-purpose solution, just to enable some use cases. In many cases, even with unbounded patterns, it is possible to give a limit of how big a match you really care about, especially when throughput is more important than finding extreme matches.

I don't even think that this should be sold as a solution for streaming. More that it is possible to search a buffer, constraining the start of the pattern to a range within the buffer, and this can be useful in some streaming cases.

pascalkuthe commented 1 year ago

And yes, I think it is valuable to talk about discontiguous strings as a use case that is distinct from the fully general streaming case. It would be nice to unify them, but it's very plausible that the former case can be dealt with more simply. (I think I talk about this a bit in the regex issue.)

I have been working on this use case (specifically searching a ropey Rope) using cursors. I am starting to close in on something that works quite well (the code is still rough around the edges but I have run a differential fuzzer for about a trillion iterations so it's a start). I have been using a cursor-style API (similar to what you described @BurntSushi) and essentially building an Input struct that handles that. Currently, it's hardcoded to use ropey but it could be generalized.

The DFA search routines were really easy to port. The API to traverse bytes really works quite well and I was able to basically copy the search code and make some light adjustments. The search routines are so similar that we might just be able to use the same implementations for both contiguous and discontiguous strings (and avoid the extra overhead by relying on inlining to remove the extra branches).

I have no idea how to implement the generic streaming case for the DFA and I am not even sure if it's really feasible.

For the NFA I have currently essentially reimplemented the pikevm. This was necessary because it does not expose any lower-level API than search_slots. The pikevm is actually quite easy to port as you mentioned because it only scans forward one byte at a time with at most four bytes of look-ahead/look-behind (but no reverse search). So a very easy implementation could simply buffer 8 bytes and call the epsilon transition function with the buffer. Currently, I have instead reimplemented all look-around assertions for discontiguous strings, but a small buffer is probably better.

I think using the same cursor style API still makes sense for the pikevm too. Realistically even somebody performing a regex search on a TCP stream or similar likely wants to use a buffered reader anyway and keeping the last 4 bytes of the previous chunk around should be trivial.

The pikevm benefits from a cursor API too for the literal optimizations. I have already implemented a prefilter for discontiguous strings by simply buffering 2*max_needle_len - 2 bytes at chunk boundaries. For ropey I can assume a minimum chunk size of 28 for all but the last chunk and that makes this really simple (I simply abort the Prefilter construction for needles longer than that). Smaller chunks could be supported using more complex buffering, though that would probably cost some performance. In most practical scenarios the needle size is likely small compared to the chunk size though (again, even a network stream would likely have a buffer of at least a kilobyte and most rope chunks also tend to be large) and it's probably a lot faster to search the source buffers directly and only use an internal buffer at chunk boundaries instead of copying all input to internal buffers.
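For readers following along, here is a minimal sketch of that boundary buffering for a single literal needle, using memchr::memmem directly rather than the Prefilter type. The function name and the assumption that every chunk is at least needle.len() - 1 bytes long are mine, not part of the implementation discussed above.

use memchr::memmem::Finder;

// Report absolute offsets of `needle` across non-contiguous chunks by
// searching each chunk plus a small window around each chunk boundary.
// Assumes the needle is non-empty and every chunk is at least
// needle.len() - 1 bytes long, so no occurrence spans more than one boundary.
fn find_across_chunks(needle: &[u8], chunks: &[&[u8]]) -> Vec<usize> {
    let finder = Finder::new(needle);
    let keep = needle.len() - 1;
    let mut out = Vec::new();
    let mut offset = 0; // absolute offset of the current chunk
    let mut tail: Vec<u8> = Vec::new(); // last `keep` bytes of the previous chunk
    for chunk in chunks {
        if !tail.is_empty() {
            // Boundary window: at most 2 * needle.len() - 2 bytes. Only
            // occurrences that actually cross the boundary fit entirely
            // inside it, so nothing is reported twice.
            let mut window = tail.clone();
            window.extend_from_slice(&chunk[..chunk.len().min(keep)]);
            for pos in finder.find_iter(&window) {
                out.push(offset - tail.len() + pos);
            }
        }
        // Occurrences fully inside this chunk.
        for pos in finder.find_iter(chunk) {
            out.push(offset + pos);
        }
        tail = chunk[chunk.len().saturating_sub(keep)..].to_vec();
        offset += chunk.len();
    }
    out
}

The 2*max_needle_len - 2 figure comes from the fact that an occurrence crossing a boundary uses at least one byte from each side, so it always fits inside that window, while anything fully inside a chunk is found by the per-chunk scan.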

The prefilter implementation is a huge pain to deal with currently because the Prefilter struct is so opaque and I can't get the information I need (the maximum needle length), so I had to reimplement the prefilter struct, all the config structs, and all the code that interacts with them. Upstreaming just the extra function on the prefilter struct to search a cursor would be extremely useful and reduce the amount of duplicate code by a lot.

@BurntSushi would you be interested in upstreaming the kind of implementation I described? I think if we just had a simple const CAN_BACKTRACK: bool in the cursor trait, the meta regex engine could simply disable the DFA engines (if that constant is false). If you are interested, I could for example start with a PR that defines the general abstraction for the cursor (and something akin to the Input struct but for cursors) and adds an implementation to the Prefilter struct. After that, I could follow up by implementing support for searching cursors in the various engines.

BurntSushi commented 1 year ago

I'm open to small changes currently like, "it would really help to expose the max length or whatever on Prefilter." But I don't have the bandwidth to mentor, merge and maintain the full thing you're talking about. I would want to see it proved out and in real use first anyway. And ideally we would have a few different examples to draw from.

Basically, I'm in no rush here but I'm very open to exposing things where reasonable to lighten your burden. I won't rubber stamp things, but I'm open to that exploration.

pascalkuthe commented 1 year ago

I'm open to small changes currently like, "it would really help to expose the max length or whatever on Prefilter." But I don't have the bandwidth to mentor, merge and maintain the full thing you're talking about. I would want to see it proved out and in real use first anyway. And ideally we would have a few different examples to draw from.

Basically, I'm in no rush here but I'm very open to exposing things where reasonable to lighten your burden. I won't rubber stamp things, but I'm open to that exploration.

Thank you, that sounds very reasonable to me! I will continue my current experiments then. I will start out just making whatever changes I need to regex-automata in a fork, and once my implementation has matured a bit, start sending some PRs for some of the small API changes that I will likely need. Especially exposing the maximum prefilter needle length is probably something I will send pretty soon, since I see no workaround/alternative to that.

cessen commented 1 year ago

@pascalkuthe

For ropey I can assume a minimum chunk size of 28 for all but the last chunk

This isn't directly relevant to this issue (for which I apologize), but I wanted to nip this in the bud. You cannot assume a minimum chunk size from Ropey at all, and this is explicitly stated in the documentation:

There are no guarantees about the size of yielded chunks, and except for CRLF pairs and being valid str slices there are no guarantees about where the chunks are split. For example, they may be zero-sized, they don’t necessarily align with line breaks, etc.

The current implementation happens to ensure minimum chunk sizes as an implementation detail, but it is explicitly not a public API guarantee (it could in theory even change in a patch release!). So you should never depend on it for correctness.

pascalkuthe commented 1 year ago

For ropey I can assume a minimum chunk size of 28 for all but the last chunk

You cannot assume a minimum chunk size from Ropey at all, and this is explicitly stated in the documentation.

I am aware of that and I will need to deal with small chunks for other cursor sources too so this is more of a stopgap to make my life easier implementing a first prototype. Once the chunk size gets smaller than the needle size buffering becomes more complex and I didn't want to deal with that complexity immediately.

cessen commented 1 year ago

Ah, that totally makes sense. I still wanted to clarify for drive-by readers, just in case. But that indeed seems like a good approach for getting started, yeah.

dfoxfranke commented 1 year ago

I just released a rope crate of my own, https://docs.rs/im-rope, so I'm likewise interested in a solution for the case of text representations that aren't capable of implementing AsRef<[u8]> but nonetheless still support efficient random access. This seems like it should be considerably easier than the general streaming case.

BurntSushi commented 1 year ago

@dfoxfranke Have you tried experimenting with the regex-automata DFA (or lazy DFA) APIs to achieve your end?

The "discontiguous strings" versus "streams" cases are discussed quite a bit above. I agree that the former is likely easier than the latter, but I think it is in part also about what kinds of information can be reported. With that said, both are still extremely difficult and tend to be abstraction busting. This is why I'm suggesting regex-automata, because it exposes lower level APIs. For example, you can step over the DFA one transition at a time.

dfoxfranke commented 1 year ago

Yeah, I'm pretty sure I can make it work if I go low enough, just hoping that eventually I won't have to. In an ideal world after future improvements to regex, there'd just be some Haystack trait that my Rope type can implement and then everything else would just work.

BurntSushi commented 1 year ago

@dfoxfranke I doubt that will ever happen. At best, there might be some more convenience APIs in regex-automata, but it doesn't make sense to complicate regex for this niche case. The path here is for people with streaming/rope use cases to experiment with the existing low level APIs and share the abstractions they've built. Over time, this should help create a number of real world use cases that we can draw from to see if it makes sense to build one convenience API that everyone can use.

But that's a lot of work and will likely take calendar years.

dfoxfranke commented 1 year ago

Gotcha. ropey folks (@cessen @pascalkuthe), any interest in collaborating on something like this at some point? I'm thinking we could create a new high-level crate that's a rough workalike of regex, still uses regex-automata and regex-syntax under the hood, but eschews any APIs that assume you have a [u8], in favor of something cursor-based.

pascalkuthe commented 1 year ago

I am working on something like that already. It's already partially working but I don't have the time to finish it. Of course, coupling any implementation to a specific rope implementation doesn't make much sense.

There are also a couple of smaller missing APIs in regex-automata that I want to send PRs for once I refocus on that. Otherwise, you need to duplicate a lot of code.

dfoxfranke commented 1 year ago

@pascalkuthe do you have the code anywhere I could take a look at?

cessen commented 1 year ago

At best, there might be some more convenience APIs in regex-automata, but it doesn't make sense to complicate regex for this niche case.

Honestly, since you started moving the low-level bits to regex-automata, I've felt zero desire for the regex crate itself to support the segmented/streaming use cases anymore. I think you've made an excellent decision with the regex / regex-automata split, and I'm happy for regex-automata to be what I reach for when I need/want lower level control over how regexes execute and are fed data.

I think the main thing now is "just" making the regex-automata implementations and APIs flexible enough to handle these more niche use cases. Which is notably different from adding convenience APIs; on the contrary, if anything it's about exposing more and making the APIs even lower level. E.g. manually feeding the pikevm (@pascalkuthe were you working on that?), etc.

BurntSushi commented 1 year ago

Yeah, the PikeVM indeed needs a re-think if it is to support streaming. When I wrote it down, I was thinking that since it is pretty well isolated, interested folks could copy & paste it and then adjust it to their needs. Then once something is working, we can start looking at the API changes necessary and how to evolve the PikeVM in regex-automata in that direction. (Or, perhaps, not if it proves too invasive. Not sure.)

pascalkuthe commented 1 year ago

Yeah I was working on porting the PikeVM. I had a working prototype but was not happy with how invasive/complex the changes ended up being. Basically all look-around assertions needed to deal with the fact that input could be noncontiguous.

I started working on a simpler approach that hides most of this behind an abstraction. I think what @BurntSushi said makes a lot of sense here. A streaming regex engine will need additional low-level APIs to avoid duplicating a bunch of code (currently I basically copied the entire pike VM) but initially it probably makes sense to just copy the code and upstream additional APIs only once it's clear what is actually needed.

BurntSushi commented 1 year ago

@pascalkuthe Sounds good. There is also the nuclear option: that we add nfa::thompson::pikevm_stream::PikeVM or some such. As in, it's a completely different engine. I'd prefer not to go that way, but if trying to hit both use cases with the same engine leads to a redonkulous API, then the nuclear option might be better.

(The marginal cost of adding another engine is fairly high, but not nearly as high as it was before regex-automata. And certainly less risky. The infrastructure for testing the new engine is in place and it all "just needs to be hooked up." Maintenance going forward is somewhat of a concern, but in practice, the internals of these engines change very rarely.)

BurntSushi commented 1 year ago

One other thing worth asking here for folks who are working on the streaming case: are the APIs for the DFA and the lazy DFA low level enough to make it possible to build a streaming engine on top of it?

pascalkuthe commented 1 year ago

Yeah I actually got the DFA working first (both lazy and eager). With the exception of https://github.com/rust-lang/regex/pull/1031 and a way to determine the length of a prefilter (so I know how much to buffer) that worked fairly well.

There is obviously some duplication but it's not too bad: https://github.com/pascalkuthe/ropey_regex/blob/master/src/engines/dfa/search.rs

For the DFA in particular, the memchr-based acceleration might be nice to reuse (it's a verbatim copy right now IIRC) but it's not too much code.

pascalkuthe commented 11 months ago

I have been working on this again a bit and I have run into a small roadblock.

I am trying to be very general with my approach, supporting a cursor API that optionally supports backtracking. The goal would be that inputs that support backtracking (like ropes) can use all engines (dfa, hybrid, ..) while fully streaming inputs (basically any Iterator<&[u8]>) could only use the pikevm (and potentially prefilters).

The problem is that the empty module that ensures UTF-8 codepoint alignment essentially makes all regex engines require some form of backtracking, as it restarts the search when the end of a match does not fall on a valid UTF-8 codepoint boundary. That would make it impossible to fully support the streaming case in unicode mode.

However, I read the documentation of that module carefully and it sounds like this may not be a huge problem in practice, because most practical cases only involve empty matches at the start of the haystack. These cases don't pose an issue because there isn't any backtracking involved (note that I require chunks to be Unicode aligned, so if the match is at a chunk boundary it is automatically Unicode aligned). The only cases which wouldn't work are:

I would be ok with panicking in these two cases with a cursor that doesn't support backtracking (which happens automatically anyway) since they are pretty niche, as long as the more common cases work.

@BurntSushi is my understanding correct here or do you have any better ideas how to handle this?

BurntSushi commented 11 months ago

I'm not quite sure to be honest. (Note that (?m:$) and (?m:^) are not Unicode-aware line anchors. This crate doesn't support them at all. You might be thinking of the support for overriding the line terminator to be any particular byte. The empty module docs discuss that briefly at the very end.)

I wonder if there is perhaps a way to "not solve" this problem. I can think of two different approaches to take there:

  1. No support for stream searching for regexes that can match the empty string. Or special case it somehow.
  2. No support for UTF-8 mode. In your comment, it looks like you're conflating Unicode mode with UTF-8 mode. These modes are fully de-coupled in regex-automata. The only thing you lose by not supporting UTF-8 mode is that you can get empty matches that split a codepoint. You might be able to then handle those in a different way (perhaps via the iterator logic described in the empty module docs). UTF-8 mode is somewhat of a special case for things that are guaranteed to be valid UTF-8, such as Rust's &str. For anything that isn't guaranteed to be valid UTF-8, UTF-8 mode has unspecified behavior.
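To illustrate how the two knobs are separated in regex-automata (a minimal sketch assuming the current builder APIs; the pattern is arbitrary): the syntax config's unicode option controls how the pattern itself is interpreted, while the Thompson NFA's utf8 option only controls whether empty matches may split a codepoint.

use regex_automata::{
    nfa::thompson::{self, pikevm::PikeVM},
    util::syntax,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Unicode mode (pattern interpretation) stays on; UTF-8 mode (whether
    // empty matches may split a codepoint) is turned off.
    let re = PikeVM::builder()
        .syntax(syntax::Config::new().unicode(true))
        .thompson(thompson::Config::new().utf8(false))
        .build(r"\w*")?;
    let mut cache = re.create_cache();
    // \w is still Unicode-aware here: it matches the 2-byte character "δ".
    assert_eq!(re.find(&mut cache, "δ").unwrap().range(), 0..2);
    Ok(())
}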

It's also worth pointing out that, at least for the lazy DFA and fully compiled DFAs, you don't have to use the search APIs that handle empty matches. You can implement your own.

pascalkuthe commented 11 months ago

Thanks for your response @BurntSushi

I'm not quite sure to be honest.

After thinking about it some more I am not sure either. I think it's possible that the NFA/DFA would try to match some other pattern and then terminate at the empty match (like foo| matching fox would require backtracking in the empty module, I think).

You might be thinking of the support for overriding the line terminator to be any particular byte. The empty module docs discuss that briefly at the very end

yeah sorry, that is what I meant. I didn't want to spell it all out and got tripped up by the documentation referring to that, and somehow thought they were the same thing.

It's also worth pointing out that, at least for the lazy DFA and fully compiled DFAs, you don't have to use the search APIs that handle empty matches. You can implement your own.

Yeah that's what I am doing (I am also reimplementing the pikevm), but I am also reimplementing/copying the empty module (and using it in my search primitives just like the upstream equivalents) since I ideally wanted to allow the streaming regex engine to handle all the edge cases the regex crate can (for backtracking cursors that should be possible).

I wonder if there is perhaps a way to "not solve" this problem. I can think of two different approaches to take there:

  • No support for stream searching for regexes that can match the empty string. Or special case it somehow.
  • No support for UTF-8 mode. In your comment, it looks like you're conflating Unicode mode with UTF-8 mode. These modes are fully de-coupled in regex-automata. The only thing you lose by not supporting UTF-8 mode is that you can get empty matches that split a codepoint. You might be able to then handle those in a different way (perhaps via the iterator logic described in the empty module docs). UTF-8 mode is somewhat of a special case for things that are guaranteed to be valid UTF-8, such as Rust's &str. For anything that isn't guaranteed to be valid UTF-8, UTF-8 mode has unspecified behavior.

For my personal use case of ropey/helix all of this doesn't matter since backtracking isn't an issue there. The first option is actually quite interesting to me. We never want an empty match in helix. I would be happy to just remove those states from the DFA/NFA somehow?

I would like to find a way to at least partially support these cases in the future too (probably using the old hack the regex crate used for iteration; iteration over a cursor without backtracking is probably pretty niche anyway, since that only works with earliest-match semantics for similar reasons).

BurntSushi commented 11 months ago

I would be happy to just remove those states from the DFA/NFA somehow?

Oh interesting, I wasn't thinking of it this way. I was thinking of it as in "regex compilation returns an error if streaming mode is asked for and it can match the empty string." I'm not sure how to do your idea. What would you do with a regex like (?m:^$) that matches empty lines, for example? I was thinking you'd reject it or perhaps put it through a different path that isn't as optimized. Dunno. It's a half-baked idea.

(It's worth noting that Hyperscan---which supports stream searching---errors on regexes that can match the empty string by default. You have to go out of your way to enable HS_FLAG_ALLOWEMPTY. Although I don't think this is related to streaming, but rather, given Hyperscan's match semantics, it would always result in reporting a match at literally every position.)

ideally wanted to allow the streaming regex engine to handle all the edge cases the regex crate can (for backtracking cursors that should be possible). ...snip... For my personal use case of ropey/helix all of this doesn't matter since backtracking isn't an issue there.

I do wonder if it makes sense to forget about the "true streaming" use case and really just focus on the "non-contiguous storage" use case. It would be an incremental improvement and it would satisfy some real world use cases. It isn't as operationally flexible which is a bummer.

pascalkuthe commented 11 months ago

Oh interesting, I wasn't thinking of it this way. I was thinking of it as in "regex compilation returns an error if streaming mode is asked for and it can match the empty string." I'm not sure how to do your idea. What would you do with a regex like (?m:^$) that matches empty lines, for example? I was thinking you'd reject it or perhaps put it through a different path that isn't as optimized. Dunno. It's a half-baked idea.

For my specific use case I essentially want to select all text that matches a regex pattern. An empty match essentially leads to no selection and therefore can be ignored (they actually lead to bugs currently and I was going to work around it downstream by just checking the length of each match in a .filter()), so just "removing" the empty state from the NFA would work quite well for me.

Reporting an error for your example could be fine, but repetitions tend to be a bit annoying. Patterns like (foo)* are pretty common. * and + are often used interchangeably if people don't pay too much attention (like when working interactively in a text editor). So to me just removing the transitions from the NFA that could lead to matching an empty string would be perfect.

Of course that is kind of specific to my use case and reporting an error is probably the better general solution (I would love a better solution for (foo)* though, maybe something can be done at a syntactic level?)

I do wonder if it makes sense to forget about the "true streaming" use case and really just focus on the "non-contiguous storage" use case. It would be an incremental improvement and it would satisfy some real world use cases. It isn't as operationally flexible which is a bummer.

I have been tempted by this too. It may indeed make more sense to do that for now. I actually have no special codepath for the non-backtracking case right now. I have a cursor trait:

pub trait Cursor {
    fn chunk(&self) -> &[u8];
    /// Whether this cursor can be used for unicode matching. That
    /// means specifically that it promises that unicode codepoints are never
    /// split across chunk boundaries.
    fn utf8_aware(&self) -> bool;
    /// Returns true if successful, false at EOI
    fn advance(&mut self) -> bool;
    /// Returns true if successful, false at EOI
    /// or if backtracking is not supported
    fn backtrack(&mut self) -> bool;
}

and the backtrack function will simply always fail for a cursor that can't backtrack. I have an abstraction around this that tracks the position of the cursor and will panic if backtrack fails at a non-zero offset.
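As an illustration, an implementation of that trait for a rope-like source whose chunks are all in memory might look like the sketch below. The ChunkCursor type is hypothetical and only shows the intended cursor movement; such a cursor can always backtrack.

// Hypothetical cursor over a set of in-memory chunks (e.g. the leaves of a
// rope). Chunk boundaries are assumed to be codepoint-aligned, hence
// utf8_aware returns true, and backtracking is always possible.
struct ChunkCursor<'a> {
    chunks: &'a [&'a [u8]],
    pos: usize,
}

impl Cursor for ChunkCursor<'_> {
    fn chunk(&self) -> &[u8] {
        self.chunks[self.pos]
    }
    fn utf8_aware(&self) -> bool {
        true
    }
    fn advance(&mut self) -> bool {
        if self.pos + 1 < self.chunks.len() {
            self.pos += 1;
            true
        } else {
            false
        }
    }
    fn backtrack(&mut self) -> bool {
        if self.pos > 0 {
            self.pos -= 1;
            true
        } else {
            false
        }
    }
}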

So for regex searches that won't run into these restrictions you can totally still use my crate. I was mostly trying to remove the cases where it would panic, but it might be better to ignore that use case for now.

Now that you mention it, this API probably isn't too useful for something "fully" streaming, as they would probably want to treat the regex search more like a state machine (and using regex-automata directly would work pretty well in that case I think; I imagine these are highly specific use cases that may even be able to constrain their regex patterns further).

So that was a lot of words to say, you are right I probably shouldn't focus on the fully streaming case for now :D

BurntSushi commented 11 months ago

Yeah I like the idea of building a Shitty First Draft that targets what you need specifically first. That will get you (and us) some real world experience with a prototype. And then hopefully it can be iterated on and improved in the future.

pascalkuthe commented 8 months ago

I managed to implement a fully working meta regex engine (pikevm + dfa + hybrid + prefilter) that passes all regex tests at https://github.com/pascalkuthe/regex-cursor.

I put a short summary in the readme. I think it would be nice to upstream this eventually since I had to duplicate a ton of private code (primarily in the meta engine) and I am a bit worried about maintaining the duplication in the long run. The cursor API that I came up with is very generic and the pikevm implementation could be made to fit fully streaming input in the future (although with heavy limitations, see the readme).

Of course this is just a prototype and there is a long road to getting this upstream, but I want to get the ball rolling a bit. A couple points:

There is also some stuff missing:

BurntSushi commented 8 months ago

@pascalkuthe Wow, that is amazing. I'll have to take a deeper look soon.

Maybe some different abstraction could work that would essentially be an Input trait

The main problem here I think, and why I designed it the way I did, was to avoid making everything in the crate polymorphic. My suspicion is that with Input being a trait, compile times would suck even worse than they do today.

pascalkuthe commented 8 months ago

yeah, that is a good point, maybe the boilerplate cost is worth paying to avoid bloating compile time. It may also make sense to keep all the cursor stuff behind feature flags. There is a lot of code that is just fully copy-pasted.

BurntSushi commented 8 months ago

Yeah I'm definitely overall in favor of finding a way to upstream your work. It's a really nice use case to serve if we can do it. But I'll want to understand more about the trade offs.

Are you planning to publish the crate? (So that it's easy to read the crate docs.)

pascalkuthe commented 8 months ago

Yeah, I will publish the crate later today, but full disclaimer: the docs still need work. They are nowhere near the great documentation upstream has (most of the docs are also copied from upstream where the API is covered).

The public API probably won't be too interesting for you though since it's mostly a carbon copy from regex-automata (including docs). Mostly the cursor trait will be interesting.

BurntSushi commented 8 months ago

Yeah no/poor docs is cool. Just want to navigate the types and what not.

pascalkuthe commented 8 months ago

I polished the docs up a bit and pushed it to crates.io, it should be on docs.rs in a couple minutes: https://docs.rs/regex-cursor/

ratmice commented 8 months ago

Had a look at what would be involved with implementing the overlapping search routines, and indeed I see what you mean by lots of copy/paste, since these rely on some pub(crate) fields of e.g. OverlappingState. Not sure how far I want to jump down that rabbit hole.

Edit: FWIW the impression I get is that I'd overall likely be better off just running matches over many regexes individually than trying to complete that aspect of the API, if only for maintainability's sake.

CeleritasCelery commented 8 months ago

@pascalkuthe have you run any benchmarks against your implementation? I am curious how it compares as a baseline.

pascalkuthe commented 8 months ago

No, I haven't gotten around to that yet. The performance will mostly depend on your collection (a collection with small chunks and a slow cursor will see a much larger impact than one with large chunks and a fast cursor). I will only be benchmarking ropey as that is the only practical use case I have.

There will be some slowdown from the cases being accelerated by the engines/strategies not implemented yet but that is only temporary.

The only thing where performance would really be interesting is the prefilter, since that actually does have some additional complexity. The rest I would expect to be very close to upstream regex if chunk breaks are reasonably rare (which they are in ropey's case).

CeleritasCelery commented 8 months ago

I am really excited for this feature! Thank you for putting in so much work clearing a path for it. I took some time to benchmark it using rebar to get baseline results. This is for the unicode, curated, dictionary, wild, and reported benchmarks. I only included results where there is more than a 10% difference (there were 54 benchmarks where the difference was 10% or less). I don't know these benchmarks or the regex internals well enough to say which of these come from inherent differences between the two implementations, and which come from missing engines/strategies.

benchmark                                           rust/regex           rust/regex/cursor
---------                                           ----------           -----------------
curated/01-literal/sherlock-en                      23.1 GB/s (1.00x)    20.6 GB/s (1.12x)
curated/04-ruff-noqa/real                           1146.8 MB/s (1.00x)  540.9 MB/s (2.12x)
curated/05-lexer-veryl/single                       7.1 MB/s (1.00x)     696.2 KB/s (10.38x)
curated/06-cloud-flare-redos/original               408.2 MB/s (1.00x)   222.8 MB/s (1.83x)
curated/06-cloud-flare-redos/simplified-long        44.8 GB/s (1.00x)    22.3 GB/s (2.00x)
curated/07-unicode-character-data/parse-line        265.3 MB/s (1.00x)   26.2 MB/s (10.14x)
curated/09-aws-keys/full                            1353.5 MB/s (1.00x)  1119.8 MB/s (1.21x)
curated/11-unstructured-to-json/extract             66.6 MB/s (1.00x)    24.1 MB/s (2.77x)
curated/14-quadratic/2x                             6.0 MB/s (1.10x)     6.7 MB/s (1.00x)
dictionary/search/english                           86.9 MB/s (1.00x)    18.5 MB/s (4.70x)
dictionary/search/english-tiny                      145.5 MB/s (1.00x)   24.2 MB/s (6.00x)
dictionary/search/english-10                        153.7 MB/s (1.00x)   14.4 MB/s (10.65x)
reported/i13-subset-regex/original-ascii            4.1 GB/s (1.00x)     344.4 MB/s (12.15x)
reported/i13-subset-regex/original-unicode          255.2 MB/s (1.00x)   12.0 MB/s (21.22x)
reported/i13-subset-regex/big-ascii                 65.7 MB/s (1.00x)    10.5 MB/s (6.24x)
reported/i13-subset-regex/big-unicode               11.7 MB/s (1.00x)    10.5 MB/s (1.12x)
reported/i13-subset-regex/huge-ascii                11.1 MB/s (1.00x)    9.9 MB/s (1.13x)
reported/i13-subset-regex/huge-unicode              11.2 MB/s (1.00x)    9.9 MB/s (1.14x)
reported/i13-subset-regex/huge-ascii-nosuffixlit    11.1 MB/s (1.00x)    9.9 MB/s (1.13x)
reported/i13-subset-regex/huge-unicode-nosuffixlit  11.2 MB/s (1.00x)    9.8 MB/s (1.15x)
unicode/codepoints/letters-one                      11.3 MB/s (1.00x)    10.1 MB/s (1.12x)
unicode/codepoints/letters-alt                      11.2 MB/s (1.00x)    10.1 MB/s (1.11x)
unicode/overlapping-words/ascii                     42.2 MB/s (1.00x)    19.9 MB/s (2.12x)
unicode/overlapping-words/english                   6.2 MB/s (1.00x)     2.9 MB/s (2.12x)
unicode/overlapping-words/russian                   5.8 MB/s (1.00x)     2.5 MB/s (2.30x)
unicode/word/around-holmes-english                  28.6 GB/s (1.00x)    584.9 MB/s (50.10x)
wild/bibleref/short                                 35.0 MB/s (1.00x)    9.4 MB/s (3.74x)
wild/bibleref/line                                  475.6 MB/s (1.00x)   423.9 MB/s (1.12x)
wild/caddy/caddy                                    349.5 MB/s (1.00x)   36.1 MB/s (9.69x)
wild/dot-star-capture/rust-src-tools                433.9 MB/s (1.00x)   21.1 MB/s (20.55x)
wild/parol-veryl/ascii                              7.0 MB/s (1.00x)     698.4 KB/s (10.30x)
wild/parol-veryl/unicode                            4.9 MB/s (1.00x)     461.0 KB/s (10.81x)
wild/parol-veryl/multi-captures-ascii               18.4 MB/s (1.00x)    5.1 MB/s (3.64x)
wild/ruff/whitespace-around-keywords                213.2 MB/s (1.00x)   91.4 MB/s (2.33x)
wild/ruff/noqa                                      1132.9 MB/s (1.00x)  539.5 MB/s (2.10x)
wild/ruff/space-around-operator                     322.2 MB/s (1.00x)   174.0 MB/s (1.85x)
wild/ruff/shebang                                   680.9 MB/s (1.37x)   930.6 MB/s (1.00x)
wild/rustsec-cargo-audit/original-unix              17.0 GB/s (1.00x)    10.8 GB/s (1.57x)
wild/rustsec-cargo-audit/original-windows           16.4 GB/s (1.00x)    9.8 GB/s (1.67x)
wild/rustsec-cargo-audit/both-slashes               17.1 GB/s (1.00x)    10.3 GB/s (1.65x)
wild/rustsec-cargo-audit/both-alternate             17.1 GB/s (1.00x)    11.4 GB/s (1.51x)
pascalkuthe commented 8 months ago

I think we can ignore everything under 20% for now.

I haven't gone through all of them, but for most of the big offenders (the ones that are an order of magnitude slower) I think I have identified the cause.