rust-lang / regex

An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.
https://docs.rs/regex
Apache License 2.0
3.51k stars 440 forks source link

decrease memory usage of DFA with variable width delta encoding of instruction pointers #199

Closed BurntSushi closed 8 years ago

BurntSushi commented 8 years ago

(This is a ticket that I feel has limited scope, and as such, I'd be happy to mentor it. I think I'd consider this "medium" if one has worked with finite automata before, and probably "hard" if not.)

The lazy DFA works by computing states as they are visited during search. At most one state is created per byte of the haystack. In the vast majority of cases, the time it takes to do state construction is negligible because they are reused throughout search. However, the DFA does have pathological cases that can lead to exponential blow up in the number states. In particular, the worst case is when every byte of input results in a new state being created. For this reason, the DFA puts a bound on how many states are stored. If that bound is reached, all existing states are deleted and the DFA continues on (possibly recomputing states that were previously deleted). Since creating states can be slow, if the DFA clears the state cache too frequently, then the DFA will give up and a slower matcher will run in its place.

Therefore, given some fixed memory requirement of N bytes, it makes sense to try and figure out how we can store as many states as possible in N bytes. The more states we can fit into N bytes, the less frequently we flush the cache, which in turn increases the number of regexes and haystacks that we can process with the DFA.

Let's take a look at how State is defined:

struct State {
    next: Box<[StatePtr]>,
    insts: Box<[InstPtr]>,
    flags: StateFlags,
}

next corresponds to the state's transitions. insts corresponds to the set of NFA states that make up this single DFA state. flags indicates some facts about the state, such as whether it is a match state, whether it observes word characters (to support word boundary assertion lookaround) and whether any of its NFA states correspond to empty assertions.

One somewhat straight-forward way to reduce the memory requirements is to optimize this representation. flags is probably as small as it's going to get: it's a single byte. next is also unfortunately as small as it will get since it must support fast random access. Notably, it is used inside the DFA's inner loop. insts however is not used in the DFA fast path and is only used during state construction. State construction is permitted to be somewhat slow, which means we should be able to optimize for space at the expense of time.

Currently, insts is a sequence of 32 bit pointers to instructions in an NFA program. This means that every NFA state in a DFA state occupies 4 bytes. Some DFA states can contain many NFA states, especially regexes that contain large Unicode character classes like \pL. It is also my hypothesis that the set of instruction pointers in each DFA state probably occur close together in the NFA graph. This means that the overall delta between the pointers is probably small.

This means that we should be able to take advantage of delta compression. That is, insts would change its representation from a Box<[InstPtr]> to a Box<[u8]>, where we would write delta encoded pointers. For example, consider this sequence of instruction pointers:

[2595, 2596, 2597, 2598, 2599, 2600, 2601]

These are real instruction pointers extracted from the (a|b|c|d) portion of the regex \pL(a|b|c|d). The actual program can be seen using the regex-debug tool:

$ regex-debug compile '\pL(a|b|c|d)' --dfa
...
2595 Split(2596, 2597)
2596 Bytes(a, a) (goto: 2602)
2597 Split(2598, 2599)
2598 Bytes(b, b) (goto: 2602)
2599 Split(2600, 2601)
2600 Bytes(c, c) (goto: 2602)
2601 Bytes(d, d)
2602 Match(0)

Notably, each of these instruction pointers could be represented with two bytes, but they are always represented by four bytes today. In fact, we can do better with a delta encoding. After applying a delta encoding, here's what the new instruction pointers look like:

[2595, 1, 1, 1, 1, 1, 1]

Now, only the first pointer requires two bytes while the remaining require a mere one byte. All we need to do is apply this delta encoding to our insts list, then encode the resulting pointers using a variable-length encoding, like the one used in protocol buffers. We would then need a way to decode and iterate over these pointers, which is needed here: https://github.com/rust-lang-nursery/regex/blob/2ab7be3d043a1ef640dc58ec4a4038d166ba1acd/src/dfa.rs#L644

And that should do it.

Here are some other thoughts:

  1. It would be cool to get the memory usage down low enough to improve the sherlock::repeated_class_negation benchmark, although I think it might not be quite enough. (I think its DFA requires around 5MB of space, and I don't think this optimization will eliminate more than half of the memory required, but maybe it will.)
  2. Ideally, we don't use unsafe and don't use any additional dependencies. I think the encoding scheme is simple enough that we can hand roll it. We shouldn't need any unsafe because we just don't care that much about CPU cycles spent here.
  3. Tests may be hard. Currently, the only real test that is even close to this issue is dfa_handles_pathological_case in tests/crazy.rs, but it's not clear how that can be used to tese this optimization in particular. My feeling is that the most important tests for this addition already exist. The benchmarks will also catch us if we do something really dumb like increase the memory requirements.
killercup commented 8 years ago

Hey @BurntSushi, I may be interested in working on this!

I've been meaning to work on regex (because it's really interesting), but I haven't had much time. I haven't done much work with finite automata (outside of my theoretical CS courses), so it'll probably take me some time to get started. (If you want to get this done quickly, I'll gladly let someone else take over, of course.)

BurntSushi commented 8 years ago

@killercup No need to have it done quickly! I'm happy to mentor any experience level. :-)

BurntSushi commented 8 years ago

@killercup To adjust the difficulty level a bit: I think the actual optimization work should not be that bad, since it's really just a simple encoding/decoding scheme. The reason why I ranked the difficulty higher than that is because the DFA code is dense and it may take a bit of time to grok it!

BurntSushi commented 8 years ago

Note that PR #202 changes the representation of State somewhat. In particular, the transitions are no longer stored in a State proper. So the representation now looks like this:

struct State {
    insts: Box<[InstPtr]>,
    flags: StateFlags,
}

I think everything I said above still applies.

Another thought: if insts goes to a Box<[u8]>, then it seems sensible to store the flag byte in the first byte of insts, which should shave off 7 bytes of overhead per state value because of struct alignment/padding. This means that State simply becomes struct State(Box<[u8]>). Cool!

killercup commented 8 years ago

Very nice! Thanks for the heads up. I'll keep that in mind :smiley:

BurntSushi commented 8 years ago

@killercup Did you end up deciding whether you wanted to tackle this?

killercup commented 8 years ago

@BurntSushi Yes, I still want to do this! I have spend some time experimenting and thinking about this, but haven't had enough time yet to do something meaningful. I hope that changes in the next weeks. I'll keep you up-to-date :)

BurntSushi commented 8 years ago

@killercup Great! Look forward to it. I tried assigning you to this issue, but I guess I can only assign project members. Sorry. :-(

BurntSushi commented 8 years ago

@killercup It turns out that this doesn't help the repeated_class_negation benchmark. It simply requires too many states. Of course, even if we did fix it, there will always be regexes that exceed the cache size. Still, this change should expand the class of regexes executable by the DFA.