rust-lang / regex

An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.
https://docs.rs/regex
Apache License 2.0
3.51k stars 440 forks source link

Add Min DFA for a regex #1159

Closed Bisht13 closed 6 months ago

Bisht13 commented 8 months ago

I want an API that allows me to get the min DFA for a particular regex, for example, take this regex "(a|b)+d", I would want a min DFA of this nature,

image

Something which is parsable. Currently I tried using the regex-automata crate using this piece of code,

use regex_automata::dfa::onepass::DFA;
let re = DFA::new(r"(a|b)+d").unwrap(); 
println("{:?}", re);

But could only get the Debug information which I couldn't make a sense of for being able to parse it to create min DFA myself.

onepass::DFA(
D 000000: 
  000001: \x01 => 2 (S-0)
  000002: \x01 => 2 (S-0-1), \x03 => 3 (S-1)
* 000003 (0): 

START(ALL): 1
state length: 4
pattern length: 1
)

Thus, I feel exposing the min DFA would be quite helpful.

BurntSushi commented 8 months ago

Why are you using a onepass DFA?

Why not use a dense DFA and enable the minimize option? https://docs.rs/regex-automata/latest/regex_automata/dfa/dense/struct.Config.html#method.minimize

BurntSushi commented 8 months ago

And the onepass DFA you show is the same as the DFA graph you show.

Bisht13 commented 8 months ago

Why are you using a onepass DFA?

Why not use a dense DFA and enable the minimize option? https://docs.rs/regex-automata/latest/regex_automata/dfa/dense/struct.Config.html#method.minimize

I did and there are too many states as compared to what should be there.

image
BurntSushi commented 8 months ago

Only because it supports unanchored and anchored searches. You can change that also in the configuration. The DFA you show only does an anchored search.

BurntSushi commented 8 months ago

https://docs.rs/regex-automata/latest/regex_automata/dfa/dense/struct.Config.html#method.start_kind

Bisht13 commented 8 months ago

Still too many states

image

Also, how do I get the DFA?

BurntSushi commented 8 months ago

No, that's a minimal DFA. There's exactly one extra state to account for the end-of-input transition.

And you have the DFA. I don't know what you're asking for otherwise. Most of its functionality is made available via the Automaton trait.

You haven't shared any code, and the only thing I have to go on is that you want a minimal DFA. But just because the thing you have isn't precisely identical to what you might find in a textbook about DFAs doesn't mean it isn't minimal.

I would really encourage you to read more of the docs here. I realize there is a lot of it, but this is a necessarily complex API to support there engineering necessary to make a general purpose regex engine work with finite automata.

Otherwise, it seems to me like there might be an XY problem here. A minimal DFA is a means to an end, not an end to itself.

Bisht13 commented 8 months ago

Thank you so much for your reply and time. Actually I need to create a JSON which I'd use for a js project via wasm. Yes, it gives the states but the transitions can't be comprehended. I don't know how \x01 is referring to (a|b), like some sort of mapping would have had been helpful.

BurntSushi commented 8 months ago

Please read the docs for the DFA configuration. And then the Automaton docs. They should answer everything. This crate uses alphabet compression, and there is a knob called byte_classes that explains this. And the mapping can be accessed: https://docs.rs/regex-automata/latest/regex_automata/dfa/dense/struct.DFA.html#method.byte_classes

like some sort of mapping would have had been helpful.

You say this as if the mapping isn't accessible to you. But it is. It's all there.

BurntSushi commented 8 months ago

If you're using the DFA from this crate but not its search routines, you'll need to read the Automaton docs very carefully to implement your own search routines. This is not your average simplistic DFA library. :)

Bisht13 commented 8 months ago

Got it, I think more or less I have everything that I need, one last question, can I set any sort of configuration to avoid the end of input (EOI) and sanity check (\x00-\xFF) so as to remove them from the DFA?

BurntSushi commented 8 months ago

can I set any sort of configuration to avoid the end of input (EOI)

Nope. I'm not aware of any compelling reason to do so.

sanity check (\x00-\xFF)

I don't know what you mean by this. There is no "sanity check" in the DFAs generated by this crate.

Bisht13 commented 8 months ago

Why are both state 3 and 5 there? State 5 is not needed at all and the elements from 4 can just point to 3. How is this min DFA?

image
BurntSushi commented 8 months ago

Because 3 is a match state. 5 is not.

Bisht13 commented 8 months ago

For regex [^\\.]+, this shouldn't be the min DFA.

image
BurntSushi commented 8 months ago

Yes it should.

Bisht13 commented 8 months ago

No, this is the correct DFA, there should only be 2 states in theory.

image
BurntSushi commented 8 months ago

I'll be frank here: I can't keep going back-and-forth with you like this. It isn't my job to constantly prove that your unstated (and unknowable, to me) expectations of what a "minimal DFA" are for any given regex are wrong. That's not to say that it's impossible for this library to have any bugs, but if you're going to assert that its output is incorrect, I really do need you to spend a little more time on your end understanding the library you're using. In the example of your most recent comment for example, there are multiple different ways to interpret it:

Your comments leave me guessing as to which of these possibilities is actually in play here. This is an enormous amount of work on my end, and while I want to make myself available to answer questions, I have to put up a boundary here and ask you do a little more work on your end because I can't keep doing this.

No, this is the correct DFA

Stop asserting this. It isn't the correct DFA because it doesn't match the semantics of the regex you're using.

Bisht13 commented 8 months ago

I'll tell you what I want, maybe it might not be in the scope of this crate. Suppose I have a text dabcdabc and I have a regex abc. Now this will have 2 matches, both abc with different indexes. Now I want to have a min DFA, where I can parse these matches, i.e., abc. So, in this case, 0 -a-> 1 -b-> 2 -c-> 3*.

I understand that the crate right now works in a way where I can capture either anywhere in between the text (unanchored) or from the beginning or end (anchored). For my case, the first use of the crate is straightforward, getting the matches, I'm having difficulty in the second DFA where I need a DFA which is able to parse the match, that is it. I want to this because I want to create a zero knowledge proof of the regex's capture's existence (working on zk-regex). If the second DFA is not already present in the crate itself (as it may not be needed), then also it's completely fine. Sorry for bothering and disturbing you so much.

BurntSushi commented 8 months ago

This crate does not let you say, "I want a DFA that has precisely these states." The crate let's you say, "I want a DFA that matches this pattern, with these semantics and minimize it such that it is as small as possible given the configuration and assumptions made by this crate." The "assumptions made by this crate" is not meant to be a weasel phrase. There is nothing in this crate that will materially impact the minimal size of a DFA. It may cause minimal DFAs to be a couple states bigger (a constant factor), but that's it. Separately from that, a minimal DFA may be bigger than you expect if you don't understand the semantics of the regex pattern you've written. [^a] for example is not a single DFA state. It is many. But (?-u:[^a]) is a single DFA state. Unicode matters here and it is enabled by default.

I understand that the crate right now works in a way where I can capture either anywhere in between the text (unanchored) or from the beginning or end (anchored). For my case, the first use of the crate is straightforward, getting the matches, I'm having difficulty in the second DFA where I need a DFA which is able to parse the match, that is it.

So you want an anchored search? Then that's this, as I mentioned above: https://docs.rs/regex-automata/latest/regex_automata/dfa/dense/struct.Config.html#method.start_kind

... but you also said:

Suppose I have a text dabcdabc and I have a regex abc. Now this will have 2 matches, both abc with different indexes. Now I want to have a min DFA, where I can parse these matches

So if you want to find the match abc in dabcdabc and starting at the beginning, then you need an unanchored search. Not an anchored search. Your DFA of 0 -a-> 1 -b-> 2 -c-> 3* is incorrect.

Sorry for bothering and disturbing you so much.

I appreciate that. But to be clear, I don't want an apology. I want you to do some work on your end. I tried to explain why your approach was extremely time consuming for me, because I don't know what it is that you want. And even now, with your latest response, it seems this entire issue is even more confused than it was. You say you want to find matches anywhere in a haystack, but then go on to say that you want an anchored search.

I think at this point I really do have to insist that you provide a minimal working program, even if it doesn't do what you want. Write down the commands to compile and run it. Then tell me the output that you see. And then tell me the output that you want to see. This should have been how you started this issue.

BurntSushi commented 8 months ago

I think this guide might also be helpful to you: https://www.freecodecamp.org/news/asking-effective-questions-a-practical-guide-for-developers/

Bisht13 commented 8 months ago
use regex::Regex;

pub fn main() {
    // Step 1 - Find all matches of a regex in a string

    let text = "dabcdabcd";
    let re = Regex::new(r"abc").unwrap();
    let mut results = vec![];
    for mat in re.find_iter(text) {
        results.push(mat.as_str());
    }

    // Step 2 - Create a DFA that only matches the matches in the results vector

    /*
        Firstly, I get all the matches of the regex in the text,
        and stores them in the results vector. After this, I want a
        DFA which can only parse the matches in the results vector.
        No characters before or after, just the matches. I don't
        want anchored or unanchored kind since I don't want to match
        anything except the matches in the results vector, neither
        before nor after. So think of it as this way, that I just
        want to check if the DFA can parse the matches in the results
        or not. Hence for this, the DFA comes out to be
                        0 -a-> 1 -b-> 2 -c-> 3*
    */
}

I think it should be clear now.

BurntSushi commented 6 months ago

I'm going to close this because I think I've been about as helpful as I can possibly here. I do not understand the disconnect. Sorry.