qntm / greenery

Regular expression manipulation library
http://qntm.org/greenery
MIT License

Support NFAs #95

Closed thomasahle closed 7 months ago

thomasahle commented 1 year ago

If I run `parse('(0|1)*1' + '(0|1)' * n).to_fsm()`, I get an FSM with 2^(n+1) states. This is a common issue with using DFAs rather than NFAs for regular expression matching: while DFAs support faster matching (linear in the input, instead of the roughly quadratic cost of NFA simulation), they can take exponential space.

For my use case I don't mind paying the cost of running the NFA with a breadth-first, set-of-states simulation. But I'd rather not accidentally allocate exponential memory and crash my program. Many operations, such as union and concatenation, are also easier on NFAs, since you can insert epsilon edges to solve a lot of problems.
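Both halves of the trade-off can be illustrated self-containedly. The toy encoding below (a `{(state, symbol): {next_states}}` dict; the helper names are my own, not greenery's) builds an NFA for `(0|1)*1(0|1)^n` with n+2 states, shows that the subset construction reaches 2^(n+1) DFA states, and simulates the NFA directly with a set of current states in linear memory:

```python
def make_nfa(n):
    # Toy NFA for (0|1)*1(0|1)^n: states 0..n+1, start 0, accept n+1.
    # delta maps (state, symbol) -> set of successor states.
    delta = {(0, "0"): {0}, (0, "1"): {0, 1}}
    for i in range(1, n + 1):
        delta[(i, "0")] = {i + 1}
        delta[(i, "1")] = {i + 1}
    return delta, 0, {n + 1}

def nfa_accepts(delta, start, accept, s):
    # Set-of-states ("breadth-first") simulation: memory stays O(|states|).
    current = {start}
    for c in s:
        current = set().union(*(delta.get((q, c), set()) for q in current))
    return bool(current & accept)

def dfa_state_count(delta, start):
    # Subset construction: count the reachable DFA states.
    seen = {frozenset([start])}
    frontier = [frozenset([start])]
    while frontier:
        S = frontier.pop()
        for c in "01":
            T = frozenset(set().union(*(delta.get((q, c), set()) for q in S)))
            if T not in seen:
                seen.add(T)
                frontier.append(T)
    return len(seen)

delta, start, accept = make_nfa(6)
print(dfa_state_count(delta, start))                  # 128 == 2**(6+1)
print(nfa_accepts(delta, start, accept, "01000000"))  # True: 7th-from-last bit is 1
```

The DFA must remember the last n+1 input bits (any of them could be the distinguished `1`), hence the 2^(n+1) reachable subsets, while the NFA simulation never holds more than n+2 states at once.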

qntm commented 1 year ago

I 100% get why you would ask for this, but unfortunately it would be a fundamental change to the way greenery currently operates. Pretty much every step of the regular-expression-to-FSM-to-regular-expression conversion process would need to be reimplemented from scratch. DFAs are much simpler to work with here than NFAs.

If someone has a clear plan for how this migration/upgrade could be carried out in relatively small, manageable stages, I'd definitely be interested to hear about that, and I'd review some PRs, but this isn't a change I'm planning to undertake myself at present. Thanks.

thomasahle commented 1 year ago

I wrote the code for parsing the pattern form into an NFA yesterday. I guess it doesn't completely match your high standard of annotations and comments, but I did test it fairly comprehensively. I also wrote a script to convert the NFA into a dot file for Graphviz to visualize, if you are interested.
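A minimal sketch of such a DOT exporter (the `{(state, symbol): {next_states}}` encoding is assumed for illustration, not taken from the linked gist):

```python
def nfa_to_dot(delta, start, accept):
    # Emit Graphviz DOT for an NFA given as {(state, symbol): {next_states}}.
    # Render with e.g. `dot -Tpng nfa.dot -o nfa.png`.
    lines = ["digraph nfa {", "  rankdir=LR;"]
    lines.append('  __start [shape=none, label=""];')  # invisible entry marker
    lines.append(f"  __start -> {start};")
    for q in sorted(accept):
        lines.append(f"  {q} [shape=doublecircle];")   # accepting states
    for (q, sym), targets in sorted(delta.items()):
        for t in sorted(targets):
            lines.append(f'  {q} -> {t} [label="{sym}"];')
    lines.append("}")
    return "\n".join(lines)

print(nfa_to_dot({(0, "a"): {1}, (0, "b"): {0, 1}}, 0, {1}))
```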

thomasahle commented 1 year ago

Regarding a plan: I'm not saying you should drop the DFA functionality. I think having both would make a nice graph of possible conversions:

regex ◄────► pattern ◄────► dfa
                │            ▲
                │            │
                └────────►  nfa

qntm commented 1 year ago

That would be nice, but I think there's a piece missing if this is to be a genuinely useful improvement. In that graph, the only way to convert from an NFA back to a Pattern is via a DFA, which, as you rightly pointed out, may have exponentially many states. So just adding the Pattern -> NFA edge doesn't really address the problem.

What you would need is a method for converting the NFA directly back to a Pattern. For comparison, to convert the DFA back to a Pattern I used this code, which I imagine you may have already seen. I think this would be the hard part. Do you know how to do this? If so, then I think we'd be in business.
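For what it's worth, the standard state-elimination algorithm for converting automata to regexes never actually uses determinism: it only needs a graph of regex-labelled edges, so it applies to an NFA unchanged. A minimal sketch, using an assumed `{(u, v): regex_string}` edge dict of my own (not greenery's representation):

```python
def eliminate(states, edges, start, accept):
    # State elimination over a labelled transition graph.
    # `edges` maps (u, v) -> regex string. Assumes `start` has no incoming
    # and `accept` no outgoing edges (add fresh endpoints if needed).
    def union(a, b):
        return b if a is None else f"({a}|{b})"

    for q in states:
        if q in (start, accept):
            continue
        loop = edges.pop((q, q), None)          # self-loop becomes a star
        star = f"({loop})*" if loop else ""
        ins = [(u, r) for (u, v), r in edges.items() if v == q]
        outs = [(v, r) for (u, v), r in edges.items() if u == q]
        for key in [k for k in edges if q in k]:
            del edges[key]                      # remove q from the graph
        for u, r1 in ins:                       # reroute every in/out pair
            for v, r2 in outs:
                edges[(u, v)] = union(edges.get((u, v)), f"{r1}{star}{r2}")
    return edges.get((start, accept), "")

edges = {("s", "q"): "a", ("q", "q"): "b", ("q", "f"): "c"}
print(eliminate(["s", "q", "f"], edges, "s", "f"))  # a(b)*c
```

The parenthesization here is deliberately conservative; a production version would simplify as it goes, the way greenery's DFA-to-Pattern code presumably does.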

thomasahle commented 1 year ago

I understand that for this to be useful for your goal of computing intersections and similar things between regexes, we'd need functionality for doing that on the NFA directly.

However, if your goal is just "a library for parsing and manipulating regexes" maybe it's fine to just have a way to parse into an NFA?

Personally, I found your library great because it's the only Python regex library I could find that gives useful graph output, rather than just parsing/matching and so on.

qntm commented 1 year ago

Hmm... so, the Fsm class which performs all these manipulations isn't part of greenery's public API. This is deliberate, as I consider it an implementation detail and its API isn't particularly useful for generic FSM-related tasks. I would consider the alternative NFA-based implementation to be similar. I'm interested in the potential performance improvements. But I've never considered visualisation of the underlying state machines to be a goal of the project, for example. In fact I've considered this to be an essentially completed project for some years now.

Let me think about this. I'll get back to you.

AmBha21 commented 7 months ago

I'm curious whether this issue was ever resolved, and whether there is now a way to go from a Pattern to an NFA.

qntm commented 7 months ago

I'm of the opinion that this functionality isn't worth adding, for the reasons I described above. It doesn't actually improve the performance of any of the operations greenery is intended to facilitate.

thomasahle commented 7 months ago

@AmBha21 If you want to use my code it's here: https://gist.github.com/thomasahle/deebec216391e916a69b5ab1f2423e91

You can test it like this:

import pytest
from nfa2 import nfa_simulate, regex_to_nfa

@pytest.mark.parametrize("regex, string, expected", [
    # Test cases for matching specific characters
    ("abc", "abc", True),
    ("abc", "abcd", False),
    ("abc", "ab", False),
    # Test cases for matching digit characters
    (r"\d+", "123", True),
    (r"\d+", "abc", False),
    (r"\d+", "123abc", False),
    # Test cases for matching word characters
    (r"\w+", "helloWorld", True),
    (r"\w+", "hello World", False),
    # Test cases for matching whitespace characters
    (r"\s+", "   ", True),
    (r"\s+", "  a ", False),
    # Test cases for matching repetition
    ("a{3}", "aaa", True),
    ("a{3}", "aa", False),
    ("a{2,4}", "aaa", True),
    ("a{2,4}", "a", False),
    # Test cases for matching character sets
    ("[aeiou]", "a", True),
    ("[aeiou]", "b", False),
    # Test cases for matching either or
    ("a|b", "a", True),
    ("a|b", "c", False),
    # Test cases for matching arbitrary characters with .*
    ("a.*b", "acb", True),
    ("a.*b", "ac", False),
    # Test case for matching email addresses
    (r"[\w.\-]+@[a-zA-Z\d.\-]+\.[a-zA-Z]+", "john.doe@example.com", True),
    (r"[\w.\-]+@[a-zA-Z\d.\-]+\.[a-zA-Z]+", "john.doe.example.com", False),
    # Test case for matching hexadecimal numbers
    (r"0[xX][0-9a-fA-F]+", "0x1A3f", True),
    (r"0[xX][0-9a-fA-F]+", "0x1AGT", False),
    # Test case for matching IP addresses
    (r"((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)", "192.168.1.1", True),
    (r"((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)", "192.168.300.1", False),
    # Test case for matching dates in YYYY-MM-DD format
    (r"\d{4}-\d{2}-\d{2}", "2023-06-24", True),
    (r"\d{4}-\d{2}-\d{2}", "06-24-2023", False),
    # Test case for matching phone numbers (US format)
    (r"\(\d{3}\) \d{3}-\d{4}", "(123) 456-7890", True),
    (r"\(\d{3}\) \d{3}-\d{4}", "123-456-7890", False),
    # Test case for matching credit card numbers (16 digits)
    (r"\d{4}-\d{4}-\d{4}-\d{4}", "1234-5678-9123-4567", True),
    (r"\d{4}-\d{4}-\d{4}-\d{4}", "1234-56789123-4567", False),
    # Test case for matching URLs
    (r"https?://[^\s]+", "https://example.com", True),
    (r"https?://[^\s]+", "htp://example.com", False),
    # Test case for matching a string that doesn't contain numbers
    (r"[^\d]+", "NoNumbersHere", True),
    (r"[^\d]+", "There1sANumber", False),
    # Test case for matching strings containing only uppercase letters
    (r"[A-Z]+", "UPPERCASE", True),
    (r"[A-Z]+", "LowerCase", False),
])
def test_regex_match(regex, string, expected):
    nfa = regex_to_nfa(regex)
    assert nfa_simulate(nfa, string) == expected