Open mikewardle opened 5 years ago
Could this be that the IsEmpty property is never set?
I have tried changing to use BasicOperations.IsEmpty(i2) and now it finds empty intersections for cases I would not expect it to for example
var logger = new Mock<ILog>();
var sut = new RegexOverlapChecker(x => logger.Object);
var res = sut.IsNotOrthogonal(@"Easy Access Saver", @"Easy Access Saver Reward");
Assert.IsTrue(res);
This now appears to be down to the matching of the regular expressions themselves, running:
var a = new RegExp("Easy Access Saver").ToAutomaton();
var isMatch = a.Run("Easy Access Saver Reward");
returns false! There is clearly a problem with my understanding of the way the automaton are functioning in Fare.
incidentally using .net framework native regex:
var isMatch = new Regex("Easy Access Saver").IsMatch("Easy Access Saver Reward");
gives true.
@mikewardle, thank you for reporting this. Feel free to send over a pull request if you think you know a reasonable way to solve this.
Project Fare turns Regular Expressions into Automatons by applying the algorithms of dk.brics.automaton and xeger.
Unfortunately, I don't have an answer to your question, as Project Fare is really a port of the above Java projects.
You may use a different pattern or use a different engine to reverse the Regular Expression into an Automaton. As an example, you can use the Rex engine.
You are more than welcome to troubleshoot/debug this and open a pull request, however.
@mikewardle I've been interested in a similar problem today. First off,
var a = new RegExp("Easy Access Saver").ToAutomaton();
var isMatch = a.Run("Easy Access Saver Reward");
returns false
because the automaton matches the string exactly. AFAIK regular expression Easy Access Saver
in this library should be equivalent to ^Easy Access Server$
in C#. You can easily fix this by using a different expression: .*Easy Access Server.*
. This will try to match Easy Access Server
as a substring.
Now, to the original question. I've only found this library today so correct me if I'm wrong. The IsEmpty
property does not appear to be set anywhere in the code. However, language accepted by a DFA is empty if and only if there are no accepting states reachable from the initial state. The Automaton.GetAcceptStates()
method will give you the set of reachable accepting states. You can easily check whether the intersection of 2 regular expressions is empty using this:
bool IsIntersectionEmpty(RegExp first, RegExp second)
{
var firstAutomaton = first.ToAutomaton();
var secondAutomaton = second.ToAutomaton();
return firstAutomaton.Intersection(secondAutomaton).GetAcceptStates().Count == 0;
}
As a note, you don't have to explicitly build the intersection DFA. You can just traverse states of the product automaton.
@trylock, very nice! Yes, besides Xeger there's also all the other great stuff from dk.brics.automaton that's already exposed in this library, e.g. ToAutomaton
.
In fact, Xeger does not support all valid Java regular expressions. The full set of what is defined here and is summarized below:
regexp ::= unionexp
|
unionexp ::= interexp | unionexp (union) | interexp
interexp ::= concatexp & interexp (intersection) [OPTIONAL] | concatexp
concatexp ::= repeatexp concatexp (concatenation) | repeatexp
repeatexp ::= repeatexp ? (zero or one occurrence)
| repeatexp * (zero or more occurrences)
| repeatexp + (one or more occurrences)
| repeatexp {n} (n occurrences) | repeatexp {n,} (n or more occurrences) | repeatexp {n,m} (n to m occurrences, including both)
| complexp
complexp ::= ~ complexp (complement) [OPTIONAL] | charclassexp
charclassexp ::= [ charclasses ] (character class)
| [^ charclasses ] (negated character class)
| simpleexp
charclasses ::= charclass charclasses
| charclass
charclass ::= charexp - charexp (character range, including end-points) | charexp
simpleexp ::= charexp
| . (any single character)
| # (the empty language) [OPTIONAL] | @ (any string) [OPTIONAL] | " <Unicode string without double-quotes> " (a string)
| ( ) (the empty string)
| ( unionexp ) (precedence override)
| < <identifier> > (named automaton) [OPTIONAL] | <n-m> (numerical interval) [OPTIONAL] charexp ::= <Unicode character> (a single non-reserved character)
| \ <Unicode character> (a single character)
I am trying to write a method that will detect if there is NO intersection between 2 regular expressions (as in this answer on stack overflow for java: https://stackoverflow.com/a/17957180/2987400 )
but it never suggests that the intersection is empty.
Here is my code:
and some a unit test that fails when I expect it to pass:
Any help on this would be greatly appreciated