microsoft / Trill

Trill is a single-node query processor for temporal or streaming data.
MIT License
1.25k stars 133 forks source link

Fix zero or one when pattern contains loop back to start #144

Closed peterfreiling closed 3 years ago

peterfreiling commented 3 years ago

Currently, the Regex ZeroOrOne operator works by modifying the pattern's state machine to add its start state as another final state. This immediate final state covers the "zero" case. However, when the pattern's state machine has the potential to loop back to its start state, this can result in producing multiple matches solely for the "zero" case, which is incorrect. To fix this, ZeroOrOne will now detect when the pattern contains a loop back to start, and in this case, prepend a simple epsilon arc to the pattern, setting the starting state as zero, guaranteeing there is no loop back to this new start state. Taking the added test case in this PR as an example, consider the pattern "(A*F)", which results in a state machine of:

             "A"
           <----
(start) 0         1  ----> 2 (final)
           ---->      "F"
          epsilon

Before this fix, wrapping this pattern using a ZeroOrOne operator, i.e. "(A*F)?", simply marked the starting state as final, resulted in a state machine of:

                   "A"
                  <----
(start, final) 0         1  ----> 2 (final)
                  ---->       "F"
                 epsilon

E.g., with an input pattern of "OAAFO", this would result in matches at index 0 (correct), 1 (incorrect), 2 (incorrect), 3 (correct), and 4 (correct). With this fix, ZeroOrOne will detect these loops, and instead prepend an epsilon arc before the pattern, producing the correct output:

                             "A"
                            <----
(start, final ) 0  ----> 1         2  ----> 3 (final)
                  epsilon   ---->       "F"
                           epsilon
badrishc commented 3 years ago

Good catch. I guess the attempt to short circuit this by setting start to final, does not really work in the general case. Correctness first.