Some data storage systems store records in name order, making it efficient to retrieve records with names in a given range. When filtering such records by regular expression, it might therefore be useful to have a way to generate a range in which all the results are guaranteed lie, to use as a pre-filter.
For example, any match of the regular expression (moomin|troll)+ must lie in the range ("moomin", "trolltrolm"). Note that the last character of the upper bound has been 'rounded up' to ensure that every match, no matter how long, lies before it.
This should be fairly straightforward to implement for a DFA, and only slightly more complex for a NFA (assuming it's been trimmed of any states that do not lead to a match). The interface will need to specify the maximum length of the bounds. The ordering can be lexicographic order on the byte encodings, which in the case of UTF8 also conveniently matches the ordering on Unicode code points. Still, there are a couple of potential UTF8 niggles:
the maximum length might lie inside a multi-byte character
rounding a truncated upper bound to the next byte might result in an invalid byte sequence
The easiest approach is probably to remain encoding-agnostic and simply ignore this, returning byte sequence bounds and letting the client library deal with converting those to valid strings.
This is really interesting, I'm not familiar with the theory here. Can you explain how you find the string trolltrolm from the minimal DFA for your example please?
Some data storage systems store records in name order, making it efficient to retrieve records with names in a given range. When filtering such records by regular expression, it might therefore be useful to have a way to generate a range in which all the results are guaranteed lie, to use as a pre-filter.
For example, any match of the regular expression
(moomin|troll)+
must lie in the range ("moomin", "trolltrolm"). Note that the last character of the upper bound has been 'rounded up' to ensure that every match, no matter how long, lies before it.This should be fairly straightforward to implement for a DFA, and only slightly more complex for a NFA (assuming it's been trimmed of any states that do not lead to a match). The interface will need to specify the maximum length of the bounds. The ordering can be lexicographic order on the byte encodings, which in the case of UTF8 also conveniently matches the ordering on Unicode code points. Still, there are a couple of potential UTF8 niggles:
The easiest approach is probably to remain encoding-agnostic and simply ignore this, returning byte sequence bounds and letting the client library deal with converting those to valid strings.