Closed qntm closed 11 months ago
Future work: it might be possible to simply eliminate the concept of a "negated character class". Instead of `~Charclass("a")`, which internally stores `ord_ranges` of `((97, 97),)` and a `self.negated` flag, we could internally store `ord_ranges` of `((0, 96), (98, 1114111))` and scrap the flag. This could potentially simplify a great deal of logic.
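To illustrate the idea, here is a minimal sketch of complementing a sorted list of inclusive ord ranges over the whole Unicode code space. `negate_ord_ranges` and `MAX_ORD` are hypothetical names used only for this example; this is not greenery's actual implementation.

```python
MAX_ORD = 0x10FFFF  # 1114111, the highest Unicode code point


def negate_ord_ranges(ord_ranges):
    """Return the complement of `ord_ranges` within [0, MAX_ORD].

    Assumes `ord_ranges` is a sorted sequence of non-overlapping,
    inclusive (lo, hi) pairs.
    """
    result = []
    next_lo = 0  # first code point not yet accounted for
    for lo, hi in ord_ranges:
        if lo > next_lo:
            # Everything between the previous range and this one is
            # outside the original set, so it belongs to the complement.
            result.append((next_lo, lo - 1))
        next_lo = hi + 1
    if next_lo <= MAX_ORD:
        result.append((next_lo, MAX_ORD))
    return tuple(result)


negated_a = negate_ord_ranges(((97, 97),))  # the ranges for "not a"
```

With this representation a negated class is just another tuple of ranges, and negating twice gives back the original ranges, so no separate flag is needed.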
Part of #81.
- `Fsm` no longer tolerates missing states or transitions - every state, including an "oblivion" state if desired, must be explicitly part of the set of states provided, and every transition must also be provided. It no longer raises `OblivionError`s or tolerates `ANYTHING_ELSE`. This was done to make the rest of this logic less agonisingly painful.
- `Fsm` now requires that states always be integers and that its alphabet consist of `Charclass`es - it no longer tolerates strings or other values as symbols. This had some ramifications for testing. Note that this rules out `ANYTHING_ELSE` as a possible symbol - `ANYTHING_ELSE` has been removed entirely.
- `Fsm` is no longer intended to serve any generic finite state machine functionality and is instead specifically dedicated to handling strings for regular expressions.
- The relationship between `Fsm` and `Charclass` has been inverted. Previously, `Charclass` had a `to_fsm` method. Now, `Fsm` has a `from_charclass` static function.
- `Fsm.derive`, `Fsm.accepts` and `Fsm.strings` still accept/return strings (Python values of type `str`), not sequences of `Charclass`es.
- `Fsm` now additionally requires that its alphabet of `Charclass`es fully partition the space of all possible Unicode characters. This means that instead of `ANYTHING_ELSE`, it requires some kind of negated `Charclass`. A sample alphabet is `(Charclass("a"), Charclass("b"), ~Charclass("ab"))`.
- Because every `Fsm` essentially has the same "alphabet", we no longer need to gather or unify the set of all in-use characters from regular expression elements when constructing those `Fsm`s. All of those `alphabet()` methods are now gone.
- `epsilon(alphabet)` and `null(alphabet)` can simply become constants `EPSILON` and `NULL`.
- There is a new `Charclass` function, `repartition`, for rewriting those alphabets of `Charclass`es, and `Fsm` has a new method `replace_alphabet` - this is used during manipulations in order to unify alphabets among disparate `Fsm`s and make it possible to combine them with relative ease.

All of the above makes it possible for an `Fsm` to operate over a relatively large collection of characters while using only a relatively small collection of individual `Charclass` symbols. For example, `[\t\n\r -\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]` is now a single symbol, and an `Fsm` making use of it will have a single transition for that symbol.

So far so good. All of that paves the way for the next part:
- `Charclass` no longer stores a collection of characters. Instead it stores a list of "ord ranges", which are inclusive ranges of Unicode character numbers. `Charclass("abce")`, for example, stores `((97, 99), (101, 101))`. Some sophisticated, moderately efficient methods `negate` and `add_ord_range` have been added to make it possible to sanely manage large collections of these ranges, maintaining their sequence, merging or separating them when appropriate.
- `Charclass` is much simplified.
- […] `Charclass`es is also modified slightly.
- We no longer access `charclass.chars` or `charclass.ord_ranges` directly - instead we use new helper methods `get_chars`, `num_chars` and `accepts` to determine what the `Charclass` has inside of it.
- Patterns such as `[1a\\D]` do still work.

All of this in turn means that a `Charclass` like `[\t\n\r -\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]` no longer instantiates a `chars` collection with over 1,000,000 individual characters in it. Instead it maintains just a few ranges internally, and it can be combined with other `Charclass`es relatively efficiently.

This was a total nightmare taking multiple solid days of work. I decided there was no way to do this piecemeal; it had to be done all in one shot. I'm likely to spend a little while longer looking over this code to see if it can be improved, and I expect folks might want to lint it a little. There may be lingering performance hangups for these nasty cases, but I tackled all the obvious stuff.
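The "ord ranges" storage described above can be sketched roughly as follows. This is a standalone illustration under assumed names (`add_ord_range_sketch`, `ord_ranges_from_chars` and `count_chars` are hypothetical helpers), not the code in this change: it inserts an inclusive range into a sorted list, merging overlapping or adjacent ranges as it goes.

```python
def add_ord_range_sketch(ord_ranges, new_range):
    """Insert inclusive (lo, hi) into a sorted, non-overlapping list of
    ranges, merging any ranges that overlap or touch the new one."""
    lo, hi = new_range
    merged = []
    placed = False
    for cur_lo, cur_hi in ord_ranges:
        if cur_hi < lo - 1:
            # Entirely before the new range and not adjacent: keep as-is.
            merged.append((cur_lo, cur_hi))
        elif cur_lo > hi + 1:
            # Entirely after: emit the (possibly grown) new range first.
            if not placed:
                merged.append((lo, hi))
                placed = True
            merged.append((cur_lo, cur_hi))
        else:
            # Overlapping or adjacent: absorb into the new range.
            lo = min(lo, cur_lo)
            hi = max(hi, cur_hi)
    if not placed:
        merged.append((lo, hi))
    return merged


def ord_ranges_from_chars(chars):
    """Build ord ranges from individual characters, in any order."""
    ranges = []
    for ch in chars:
        ranges = add_ord_range_sketch(ranges, (ord(ch), ord(ch)))
    return tuple(ranges)


def count_chars(ord_ranges):
    """Number of code points covered, without enumerating them."""
    return sum(hi - lo + 1 for lo, hi in ord_ranges)


# The large class from the text, built from six ranges.
big = []
for r in [(0x09, 0x09), (0x0A, 0x0A), (0x0D, 0x0D),
          (0x20, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]:
    big = add_ord_range_sketch(big, r)
```

Here `ord_ranges_from_chars("abce")` yields `((97, 99), (101, 101))`, matching the example above, and `big` ends up as five ranges covering over a million code points, which `count_chars` can total without ever materialising them.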
The public API of `greenery` is unchanged. This is essentially a performance uplift.
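For readers wondering what "an alphabet that fully partitions the space of all possible Unicode characters" might look like in terms of ord ranges, here is a rough sketch of one way to repartition several classes into disjoint blocks. `repartition_sketch` is a hypothetical function written for this note; it is not claimed to match how `repartition` is actually implemented.

```python
from collections import defaultdict

MAX_ORD = 0x10FFFF  # highest Unicode code point


def _contains(ord_ranges, o):
    """True if code point `o` falls in any inclusive (lo, hi) range."""
    return any(lo <= o <= hi for lo, hi in ord_ranges)


def repartition_sketch(classes):
    """Split [0, MAX_ORD] into disjoint blocks such that every input
    class is exactly a union of blocks. Each class and each block is a
    tuple of inclusive (lo, hi) ranges."""
    # Membership can only change at a range start or just past a range end.
    cuts = {0, MAX_ORD + 1}
    for ord_ranges in classes:
        for lo, hi in ord_ranges:
            cuts.add(lo)
            cuts.add(hi + 1)
    cuts = sorted(cuts)

    # Group elementary intervals by which input classes contain them.
    blocks = defaultdict(list)
    for lo, nxt in zip(cuts, cuts[1:]):
        hi = nxt - 1
        signature = tuple(_contains(r, lo) for r in classes)
        block = blocks[signature]
        if block and block[-1][1] == lo - 1:
            block[-1] = (block[-1][0], hi)  # extend an adjacent range
        else:
            block.append((lo, hi))
    return sorted(tuple(b) for b in blocks.values())


# Classes "a" and "ab" repartition into "a", "b" and "everything else".
alphabet = repartition_sketch([((97, 97),), ((97, 98),)])
```

The block whose signature is all `False` plays the role that `ANYTHING_ELSE` used to play: it is simply the negated class covering every code point no input class mentions.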