qntm / greenery

Regular expression manipulation library
http://qntm.org/greenery
MIT License
311 stars 40 forks source link

Stop instantiating character ranges #99

Closed qntm closed 11 months ago

qntm commented 11 months ago

Part of #81.

All of the above makes it possible for an Fsm over a relatively large collection of characters to do so by making use of only a relatively small collection of individual Charclass symbols. For example, [\t\n\r -\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF] is now a single symbol, and an Fsm making use of it will have a single transition for that symbol.

So far so good. All of that paves the way for the next part:

All of this in turn means that a Charclass like [\t\n\r -\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF] no longer instantiates a chars collection with over 1,000,000 individual characters in it. Instead it maintains just a few ranges internally, and it can be combined with other Charclasses relatively efficiently.

This was a total nightmare taking multiple solid days of work. I decided there was no way to do this piecemeal, it had to be done all in one shot. I'm likely to spend a little while longer looking over this code to see if it can be improved, and I expect folks might want to lint it a little. There may be lingering performance hangups for these nasty cases, but I tackled all the obvious stuff.

The public API of greenery is unchanged. This is essentially a performance uplift.

qntm commented 11 months ago

Future work: it might be possible to simply eliminate the concept of a "negated character class". Instead of ~Charclass("a"), which internally stores ord_ranges of ((97, 97),) and a self.negated flag, we could internally store ord_ranges of ((0, 96), (98, 1114111)) and scrap the flag. This could potentially simplify a great deal of logic.