qntm / greenery

Regular expression manipulation library
http://qntm.org/greenery
MIT License
311 stars 40 forks source link

fsm: add translation layer between symbol and transition #82

Closed MegaIng closed 1 year ago

MegaIng commented 1 year ago

This is work towards fixing #81 . Most of the work for this I was doing for interegular anyway already, so it was quite a bit of copy/paste. (Note that the Alphabet class doesn't implement ranges right now, so it right now doesn't fix #81 at all).

The core idea behind this rework is to differentiate between the inputs and the transitions by splitting up the two roles string played till now into symbol and event. The Alphabet class takes in symbols and maps them to events. The core functionally of this class is however the .union operation that takes in multiple Alphabets and joins them together in an efficient way so that the number of events needed is minimal. Depending on the original regex(es) this can improve performance of some operations by an order of magnitude.

I am not sure about the name event. If someone has a better idea, I am all ears. I also considered transition or key or transition_key, but they all don't seem any better IMO.

A lot of these changes are renaming variables or changing Fsm( to Fsm.via_symbols(, so this PR looks a lot larger than it actually is.

TODO:

I would like some general feedback if the current path is something that has a chance to be merged.

MegaIng commented 1 year ago

Oh, and also this a somewhat breaking change if Fsm is considered a public interface. I decided that it's not really worth trying to have a fixup layer inside of __post_init__

qntm commented 1 year ago

I can't look at the rest of this PR right now, but to answer your question, no, Fsm is not part of greenery's public interface, as of version 4. (If it has some use independently of greenery then I might consider spinning it off as a separate dedicated package, but that's a long-term future question.)

MegaIng commented 1 year ago

Well, if this implementation Fsm is performant enough I wouldn't mind using it in interegular myself. (although I currently still require compatibility back to 3.6...)

I assume Fsm not being public means that the architectural change required for this implementation is ok? Then I would continue to work on this.

qntm commented 1 year ago

Finally found a little time to give this PR a proper look.

I think there might be a way to do this a little more simply. In my head the way we would implement this feature is to leave the Fsm class mostly unmodified, but the "alphabet" of "symbols" it uses for transitions cease to contain individual characters (i.e. strings of length 1) and start being instances of Charclass - or even some generic type (not sure of the precise terminology here). The logic in to_fsm and from_fsm would need to be modified to construct that "alphabet" of Charclass instances, which would minimally partition the space of all possible Unicode characters. Pattern's method matches would also change to map any given input character to the correct Charclass in that "alphabet". If we went in this direction, I think the need for Fsm.via_symbols might go away, potentially along with the need for AnythingElse.

Some potential hitches are that the behaviour of Fsm.__len__ and Fsm.strings would change. My idea here would be to move that logic up to Pattern instead.

I feel that what I've outlined above is fairly achievable and can be executed in discrete phases. @MegaIng what do you think of this plan? This would unfortunately potentially involve jettisoning most or all of your PR, which I tend to dislike doing unilaterally... Do you think your approach would be better?

MegaIng commented 1 year ago

I do mostly agree. My original motivation was to keep it possible to use the FSM class directly without caring to much about how the alphabet looks like on the inside.

However, this gets quite complicated, more complicated than I am comfortable with.

OTOH, having only the few changes that you are suggesting means we should make sure that the alphabets of FSMs that are acted on in parallel are exactly the same, and .accepts has to be slightly reworked. But since FSM is an internal class only, this should be fine.

The one interesting case is ofcourse .strings() (and .__len__, although that one can still work) which you also noticed. I assume you consider a change in behavior unacceptable.

I am willing to do this work, I would open up a new PR.

qntm commented 1 year ago

If you are able to look into this approach I would be very happy to review that PR. I was considering doing this work myself but I don't know if I could commit to any kind of timeframe.

So, to start with, would you be able to try moving the logic of Fsm.__iter__/Fsm.strings and Fsm.__len__/Fsm.cardinality up from Fsm to the Pattern methods of the same name? The Fsm methods and their tests will disappear as they would no longer be required... this is not something I have a problem with. The behaviour of the Pattern-level methods should be unchanged at this initial stage.

MegaIng commented 1 year ago

Hmm, this would mean that Pattern access all the internals of FSM, and basically reimplements the logic of "accepts". While certainly an option, I am not sure if it's a good one.

I will create a PR that does that and we can see how it looks.