Closed MegaIng closed 1 year ago
Oh, and also this a somewhat breaking change if Fsm
is considered a public interface. I decided that it's not really worth trying to have a fixup layer inside of __post_init__
I can't look at the rest of this PR right now, but to answer your question, no, Fsm
is not part of greenery
's public interface, as of version 4. (If it has some use independently of greenery
then I might consider spinning it off as a separate dedicated package, but that's a long-term future question.)
Well, if this implementation Fsm
is performant enough I wouldn't mind using it in interegular
myself. (although I currently still require compatibility back to 3.6...)
I assume Fsm
not being public means that the architectural change required for this implementation is ok? Then I would continue to work on this.
Finally found a little time to give this PR a proper look.
I think there might be a way to do this a little more simply. In my head the way we would implement this feature is to leave the Fsm
class mostly unmodified, but the "alphabet" of "symbols" it uses for transitions cease to contain individual characters (i.e. strings of length 1) and start being instances of Charclass
- or even some generic type (not sure of the precise terminology here). The logic in to_fsm
and from_fsm
would need to be modified to construct that "alphabet" of Charclass
instances, which would minimally partition the space of all possible Unicode characters. Pattern
's method matches
would also change to map any given input character to the correct Charclass
in that "alphabet". If we went in this direction, I think the need for Fsm.via_symbols
might go away, potentially along with the need for AnythingElse
.
Some potential hitches are that the behaviour of Fsm.__len__
and Fsm.strings
would change. My idea here would be to move that logic up to Pattern
instead.
I feel that what I've outlined above is fairly achievable and can be executed in discrete phases. @MegaIng what do you think of this plan? This would unfortunately potentially involve jettisoning most or all of your PR, which I tend to dislike doing unilaterally... Do you think your approach would be better?
I do mostly agree. My original motivation was to keep it possible to use the FSM
class directly without caring to much about how the alphabet looks like on the inside.
However, this gets quite complicated, more complicated than I am comfortable with.
OTOH, having only the few changes that you are suggesting means we should make sure that the alphabets of FSMs that are acted on in parallel are exactly the same, and .accepts
has to be slightly reworked. But since FSM is an internal class only, this should be fine.
The one interesting case is ofcourse .strings()
(and .__len__
, although that one can still work) which you also noticed. I assume you consider a change in behavior unacceptable.
I am willing to do this work, I would open up a new PR.
If you are able to look into this approach I would be very happy to review that PR. I was considering doing this work myself but I don't know if I could commit to any kind of timeframe.
So, to start with, would you be able to try moving the logic of Fsm.__iter__
/Fsm.strings
and Fsm.__len__
/Fsm.cardinality
up from Fsm
to the Pattern
methods of the same name? The Fsm
methods and their tests will disappear as they would no longer be required... this is not something I have a problem with. The behaviour of the Pattern
-level methods should be unchanged at this initial stage.
Hmm, this would mean that Pattern
access all the internals of FSM, and basically reimplements the logic of "accepts". While certainly an option, I am not sure if it's a good one.
I will create a PR that does that and we can see how it looks.
This is work towards fixing #81 . Most of the work for this I was doing for interegular anyway already, so it was quite a bit of copy/paste. (Note that the
Alphabet
class doesn't implement ranges right now, so it right now doesn't fix #81 at all).The core idea behind this rework is to differentiate between the inputs and the transitions by splitting up the two roles string played till now into
symbol
andevent
. TheAlphabet
class takes in symbols and maps them to events. The core functionally of this class is however the.union
operation that takes in multipleAlphabet
s and joins them together in an efficient way so that the number of events needed is minimal. Depending on the original regex(es) this can improve performance of some operations by an order of magnitude.I am not sure about the name
event
. If someone has a better idea, I am all ears. I also consideredtransition
orkey
ortransition_key
, but they all don't seem any better IMO.A lot of these changes are renaming variables or changing
Fsm(
toFsm.via_symbols(
, so this PR looks a lot larger than it actually is.TODO:
Alphabet
(based on existing convention inalphabet_test.py
Alphabet
generation into the various patternsAlphabet
to actually fix #81 (this potentially means adding another new classSymbolList
or similar)I would like some general feedback if the current path is something that has a chance to be merged.