qntm / greenery

Regular expression manipulation library
http://qntm.org/greenery
MIT License
311 stars 40 forks source link

Sorting on alphabet set fails on unpickled fsm object #64

Closed oliverhaagh closed 2 years ago

oliverhaagh commented 2 years ago

Hi, I have found that many of the functions of the fsm class will fail on an unpickled fsm object. Below is an example of checking if one regex is a subset of another both before and after pickling and unpickling two fsm objects.

import pickle
from greenery import fsm, lego

r1 = "[A-Za-z0-9]{0,4}"
r2 = "[A-Za-z0-9]{0,3}"

l1: lego.lego = lego.parse(r1)
l2: lego.lego = lego.parse(r2)

f1: fsm.fsm = l1.to_fsm()
f2: fsm.fsm = l2.to_fsm()

if f2 < f1:
    print("r2 is a proper subset of r1")
else:
    print("r2 is NOT a proper subset of r1")

with open("/tmp/f1.bin", "wb") as f:
    pickle.dump(f1, f)

with open("/tmp/f2.bin", "wb") as f:
    pickle.dump(f2, f)

with open("/tmp/f1.bin", "rb") as f:
    f1_unpickled: fsm.fsm = pickle.load(f)

with open("/tmp/f2.bin", "rb") as f:
    f2_unpickled: fsm.fsm = pickle.load(f)

if f2_unpickled < f1_unpickled:
    print("r2 is a proper subset of r1")
else:
    print("r2 is NOT a proper subset of r1")

In the first if-statement it will correctly print out r2 is a proper subset of r1, but in the second one it will fail with the following traceback:

Traceback (most recent call last):
  File "/home/test/test/test.py", line 33, in <module>
    if f2_unpickled < f1_unpickled:
  File "/home/test/test/.venv/lib/python3.8/site-packages/greenery/fsm.py", line 615, in __lt__
    return self.ispropersubset(other)
  File "/home/test/test/.venv/lib/python3.8/site-packages/greenery/fsm.py", line 608, in ispropersubset
    return self <= other and self != other
  File "/home/test/test/.venv/lib/python3.8/site-packages/greenery/fsm.py", line 601, in __le__
    return self.issubset(other)
  File "/home/test/test/.venv/lib/python3.8/site-packages/greenery/fsm.py", line 594, in issubset
    return (self - other).empty()
  File "/home/test/test/.venv/lib/python3.8/site-packages/greenery/fsm.py", line 542, in __sub__
    return self.difference(other)
  File "/home/test/test/.venv/lib/python3.8/site-packages/greenery/fsm.py", line 539, in difference
    return parallel(fsms, lambda accepts: accepts[0] and not any(accepts[1:]))
  File "/home/test/test/.venv/lib/python3.8/site-packages/greenery/fsm.py", line 757, in parallel
    return crawl(alphabet, initial, final, follow).reduce()
  File "/home/test/test/.venv/lib/python3.8/site-packages/greenery/fsm.py", line 782, in crawl
    for symbol in sorted(alphabet, key=key):
TypeError: '<' not supported between instances of 'anything_else_cls' and 'str'

As can be seen it seems to be caused by not being able to sort the alphabet set because it cannot compare instances of anything_else_cls and str. I have found that casting the symbol variable to a string in the key function like below will fix the issue, but I don't know if it is the correct way to do it?

def key(symbol):
    '''Ensure `fsm.anything_else` always sorts last'''
    return (symbol is anything_else, str(symbol))
qntm commented 2 years ago

It looks like the fsm.anything_else constant does not survive pickling and unpickling intact. Here is a much shorter reproduction of the issue:

import pickle
from greenery import fsm

print(fsm.anything_else is pickle.loads(pickle.dumps(fsm.anything_else)))
# False

I don't know a whole lot about pickling in Python. What is the reason for doing this? Is this a showstopper problem for some reason? It may be fixable but I would need to think about it.

oliverhaagh commented 2 years ago

Hi @qntm thanks for the reply! Yes, that is indeed a much simpler way of showing the issue! My rationale for wanting to pickle and unpickle fsm objects is because I am trying to make a service where you can register regexes derived from abnf syntax definitions and check that they are subsets of other regexes derived from abnf syntax definitions. Because these regexes can be quite complicated the creation of fsm objects from these regexes can easily take a minute or even more, so in order to not having to create these fsm objects every time I restart the service I want to cache the objects in something like Redis or just on the filesystem which both requires that you are able to pickle and unpickle objects. I know that my use case is very specific, but I could imagine that other people might also have use cases where they would want to cache fsm objects.

Also, I just realized that my proposed change to the key function might actually not work as intended, as symbol is anything_else will still not evaluate to True which you of course want to make sure that anything_else is sorted last. A solution which I believe would work better is as follows:

 def key(symbol):
    '''Ensure `fsm.anything_else` always sorts last'''
    return (isinstance(symbol, anything_else_cls), str(symbol))

Again, I don't know if my use case might be too niche, but this is also just some food-for-thought.

qntm commented 2 years ago

This should be fixed in greenery 3.3.7.

oliverhaagh commented 2 years ago

That's awesome! Thank you very much! 💯