qntm / greenery

Regular expression manipulation library
http://qntm.org/greenery
MIT License
311 stars 40 forks source link

Allow faster `Charclass` instatiation #109

Closed MegaIng closed 1 month ago

MegaIng commented 1 month ago

I am finally getting around to interegular using greenery.fsm.Fsm (thank you for exposing that!), but I am running into a few issues with the Charclass.

Specifically, it's doing way to much work. I would like some way to tell it "this data is valid, don't try to hard". I encountered this a few times when transitioning, but it was never an issue, it just was a few unnecessary chr calls where I already had the ordinals.

However, I now want to create charclasses for all unicode characters based on the category they belong to:

def charclass_by_category(self) -> dict[str, Charclass]:
    out = {}
    for i in range(0x101000):
        c = chr(i)
        cat = unicodedata.category(c)
        if cat not in out:
            out[cat] = [c]
        else:
            out[cat].append(c)
    return {cat: Charclass(''.join(cs)) for cat, cs in out.items()}

This version takes forever because charclass.add_ord_range gets called for every character, which then rechecks all existing ranges again, despite all the new ranges only containing a single character and coming at the end of the existing ranges. This could be optimized by special casing str being passed to Charclass by sorting it and just directly constructing the ranges. This should have noticeable speed benefits for almost all users, since even just a call Charclass('abcdefghijklmnopqrstuvxyz') would benefit.

If I optimize the above to this version:

def charclass_by_category(self) -> dict[str, Charclass]:
    out = {}
    last = None
    for i in range(0x101000):
        c = chr(i)
        cat = unicodedata.category(c)
        if last is None:
            last = cat, (c, c)
        elif last[0] == cat:
            last = cat, (last[1][0], c)
        else:
            if last[0] not in out:
                out[last[0]] = [last[1]]
            else:
                out[last[0]].append(last[1])
            last = cat, (c, c)
    return {cat: Charclass(tuple(cs)) for cat, cs in out.items()}

It at least completes acceptably quickly, but add_ord_range still basically tries to repeat all my work, making up a noticeable chunk of the runtime.

My preferred solution would probably be for Charclass to turn into a pure frozen dataclass, doing minimal, if any, verification of the arguments in __init__ (or __post_init__) , and then providing separate constructors classmethods for the various input formats (str/Iterable[str], Iterable[int], Iterable[tuple[str,str]], Iterable[tuple[int, int]]), but based on my reading, that isn't really your style of coding, so I would like to hear if you have any suggestions?

qntm commented 1 month ago

Fixed in greenery-4.2.2.