I am finally getting around to interegular using greenery.fsm.Fsm (thank you for exposing that!), but I am running into a few issues with the Charclass.
Specifically, it's doing way to much work. I would like some way to tell it "this data is valid, don't try to hard". I encountered this a few times when transitioning, but it was never an issue, it just was a few unnecessary chr calls where I already had the ordinals.
However, I now want to create charclasses for all unicode characters based on the category they belong to:
def charclass_by_category(self) -> dict[str, Charclass]:
out = {}
for i in range(0x101000):
c = chr(i)
cat = unicodedata.category(c)
if cat not in out:
out[cat] = [c]
else:
out[cat].append(c)
return {cat: Charclass(''.join(cs)) for cat, cs in out.items()}
This version takes forever because charclass.add_ord_range gets called for every character, which then rechecks all existing ranges again, despite all the new ranges only containing a single character and coming at the end of the existing ranges. This could be optimized by special casing str being passed to Charclass by sorting it and just directly constructing the ranges. This should have noticeable speed benefits for almost all users, since even just a call Charclass('abcdefghijklmnopqrstuvxyz') would benefit.
If I optimize the above to this version:
def charclass_by_category(self) -> dict[str, Charclass]:
out = {}
last = None
for i in range(0x101000):
c = chr(i)
cat = unicodedata.category(c)
if last is None:
last = cat, (c, c)
elif last[0] == cat:
last = cat, (last[1][0], c)
else:
if last[0] not in out:
out[last[0]] = [last[1]]
else:
out[last[0]].append(last[1])
last = cat, (c, c)
return {cat: Charclass(tuple(cs)) for cat, cs in out.items()}
It at least completes acceptably quickly, but add_ord_range still basically tries to repeat all my work, making up a noticeable chunk of the runtime.
My preferred solution would probably be for Charclass to turn into a pure frozen dataclass, doing minimal, if any, verification of the arguments in __init__ (or __post_init__) , and then providing separate constructors classmethods for the various input formats (str/Iterable[str], Iterable[int], Iterable[tuple[str,str]], Iterable[tuple[int, int]]), but based on my reading, that isn't really your style of coding, so I would like to hear if you have any suggestions?
I am finally getting around to
interegular
usinggreenery.fsm.Fsm
(thank you for exposing that!), but I am running into a few issues with theCharclass
.Specifically, it's doing way to much work. I would like some way to tell it "this data is valid, don't try to hard". I encountered this a few times when transitioning, but it was never an issue, it just was a few unnecessary
chr
calls where I already had the ordinals.However, I now want to create charclasses for all unicode characters based on the category they belong to:
This version takes forever because
charclass.add_ord_range
gets called for every character, which then rechecks all existing ranges again, despite all the new ranges only containing a single character and coming at the end of the existing ranges. This could be optimized by special casingstr
being passed toCharclass
by sorting it and just directly constructing the ranges. This should have noticeable speed benefits for almost all users, since even just a callCharclass('abcdefghijklmnopqrstuvxyz')
would benefit.If I optimize the above to this version:
It at least completes acceptably quickly, but
add_ord_range
still basically tries to repeat all my work, making up a noticeable chunk of the runtime.My preferred solution would probably be for
Charclass
to turn into a pure frozendataclass
, doing minimal, if any, verification of the arguments in__init__
(or__post_init__
) , and then providing separate constructorsclassmethod
s for the various input formats (str
/Iterable[str]
,Iterable[int]
,Iterable[tuple[str,str]]
,Iterable[tuple[int, int]]
), but based on my reading, that isn't really your style of coding, so I would like to hear if you have any suggestions?