kylebgorman / pynini

Read-only mirror of Pynini
http://pynini.opengrm.org
Apache License 2.0

Question: support for direct string-based regex->fst support #4

Closed: AdolfVonKleist closed this issue 4 years ago

AdolfVonKleist commented 6 years ago

I've been playing with this a bit and really liking it.

One thing I am curious about, and cannot seem to find in the docs or CLI help, is whether it is possible to compile a compliant regular expression directly into an FSA.

I'm thinking something like:

import pynini

regex_fsa = pynini.acceptor("^xyZ[1-3]+4$")
# Now regex_fsa can be used to match strings or compiled into a cascade

Is this possible without manually building up the individual expressions?

Apologies if this was better sent to the TWiki.

kylebgorman commented 6 years ago

No, there's no support for that kind of regexp syntax, at least not yet. We could do a subset of that (basically everything except backreferences) though.

The one thing to note is that the edge delimiters "^" and "$" don't exist in our world, though you can simulate them, say by prepending "^" and appending "$" to your input strings too, and CDRewrite supports [BOS] and [EOS].
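
The point about edge delimiters can be illustrated outside Pynini. The sketch below is a plain-Python analogy built on the standard `re` module, not Pynini API. An acceptor either accepts a whole string or rejects it, which corresponds to `re.fullmatch`, so "^" and "$" add nothing. The second half shows the workaround of wrapping the input in literal edge markers. The helper names (`accepts`, `with_edge_markers`, `accepts_marked`) are made up for illustration.

```python
import re

# Plain-Python analogy only; nothing here is Pynini API.
# The pattern from the question, with the "^" and "$" anchors dropped.
BARE = re.compile(r"xyZ[1-3]+4")

# The same pattern with "^" and "$" as ordinary, escaped symbols.
MARKED = re.compile(r"\^xyZ[1-3]+4\$")


def accepts(text):
    """Whole-string acceptance, the way an acceptor treats its input."""
    return BARE.fullmatch(text) is not None


def with_edge_markers(text):
    """Wrap the input in literal "^" and "$" so the edges become symbols."""
    return "^" + text + "$"


def accepts_marked(text):
    """Accept the marked input against the marked pattern."""
    return MARKED.fullmatch(with_edge_markers(text)) is not None


if __name__ == "__main__":
    for candidate in ("xyZ1234", "xyZ4", "axyZ14"):
        print(candidate, accepts(candidate), accepts_marked(candidate))
```

Both functions agree on every input, which is the sense in which the anchors are redundant once matching is defined over the whole string.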


jmcotelo commented 4 years ago

I've been working on a project that parses a subset of regex syntax and builds an FSA from it. I've managed to convert character classes (\w, \d, \s, etc.) by using string_map over Unicode codepoints.

My main concern is the common /./. Is there another way of building an acceptor that consumes any character without relying on the same approach? Making a string_map over all possible codepoints seems a bit excessive to me.

Cheers, and I love this project :)

kylebgorman commented 4 years ago

Short answer, no, not without having some finite set $\Sigma$ that '.' quantifies over. You could just restrict yourself to bytes, and then that will work 99% of the time?

Thanks for your input and positive feedback! ;)
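
To make the byte suggestion concrete, here is a small plain-Python sketch (standard library only, not Pynini API) of what a byte-level $\Sigma$ looks like and where it stops matching the usual meaning of '.'. It assumes UTF-8 input and leaves out byte value 0, since label 0 is epsilon in OpenFst. The names `BYTE_SIGMA`, `byte_symbols`, and `dots_needed` are hypothetical.

```python
import sys

# Assumption: inputs are UTF-8. Label 0 is epsilon in OpenFst, so the
# byte-level alphabet sketched here is the 255 values 1..255.
BYTE_SIGMA = frozenset(range(1, 256))

# Size of a codepoint-level alphabet, for comparison.
CODEPOINT_COUNT = sys.maxunicode + 1


def byte_symbols(text):
    """Return the UTF-8 byte values that a byte-level FSA would read."""
    return list(text.encode("utf-8"))


def dots_needed(text):
    """Number of byte-level '.' steps needed to consume the text."""
    return len(byte_symbols(text))


if __name__ == "__main__":
    print(len(BYTE_SIGMA), "byte symbols vs", CODEPOINT_COUNT, "codepoints")
    for char in ("a", "\u00e9", "\u20ac"):
        print(repr(char), byte_symbols(char), dots_needed(char))
```

For ASCII text one byte-level '.' consumes exactly one character. A non-ASCII character spans several bytes, so a single '.' consumes only part of it, which is presumably the remaining 1% of cases.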
