Closed: anderleich closed this issue 3 years ago.
There's no notion of path length in OpenFst or Pynini, so there's no notion of longest match. Even "shortest path" is based not on the length of the path but on the weights of the path. You probably want to attach weights to the elements of your string_map (the optional third element of each tuple is a weight) to favor one match over another.
Alternatively, you could use one_top_rewrite; use top_rewrite (which only gives you one but uses an implementation-defined way to resolve ties, which you may or may not like); or use rewrites (which gives you a list of all possible rewrites).
PS: I don't fully understand the snippet here since I can't infer the definition of _normalize. (Is that a typo for _lambda?)
Ok, thanks! I gave negative weights according to word length and now I get the longest match.
One more thing. If the string_map is long, say 1000 words, cdrewrite is slow and needs a lot of RAM. Is there a way to improve performance? Even if I read them from a file...
Yes, it is a typo for _lambda.
Great, I'd recommend something like that.
This is a known efficiency issue with (any) context-dependent rewrite formalism. The cdrewrite transducer is built from several component automata, all of whose construction is itself reasonably efficient, except that the rules which introduce "markers" (i.e., delimit where the rule begins and ends) require determinization of the input projection of the acceptors. (Here begins a bit from our in-production book on Pynini, coming out this summer.)
Consider a transducer which inserts a marker # after the regular language $L$; this corresponds loosely to several of the constituent transducers used by the Mohri & Sproat construction. In order to do this we need to construct a finite transducer that allows any sequence of characters in $\Sigma^*$, possibly including previous instances of $L$, and inserts # after $L$. Such a transducer is obtained by inserting an arc with an $\epsilon$ input label and a # output label after the final state(s) of $L$ and marking the new destination state final. This requires that the input acceptor be deterministic so that for any string $l \in L$, there is but one path through the automaton; otherwise the string could bypass insertion of #. As mentioned above, determinization can be quite expensive, particularly if $\Sigma$ is large or if $L$ is complex. So, for example, if one would like to write a rule that inserts a given word after a set of words, and this latter set of words numbers in the tens of thousands, then the compilation of said rule will be somewhat slow and the resulting FST quite large. Fortunately there is usually a way around such problems. In the hypothetical example above, one could arrange for all words in one's list of interest to be pre-tagged already with some marker, and condition the rule's environment on that marker alone. This will result in more efficient compilation and application, and produce a more compact rule transducer.
I see...
My goal is to normalize numbers, abbreviations, and acronyms in arbitrary text strings, such as:
Today is 25th of January --> Today is twenty fifth of January.
I already have some transducers for each task. These, however, only consider valid input alone, not random strings. That's why I did this:
normalizer = (
pynini.cdrewrite(_num2str, "", "", sigma_star)
@ pynini.cdrewrite(_normalize_abbreviations, "", "", sigma_star)
@ pynini.cdrewrite(_normalize_acronyms, "", "", sigma_star)
).optimize()
The abbreviations map is 1000 words long and the acronyms map 100 words long. I expect these to grow over time.
As you mentioned, this is rather slow with so many words. Is there any better way to achieve my goal?
If it takes a while but the resulting transducer is still useful for your purposes, I would just compute it offline, taking the one-time hit for determinization, and serialize the results, instead of trying to optimize the construction itself. There's an included library pynini.export
which gives you nice decorators for storing a collection of FSTs this way.
Sorry, I don't see how to do it. Do you mean saving the FST to file?
Yes. Just do the computation offline and write to disk. FST objects have a .write method, or you can group a collection of them together and store them in a FAR file.
Ok, great. How would I process a random string after that? Read the FST from file and then proceed as usual?
Yes, that's right. No information will be lost.
Thanks! It works!
This is what I've managed to build to expand email addresses.
Why is it taking so much time and memory? The string_maps are short. Are the closures causing it?
import string

import pynini
from pynini.lib import byte, pynutil

_az = pynini.union(*string.ascii_lowercase)
_AZ = pynini.union(*string.ascii_uppercase)
_digits = pynini.union(*string.digits)
_punctuation = pynini.string_map([
    ("_", "underscore"),
    ("-", "dash"),
    ("+", "plus"),
    (".", "dot"),
])
_punctuation = pynutil.insert(" ") + _punctuation + pynutil.insert(" ")
_alphanum = pynini.union(_az, _AZ, _digits)
_local_part = _alphanum + pynini.union(_alphanum, _punctuation).closure(0, 63)
_domain_part = pynini.union(_alphanum, pynini.string_map([("-", " dash "), (".", " dot ")])).closure()
# _domain_part_star = pynutil.join(_domain_part, pynini.cross(".", " dot ")).optimize()
_at = pynutil.insert(" ") + pynini.cross("@", "at") + pynutil.insert(" ")
_email = (_local_part + _at + _domain_part).optimize()
sigma_star = pynini.closure(byte.BYTE)
_email_star = pynini.cdrewrite(_email, "", "", sigma_star).optimize()
I'd guess 0, 63. That's going to give you a huge automaton. Why not just replace that with *?
That was it, thanks!
So it's better not to constrain local_part to a maximum of 64 characters, right?
If it's not absolutely essential to your application and you don't want to wait for cdrewrite compilation to terminate, that's what I'd do.
Hi,
I'm facing some trouble while expanding acronyms. Let's say I have the following code:
I get:
Any clues? How can I get the longest match?