kylebgorman / pynini

Read-only mirror of Pynini
http://pynini.opengrm.org
Apache License 2.0
122 stars 27 forks source link

Multiple top rewrites found #41

Closed anderleich closed 3 years ago

anderleich commented 3 years ago

Hi,

I facing some trouble while expanding acronyms. Let's say I have the following code:

# sigma_star is: a-zA-Z0-9 + punctuation
_lambda = pynini.string_map([("AE", "a e"), ("AEB", "a e be")])
_acronyms = pynini.cdrewrite(_normalize, "", "", sigma_star).optimize()
rewrite.one_top_rewrite("AE", _acronyms)

I get:

pynini.lib.rewrite.Error: Multiple top rewrites found: a e and a e be (weight: 0)

Any clues? How can I get the longest match?

kylebgorman commented 3 years ago

There's no notion of path length in OpenFst or Pynini, so there's no notion of longest match. Even "shortest path" is not based on the length of the path but rather the weights of a path. You probably want to attach weights to the elements of your string_map (the optional third argument in each tuple is a weight) to favor one or the other.

Alternatively, you could:

  1. Don't use one_top_rewrite, use top_rewrite (which only gives you one but uses an implementation-defined way to resolve ties which you may or may not like)
  2. Don't use one_top_rewrite, use rewrites (which gives you a list of all possible rewrites)

PS: I don't fully understand the snippet here since I can't infer the definition of _normalize. (Is that a typo for _lambda?)

anderleich commented 3 years ago

Ok, thanks! I gave negative weights according to word length and now I get the longest match.

One more thing. If the string_map is long, say 1000 words, cdrewrite is slow and it needs a lot of RAM memory. Is there a way to improve performance. Even if a read them from a file...

PS: I don't fully understand the snippet here since I can't infer the definition of _normalize. (Is that a typo for _lambda?)

Yes, it is a typo for _lambda

kylebgorman commented 3 years ago

Ok, thanks! I gave negative weights according to word length and now I get the longest match.

Great, I'd recommend something like that.

One more thing. If the string_map is long, say 1000 words, cdrewrite is slow and it needs a lot of RAM memory. Is there a way to improve performance. Even if a read them from a file...

This is a known efficiency issue with (any) context-dependent rewrite formalism. The cdrewrite transducer is built from several component automata, all whose construction is itself reasonably efficient, except that the rules which introduce "markers" (i.e., delimit where the rule begins and ends) require determinzation of the input projection of the acceptors. (Here begins a bit from our in-production book on Pynini, coming out this summer.)

*Consider a transducer which inserts a marker # after the regular language $L$; this corresponds loosely to several of the constituent transducers used by the Mohri & Sproat construction. In order to do this we need to construct a finite transducer that allows any sequence of characters in $\Sigma^$, possibly including previous instances of $L$, and inserts # after $L$. Such a transducer is obtained by inserting an arc with an $\epsilon$ input label and a # output label after the final state(s) of $L$ and marking the new destination state final. This requires that the input acceptor must be deterministic so that for any string $l \in L$, there is but one path through the automaton; otherwise the string could bypass insertion of #. As mentioned above, determinization can be quite expensive, particularly if $\Sigma$ is large or if $L$ is complex. So, for example, if one would like to write a rule that inserts a given word after a set of words, and this latter set of words numbers in the tens of thousands, then the compilation of said rule will be somewhat slow and the resulting FST quite large. Fortunately there is usually a way around such problems. In the hypothetical example above, one could arrange for all words in one's list of interest to be pre-tagged already with some marker, and condition the rule's environment on that marker alone. This will result in more efficient compilation and application, and produce a more compact rule transducer.**

anderleich commented 3 years ago

I see...

My goal is to normalize numbers, abbreviations, acronyms in random text strings, such as:

Today is 25th of January --> Today is twenty fifth of January.

I already have some transducers for each task. These, however, only consider valid input alone, not random strings. That's why I did this:

normalizer = (
    pynini.cdrewrite(_num2str, "", "", sigma_star)
    @ pynini.cdrewrite(_normalize_abbreviations, "", "", sigma_star)
    @ pynini.cdrewrite(_normalize_acronyms, "", "", sigma_star)
).optimize()

Abbreviations is 1000 word long and acronyms 100 words long. I expect these to expand with time.

As you mentioned, this is rather slow with so many words. Is there any better way to achieve my goal?

kylebgorman commented 3 years ago

If it takes a while but the resulting transducer is still useful for your purposes, I would just compute it offline, taking the one-time hit for determinization, and serialize the results, instead of trying to optimize the construction itself. There's an included library pynini.export which gives you nice decorators for storing a collection of FSTs this way.

anderleich commented 3 years ago

Sorry, I don't see how to do it. Do you mean saving the FST to file?

kylebgorman commented 3 years ago

Yes. Just do the computation offline and write to disk. FST objects have a .write method, or you can group a collection of them together and store them in a FAR file.

On Fri, Apr 23, 2021 at 8:21 AM anderleich @.***> wrote:

Sorry, I'm don't see how to do it. Do you mean saving the FST to file?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/kylebgorman/pynini/issues/41#issuecomment-825620063, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABG4OLUYECTW6OTOQ5436DTKFQ5BANCNFSM43MG7BHQ .

anderleich commented 3 years ago

Ok, great. How would I process a random string after that? Read the FST from file and the proceed as usual?

kylebgorman commented 3 years ago

Yes, that's right. No information will be lost.

K

On Fri, Apr 23, 2021 at 8:25 AM anderleich @.***> wrote:

Ok, great. How would I process a random string after that? Read the FST from file and the proceed as usual?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/kylebgorman/pynini/issues/41#issuecomment-825621911, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABG4OJJHQUYWSUHUM53LPLTKFRKTANCNFSM43MG7BHQ .

anderleich commented 3 years ago

Thanks! I works!

anderleich commented 3 years ago

This is what I've managed to build to expand email addresses.

Why is this needing so much time and memory consuming? The string_maps are short. Are closures causing it?

_az = pynini.union(*string.ascii_lowercase)
_AZ = pynini.union(*string.ascii_uppercase)
_digits = pynini.union(*string.digits)
_punctuation = pynini.string_map([
    ("_", "underscore"),
    ("-", "dash"),
    ("+", "plus"),
    (".", "dot")
])
_punctuation = pynutil.insert(" ") + _punctuation + pynutil.insert(" ")
_alphanum = pynini.union(_az, _AZ, _digits)

_local_part = _alphanum + pynini.union(_alphanum, _punctuation).closure(0,63)
_domain_part = pynini.union(_alphanum, pynini.string_map([("-", " dash "), (".", " dot ")])).closure()
#_domain_part_star = pynutil.join(_domain_part, pynini.cross(".", " dot ")).optimize()

_at = pynutil.insert(" ") + pynini.cross("@", "at") + pynutil.insert(" ")

_email = (_local_part + _at + _domain_part).optimize()

sigma_star = pynini.closure(byte.BYTE)
_email_star = pynini.cdrewrite(_email, "", "", sigma_star).optimize()
kylebgorman commented 3 years ago

I'd guess 0, 63. That's going to give you a huge automaton. Why not just replace that with *?

On Tue, Apr 27, 2021 at 10:46 AM anderleich @.***> wrote:

This is what I've managed to build to expand email addresses.

Why is this needing so much time and memory consuming? The string_maps are short. Are closures causing it?

_az = pynini.union(string.ascii_lowercase)_AZ = pynini.union(string.ascii_uppercase)_digits = pynini.union(*string.digits)_punctuation = pynini.stringmap([ ("", "underscore"), ("-", "dash"), ("+", "plus"), (".", "dot") ])_punctuation = pynutil.insert(" ") + _punctuation + pynutil.insert(" ")_alphanum = pynini.union(_az, _AZ, _digits) _local_part = _alphanum + pynini.union(_alphanum, _punctuation).closure(0,63)_domain_part = pynini.union(_alphanum, pynini.string_map([("-", " dash "), (".", " dot ")])).closure()#_domain_part_star = pynutil.join(_domain_part, pynini.cross(".", " dot ")).optimize() _at = pynutil.insert(" ") + pynini.cross("@", "at") + pynutil.insert(" ") _email = (_local_part + _at + _domain_part).optimize() sigma_star = pynini.closure(byte.BYTE)_email_star = pynini.cdrewrite(_email, "", "", sigma_star).optimize()

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/kylebgorman/pynini/issues/41#issuecomment-827664262, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABG4OKI6I6OD2PFFRDQZQ3TK3E55ANCNFSM43MG7BHQ .

anderleich commented 3 years ago

That was it, thanks!

So, it is better not to constrain local_part to be 64 characters long at maximum, isn't it?

kylebgorman commented 3 years ago

If it's not absolutely essential to your application and you don't want to wait for cdrewrite compilation to terminate, that's what I'd do.

On Tue, Apr 27, 2021 at 11:16 AM anderleich @.***> wrote:

That was it, thanks!

So, it is better not to constrain local_part to be 64 characters long at maximum, isn't it?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/kylebgorman/pynini/issues/41#issuecomment-827690592, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABG4OIZDQMUHWY5B6MT6LLTK3INJANCNFSM43MG7BHQ .