Determining what a transducer has accepted

anderleich commented 1 year ago

Hi,

I've successfully built a transducer to normalize numbers. However, besides the normalized string, I would also like to obtain the original number that was normalized. For example, this is what I'm getting currently:

10 --> ten
25 --> twenty five

I would like to get the original number inside some sort of XML tags:

10 --> <x>10</x> ten
25 --> <x>25</x> twenty five

How can I achieve this?

Thanks

kylebgorman commented 1 year ago

I don’t think that’s possible in the general case with FSTs since it looks suspiciously like an extension of the “copy language”, which FSTs cannot compute. However in Python code you can always concatenate the input and the output strings along an FST. To get the input string, project onto the input and call .string() (if there is but one path) or the .paths() iterator (if there are many).

anderleich commented 1 year ago

Hi @kylebgorman ,

Thanks for your answer! Can you give a small example of the Python method you suggest? I don't understand what you mean by "project".

Thanks

kylebgorman commented 1 year ago

Projection is an algorithm that converts a transducer to an acceptor over the domain or range of the FST.

Simple example that doesn't require projection:

lattice = istring @ rule
ostring = lattice.string()
your_desired_result = f"<x>{istring}</x> {ostring}"

Or, if you've already computed the lattice, you can get istring via input projection as follows:

istring = pynini.project(lattice, "input").string()

Then compute your_desired_result as above.

anderleich commented 1 year ago

Awesome, that works as expected! However, if I use pynini.cdrewrite, istring is the whole sentence instead of the FST range.

import pynini
from pynini.lib import byte, pynutil

rule = pynini.cdrewrite(pynini.cross("2", "two") + pynutil.insert(" ") + pynini.cross("3", "three"), "", "", byte.BYTE.closure())
istring = "233"
lattice = istring @ rule
pynini.project(lattice, "output").string()  # 'two three3'
pynini.project(lattice, "input").string()  # '233'

The desired output would be:

<x>23</x>two three3

Is there any way I can achieve that using cdrewrite?

Thanks

kylebgorman commented 1 year ago

I don't know what an "FST range" is, but I don't see any way to do what you have in mind in a CDRewrite, I'm sorry. I think it might be outside the scope of what FSTs can compute.

There's an example of writing an "FST tagger" in chapter 7 (I think) of the Pynini book which may or may not help you.

anderleich commented 1 year ago

Mmm... I see. I think I might not have approached/explained the task in the right way in the first place.

What I am seeking is number normalization in text, where numbers can be anywhere in that text. Additionally, I need some sort of alignment between the normalized number and the original one to know where the normalization has been applied. So, for instance, the sentence "There are 2 dogs in the street." would become "There are two dogs in the street." This is a trivial task, which I've already managed to accomplish like this:

normalizer = pynini.cross("2", "two")
text_normalizer = pynini.cdrewrite(normalizer, "", "", byte.BYTE.closure())

In order to know which numbers in that sentence have been normalized, I would do the following:

normalizer = pynutil.insert("<x>") + pynini.cross("2", "two") + pynutil.insert("</x>")
text_normalizer = pynini.cdrewrite(normalizer, "", "", byte.BYTE.closure())

This results in: There are <x>two</x> dogs in the street.

What I'm still missing is the information that the normalized number two corresponds to the original number 2. So I need to somehow obtain that information either from the FST itself or programmatically in Python by comparing the strings. The second option is the one that I'm trying to avoid if there is a more straightforward way of doing it with Pynini. The method you mentioned in the previous post returned the whole sentence, as I'm using cdrewrite. I just need the normalizer part.

So is there a way to get that done with the FST or should I move to Python?

Thank you for your time ;)

kylebgorman commented 1 year ago

If you take the text_normalizer and apply it via composition to an input string, then you can get the byte-by-byte alignment using the paths() method, which yields an iterator object with .ilabels() and .olabels() methods. Crucially, these are lists of integers of the same length (with epsilons), so you can read the alignment off them.

istring = "There are 2 dogs in the street."
lattice = istring @ text_normalizer
paths = lattice.paths()
while not paths.done():
    ilabels = paths.ilabels()
    olabels = paths.olabels()
    ...  # Do something with these two lists.
    paths.next()
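
As a rough illustration of what can be done with these two lists, here is a minimal sketch in plain Python, with no Pynini required. It assumes byte-mode labels, that 0 is the epsilon label, and that positions where the input and output labels are equal were left untouched by the rule. tag_rewrites is a hypothetical helper name, and the label lists in the usage lines are written by hand for illustration; the alignment Pynini actually produces may look different.

```python
EPSILON = 0  # Assumption: label 0 denotes epsilon, as in OpenFst.


def tag_rewrites(ilabels, olabels):
    """Wraps each rewritten input span in <x>...</x>, followed by its output.

    Positions where both labels are equal are copied through unchanged.
    """
    out = bytearray()
    src = bytearray()  # Input bytes of the pending rewritten span.
    tgt = bytearray()  # Output bytes of the pending rewritten span.

    def flush():
        # Emits the pending span, if any, and resets the buffers.
        if src or tgt:
            out.extend(b"<x>" + src + b"</x>" + tgt)
            src.clear()
            tgt.clear()

    for ilabel, olabel in zip(ilabels, olabels):
        if ilabel == olabel:
            flush()
            if ilabel != EPSILON:
                out.append(ilabel)
        else:
            if ilabel != EPSILON:
                src.append(ilabel)
            if olabel != EPSILON:
                tgt.append(olabel)
    flush()
    return out.decode("utf8")


# Hand-written alignment for "a 2 b" -> "a two b" (illustrative only).
ilabels = [97, 32, 50, 0, 0, 32, 98]
olabels = [97, 32, 116, 119, 111, 32, 98]
print(tag_rewrites(ilabels, olabels))  # a <x>2</x>two b
```

Note that a rewrite whose output happens to carry the same label as its input at some position (an a:a pair inside a rewritten span, say) would be treated as unchanged text, so this heuristic is not fully general.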

Anyways, this is a pretty advanced feature and not terribly easy to use and out of scope for the issue tracker ;)

anderleich commented 1 year ago

That definitely works! That's what I needed.

The only thing I'm missing is the integer-to-char conversion. I've seen a solution here: https://github.com/kylebgorman/pynini/issues/55#issuecomment-1186517938

It seems that when I add a character to the SymTable, it already has a fixed integer associated with it. The SymTable is not created by default, so how can I be sure I've added all the necessary chars and none is missing (all the accented chars, for example)? If chars already have integers associated with them, why can't I simply do the opposite (int2char)? I'm using the symtable.find("a") method to convert the integers after having added them to the SymTable.

PS: Is there a forum or something similar where I could ask these types of questions instead of using the issue tracker?

kylebgorman commented 1 year ago

The easiest thing to do here would be to use the built-in Python chr function which converts integers to the corresponding Unicode character string:

# Suppose we already have `ilabels`.
ichars = [chr(ilabel) for ilabel in ilabels if ilabel]

PS: Is there a forum or something similar where I could ask these types of questions instead of using the issue tracker?

There used to be a Pynini forum on the website, but nobody ever used it, so I decommissioned it. I'm happy to take questions over email (within reason, of course).

anderleich commented 1 year ago

I'm having some trouble converting back accented chars: dógs --> dÃ³gs. Those chars are broken in pieces due to encoding. How can I treat them correctly? I found there is a token_type="utf8" parameter but it is only available in acceptors (pynini.accep)

kylebgorman commented 1 year ago

You should use utf8 mode everywhere in your code. That means instead of taking advantage of the implicit conversion from strings to byte-mode FSTs, you have to explicitly compile them with accep or use the default_token_type context manager, which changes the default token type mode for implicit conversions within a block. Read the in-module docs for the above (e.g., help(pynini.accep)), and/or chapter 2 of the book. Here is a description of what this is doing at the lowest level.
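
To make the difference between the two modes concrete, the following sketch uses only plain Python. It assumes that byte-mode labels are the UTF-8 bytes of the string and that utf8-mode labels are Unicode code points, which is consistent with the broken characters reported above. The label lists are built by hand instead of being read off an FST, and all helper names are hypothetical.

```python
def byte_labels(text):
    """Labels as byte mode would see them: one label per UTF-8 byte."""
    return list(text.encode("utf8"))


def codepoint_labels(text):
    """Labels as utf8 mode would see them: one label per code point."""
    return [ord(ch) for ch in text]


def decode_with_chr(labels):
    """Applies chr() to each non-epsilon label."""
    return "".join(chr(label) for label in labels if label)


def decode_as_bytes(labels):
    """Reassembles non-epsilon labels into bytes, then decodes as UTF-8."""
    return bytes(label for label in labels if label).decode("utf8")


word = "dógs"
print(byte_labels(word))                        # [100, 195, 179, 103, 115]
print(decode_with_chr(byte_labels(word)))       # dÃ³gs (bytes split apart)
print(decode_as_bytes(byte_labels(word)))       # dógs
print(decode_with_chr(codepoint_labels(word)))  # dógs
```

In other words, chr() is only the right inverse when each label is a full code point; with byte-mode labels the bytes have to be joined before decoding.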

anderleich commented 1 year ago

Great! The context manager solved it without a significant change in the code

with pynini.default_token_type("utf8"):
    ... # Previous code

I'm closing this issue. Thanks for your time and explanations ;)

anderleich commented 1 year ago

Hi @kylebgorman ,

Just one little thing regarding the encoding. There seem to be some issues with some utf8 characters if I use cdrewrite with BYTE.closure() as sigma star. See:

with pynini.default_token_type("utf8"):
    ("▁" @ pynini.cdrewrite(pynini.cross('▁', "_"), "", "", pynini.lib.byte.BYTE.closure())).paths().ilabels()
>>> []

If I set the default_token_type to byte instead of utf8, it returns the 3-byte utf8 character as expected:

with pynini.default_token_type("byte"):
    ("▁" @ pynini.cdrewrite(pynini.cross('▁', "_"), "", "", pynini.lib.byte.BYTE.closure())).paths().ilabels()
>>> [226, 150, 129]

Not using the cdrewrite returns the correct result, as I'm not using the sigma star:

with pynini.default_token_type("utf8"):
    ("▁" @ pynini.cross('▁', "_")).paths().ilabels()
>>> [9601]

It seems I need to change the sigma star. Which should I use to accept all utf8 characters?

kylebgorman commented 1 year ago

As the name suggests, the definitions in the byte submodule are only appropriate for byte mode. If you're using utf8 mode, you should use the definitions in the utf8 submodule, or (and I suggest this) just compute the closure of the union of all the characters you are actually using.
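
One way to reduce the risk of missing a character is to collect the alphabet from the text you actually expect to process. The sketch below does only that collection step in plain Python; char_inventory and the sample sentences are hypothetical. The commented-out lines show how the result might then be turned into a sigma star, on the assumptions that pynini.union accepts several string arguments, that pynini.escape is needed for characters such as brackets and backslashes that are special in Pynini strings, and that the call runs inside a utf8 default_token_type block.

```python
def char_inventory(texts):
    """Returns the sorted list of distinct characters seen in `texts`."""
    return sorted({ch for text in texts for ch in text})


# Hypothetical sample data; in practice, use your own corpus.
texts = ["There are 2 dogs in the street.", "dógs ▁"]
chars = char_inventory(texts)
print(chars)

# Assumed Pynini usage (not executed here; check it against your version):
# import pynini
# with pynini.default_token_type("utf8"):
#     sigma_star = pynini.union(*(pynini.escape(ch) for ch in chars)).closure()
```

Any character that never occurs in the collected text will still be rejected by the resulting sigma star, so the sample needs to be representative of the real input.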

anderleich commented 1 year ago

Great! Computing the closure of the union of all the characters should do the trick. I'll need to be cautious while hard-coding all the possible characters not to miss one though ;)

Thanks!

kylebgorman / pynini

Determining what a transducer has accepted #58