kylebgorman / pynini

Read-only mirror of Pynini
http://pynini.opengrm.org
Apache License 2.0

ltr vs rtl rewrite_rule clarification #66

Closed aupuzikov closed 1 year ago

aupuzikov commented 1 year ago

hi guys,

Can someone clarify the following situation? I expected that the ltr and rtl directions wouldn't matter in the example below, but they do. Why? I think there is some conceptual misunderstanding on my part, but I can't figure it out. Help please =)

from pynini.lib import rewrite
import pynini
alpha = [
    'o',
    'a',
    'ob'
]
sigma = pynini.closure(pynini.union(*(alpha)))
# first ltr = OK
rr = pynini.cdrewrite(pynini.cross('a', ''), "", "", sigma, direction='ltr')
rewrite.top_rewrites("aob", rr, 10)
# second rtl  = FAILED
rr = pynini.cdrewrite(pynini.cross('a', ''), "", "", sigma, direction='rtl')
rewrite.top_rewrites("aob", rr, 10)

pynini version '2.1.5'

kylebgorman commented 1 year ago

Directionality is interpreted with respect to the structural environment: the left and right contexts. Since the structural environment here is null, directionality has no effect.
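To make the role of the contexts concrete, here is a toy simulation in plain Python. It is not Pynini's implementation and does not call Pynini at all; `apply_rule` is a hypothetical helper that assumes single-character targets and contexts. It models one common reading of directional application: scanning left-to-right, the left context is checked against already rewritten material, while scanning right-to-left it is checked against the original input.

```python
def apply_rule(text, target, replacement, left, direction):
    """Toy directional rewrite: target -> replacement / left __ .

    `target` and `left` are single characters; left == "" means no left
    context. Illustration only, not how Pynini compiles rules.
    """
    chars = list(text)
    if direction == "ltr":
        order = range(len(chars))
    elif direction == "rtl":
        order = range(len(chars) - 1, -1, -1)
    else:
        raise ValueError(f"unknown direction: {direction!r}")
    for i in order:
        # chars[i - 1] has already been rewritten when scanning ltr,
        # but still holds the original input when scanning rtl.
        left_ok = left == "" or (i > 0 and chars[i - 1] == left)
        if chars[i] == target and left_ok:
            chars[i] = replacement
    return "".join(chars)


# With a non-null left context, the direction changes the result.
ltr_with_context = apply_rule("baa", "a", "b", "b", "ltr")
rtl_with_context = apply_rule("baa", "a", "b", "b", "rtl")
print(ltr_with_context, rtl_with_context)  # bbb bba

# With a null context (as in the snippet above), both directions agree.
ltr_null = apply_rule("aob", "a", "", "", "ltr")
rtl_null = apply_rule("aob", "a", "", "", "rtl")
print(ltr_null, rtl_null)  # ob ob
```

In the left-to-right case, each rewritten "a" feeds the context for the next one, which is why the two directions diverge only when a context is present.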

aupuzikov commented 1 year ago

@kylebgorman greetings! I don't want to be rude, but I think you closed this too early. The example has no structural environment for the sake of simplicity. Maybe I should have provided the output. I guess the issue is with an alphabet in which some keys are part of others: in the previous example these were o and ob, and in this one they are bo and o. In this latest example the ltr version fails and the rtl version succeeds, so I'm guessing it has something to do with the ordering of characters (bo vs. ob) inside the alphabet keys.

>>> import pynini
>>> from pynini.lib import rewrite
>>> alpha = [
...     'bo',
...     'o',
...     'a',
... ]
>>>
>>> sigma = pynini.closure(pynini.union(*(alpha))).optimize()
>>>
>>> rr = pynini.cdrewrite(pynini.cross('a', ''), "", "", sigma, direction='ltr')
>>> rewrite.top_rewrites("abo", rr, 10)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/apuzikov/jupyter-venv/lib/python3.6/site-packages/pynini/lib/rewrite.py", line 287, in top_rewrites
    lattice = rewrite_lattice(string, rule, input_token_type)
  File "/home/apuzikov/jupyter-venv/lib/python3.6/site-packages/pynini/lib/rewrite.py", line 96, in rewrite_lattice
    raise Error("Composition failure")
pynini.lib.rewrite.Error: Composition failure
>>> rr = pynini.cdrewrite(pynini.cross('a', ''), "", "", sigma, direction='rtl')
>>> rewrite.top_rewrites("abo", rr, 10)
['bo']
>>>

kylebgorman commented 1 year ago

I closed this because I didn't believe this was a bug in Pynini, just a tutorial question (and this is a bug tracker).

The fuller snippet gives me a picture of why you're seeing an error. Your sigma-star is not defined correctly. It makes no sense to say the alphabet is {a, bo, o}. However, if you modify it to {a, b, o} you will get a sensible rule, whether left-to-right or right-to-left. Rewrite your snippet with:

sigma = pynini.union("a", "b", "o").closure().optimize()

and the resulting rule in either direction will behave sensibly. (As for why it seems to work right-to-left I don't know but I don't think it's worth thinking too hard about either.)
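As a rough sanity check for the advice above, one can test whether every alphabet entry is a single symbol under byte tokenization before building sigma-star. The sketch below is plain Python and does not use Pynini; `multi_byte_entries` and `uncovered_bytes` are hypothetical helper names, and the check assumes the default byte tokenization mentioned later in this thread.

```python
def multi_byte_entries(alphabet):
    """Returns alphabet entries that span more than one byte in UTF-8."""
    return [s for s in alphabet if len(s.encode("utf-8")) != 1]


def uncovered_bytes(text, alphabet):
    """Returns bytes of `text` that are not single-byte alphabet entries."""
    singles = {s.encode("utf-8") for s in alphabet
               if len(s.encode("utf-8")) == 1}
    seen = {bytes([b]) for b in text.encode("utf-8")}
    return sorted(seen - singles)


bad_alphabet = ["bo", "o", "a"]
good_alphabet = ["a", "b", "o"]

print(multi_byte_entries(bad_alphabet))      # ['bo']
print(uncovered_bytes("abo", bad_alphabet))  # [b'b']
print(multi_byte_entries(good_alphabet))     # []
print(uncovered_bytes("abo", good_alphabet)) # []
```

With {a, bo, o}, the byte "b" never appears as a symbol in its own right, which is the sense in which that alphabet is ill-defined for byte-tokenized strings.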

aupuzikov commented 1 year ago

> and the resulting rule in either direction will behave sensibly. (As for why it seems to work right-to-left I don't know but I don't think it's worth thinking too hard about either.)

But there are situations where you don't want to split compound keys into indivisible sub-keys. For example, if I have a phoneme alphabet with a 'p:' phoneme, I don't want a separate ':' key, because a standalone colon can never be a meaningful phoneme. What should I do in this case if I want to use 'rtl' rewrite rules with this alphabet?

kylebgorman commented 1 year ago

If you don't want to split them, you have to tell Pynini to tokenize things not in bytes (as is the default) but according to your preferred tokenization scheme, by putting whitespace between each token and providing a symbol table. See the string processing docs or section 2.3 of Gorman & Sproat 2021. This is independent of the directionality question.
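The difference between the two tokenization schemes can be sketched in plain Python. This is only an illustration of the idea, not Pynini code: `byte_tokens`, `build_symbol_table`, and `symbol_tokens` are hypothetical helpers, and the dictionary stands in for a real symbol table. In Pynini itself, the analogous step would be to build a `pynini.SymbolTable` and pass it as the token type when compiling strings, as the string processing docs describe.

```python
def byte_tokens(text):
    """Byte tokenization: one label per UTF-8 byte (the default scheme)."""
    return list(text.encode("utf-8"))


def build_symbol_table(symbols):
    """Maps each symbol to an integer label; 0 is reserved for epsilon."""
    table = {"<eps>": 0}
    for sym in symbols:
        table.setdefault(sym, len(table))
    return table


def symbol_tokens(text, table):
    """Symbol-table tokenization: whitespace-delimited tokens to labels."""
    return [table[tok] for tok in text.split()]


table = build_symbol_table(["p:", "a", "o"])

# Under byte tokenization, 'p:' falls apart into 'p' and ':'.
print(byte_tokens("p:ao"))             # [112, 58, 97, 111]

# Under symbol-table tokenization, 'p:' stays a single label.
print(symbol_tokens("p: a o", table))  # [1, 2, 3]
```

With the second scheme, an alphabet such as {p:, a, o} consists of single symbols, so a sigma-star built over it is well-defined regardless of rule direction.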