kylebgorman / pynini

Read-only mirror of Pynini
http://pynini.opengrm.org
Apache License 2.0

Operation failed with UTF8 character #17

Closed: Luvata closed this issue 4 years ago

Luvata commented 4 years ago

I'm learning Pynini to map digit characters to syllables, but I always get "Operation failed" when the second argument (fst2) of a transducer contains the character "ộ", even though I pass token_type='utf8' to both transducer and stringify.

Here is my code:

import pynini

ones_map = pynini.union(
    pynini.transducer("1", "một", token_type='utf8'),
    pynini.transducer("2", "hai", token_type='utf8'),
    pynini.transducer("3", "ba", token_type='utf8'),
)

chars = [chr(i) for i in range(1, 91)] + [r"\[", r"\\", r"\]"] + [chr(i) for i in range(94, 256)]
sigma_star = pynini.union(*chars).closure()
numbers = pynini.union("1", "2", "3", "4", "5", "6", "7", "8", "9", "0")

num_norm = (pynini.cdrewrite(ones_map, "", "", sigma_star))

def normalize(string):
    return pynini.compose(string.strip(), num_norm).stringify(token_type='utf8')

print(normalize("1"))  # Operation failed
print(normalize("2"))  # Success, output "hai"

kylebgorman commented 4 years ago

Hi, thanks for sending a detailed bug report.

The issue with your code is that sigma_star is defined in terms of bytes, but the transducers in ones_map use codepoints (token_type='utf8'). Because of this, sigma_star cannot actually accept 'ộ'.

You can go one of two ways. Either you can list codepoints like ộ when constructing sigma_star, or you can use the default byte token type.
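To make the mismatch concrete, here is a small plain-Python sketch (it does not call pynini, and the variable names are only illustrative). It compares the labels that 'ộ' would produce under each token type, assuming the character is in its precomposed form U+1ED9: one codepoint label with token_type='utf8', versus three byte labels with the default byte token type.

```python
# Plain Python only: this illustrates the labels involved, it does not call pynini.
o_dot = "\u1ed9"  # 'ộ' in precomposed (NFC) form; an assumption about the input

# With token_type='utf8', each arc label is a Unicode codepoint.
codepoint_labels = [ord(c) for c in o_dot]

# With the default byte token type, each arc label is one UTF-8 byte.
byte_labels = list(o_dot.encode("utf8"))

print(codepoint_labels)  # [7897]
print(byte_labels)       # [225, 187, 153]

# An alphabet built from chr(1)..chr(255) never mentions codepoint 7897,
# so none of its arcs can match the single 'utf8'-mode label.
print(max(range(1, 256)) < codepoint_labels[0])  # True
```

Either option above removes the mismatch: both sides end up using codepoint labels, or both sides end up using byte labels.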

A recommended in-line assertion test:

assert pynini.matches('ộ', sigma_star)

Luvata commented 4 years ago

Thank you for pointing that out to me, so my quick fix is:

chars = [chr(i) for i in range(1, 91)] + [r"\[", r"\\", r"\]"] + [chr(i) for i in range(94, 256)]
chars += [bytes(i, "utf8") for i in "aáàạãảăắằặẵẳâấầậẫẩbcdđeéèẹẽẻêếềệễểghiíìịĩỉklmnoóòọõỏôốồộỗổơớờợỡởpqrstuúùụũủưứừựữửvxyýỳỵỹỷfjzw"]
chars = set(chars)
sigma_star = pynini.union(*chars).closure()

I also removed every token_type='utf8' argument, and it now works seamlessly :dancer:

Once again, thank you for your awesome library.
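As a rough check of why this fix works, here is a plain-Python model of the byte-mode alphabet (not pynini itself). It assumes, as the explanation above implies, that in the default byte mode every string is compiled to its UTF-8 bytes. The helper name segmentable and the shortened letter list are hypothetical, and the bracket escaping from the real code is left out for simplicity.

```python
def segmentable(data: bytes, alphabet: set) -> bool:
    """Returns True if data is a concatenation of entries from alphabet."""
    # reachable[i] is True when data[:i] can be split into alphabet entries.
    reachable = [True] + [False] * len(data)
    for end in range(1, len(data) + 1):
        reachable[end] = any(
            reachable[end - len(sym)] and data[end - len(sym):end] == sym
            for sym in alphabet
            if len(sym) <= end
        )
    return reachable[len(data)]


# Simplified stand-in for the original alphabet: each chr(i) string is
# encoded to UTF-8, so chr(128)..chr(255) become two-byte sequences.
original = {chr(i).encode("utf8") for i in range(1, 256)}

# The quick fix adds the UTF-8 bytes of each Vietnamese letter
# (only three letters are shown here: ộ, đ, ư).
extended = original | {c.encode("utf8") for c in "\u1ed9\u0111\u01b0"}

word = "m\u1ed9t".encode("utf8")  # "một"
print(segmentable(word, original))  # False
print(segmentable(word, extended))  # True
```

In this model the original alphabet has no entry starting with the byte 0xE1, so "một" is rejected until the three-byte sequence for 'ộ' is added.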

kylebgorman commented 4 years ago

Glad it works! Peace.
