kylebgorman / pynini

Read-only mirror of Pynini
http://pynini.opengrm.org
Apache License 2.0

Pynini istrings utf-8 errors #28

Closed: tuyethai closed this issue 4 years ago

tuyethai commented 4 years ago

Hello,

I have a problem printing all shortest paths from an FST (please see my code below). I suspect that my input symbols contain some bytes that are not valid UTF-8. Is there any way to get the paths as lists of input label IDs instead of symbols? Thank you very much in advance.

Code:

import pynini

shortest_filepath = ""
LM = pynini.Fst.read(shortest_filepath)
for i in LM.paths().istrings():
    print('path: ', i)

Error:

File "pynini.pyx", line 2221, in istrings
File "stringsource", line 38, in string.to_py.pyx_convert_PyUnicode_string_to_py_stdin_string
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x83 in position 0: invalid start byte


Environment: Python 3.7, OpenFst 1.7.4, Pynini 2.0.0


kylebgorman commented 4 years ago

Thanks for the clear report. I am assuming you have a real filename and not a blank string filename in your actual example...

First, try this:

for i in LM.paths(input_token_type="utf8").istrings():
    print("path: ", i)

If you really just want arc labels instead of strings, you can use ilabels(). That'll look a bit like this:

paths = LM.paths(input_token_type="utf8")  # The argument here is optional.
while not paths.done():
    print(paths.ilabels())
    paths.next()
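
If you then want readable text back from those labels, here is a minimal sketch in plain Python. It assumes that ilabels() gives you a list of integers and that each label is a byte value, with 0 standing for epsilon; labels_to_text is a hypothetical helper, not part of Pynini. It decodes with errors="replace", so a stray byte like the 0x83 in your traceback becomes U+FFFD instead of raising UnicodeDecodeError.

```python
def labels_to_text(labels, errors="replace"):
    """Decode a path's integer arc labels as UTF-8 bytes (hypothetical helper)."""
    # Assumption: label 0 is epsilon and carries no symbol, so it is skipped.
    # bytes() raises ValueError if any label is outside the range 0..255.
    raw = bytes(label for label in labels if label != 0)
    # errors="replace" swaps undecodable bytes for U+FFFD rather than raising.
    return raw.decode("utf-8", errors=errors)


print(labels_to_text([104, 105]))        # well-formed input
print(labels_to_text([0x83, 104, 105]))  # leading byte is not valid UTF-8
```

Passing errors="strict" instead should reproduce the original exception, which can help you find out which paths contain the bad bytes.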

You may also want to try updating to a newer Pynini; the current release is 2.1.2 and 2.0.0 came out over two years ago.

tuyethai commented 4 years ago

Thank you so much for your very quick reply. With the first suggestion, I still get the same error. Fortunately, the second suggestion solved my problem, which is great. I will update to the latest version as well. Many thanks again.