pytries / marisa-trie

Static memory-efficient Trie-like structures for Python based on marisa-trie C++ library.
https://marisa-trie.readthedocs.io/en/latest/
MIT License

Getting UnicodeDecodeError accessing trie read from file #18

Open jottos opened 9 years ago

jottos commented 9 years ago

Hi, I'm consistently getting the following error when trying to access a trie after a load() or read() from a file.

./read_trie_test.py
Traceback (most recent call last):
  File "./read_trie_test.py", line 18, in <module>
    print(t.restore_key(0))
  File "marisa_trie.pyx", line 324, in marisa_trie.Trie.restore_key (src/marisa_trie.cpp:6365)
  File "marisa_trie.pyx", line 334, in marisa_trie.Trie.restore_key (src/marisa_trie.cpp:6299)
  File "marisa_trie.pyx", line 62, in marisa_trie._get_key (src/marisa_trie.cpp:1615)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 10: invalid start byte

I get the same error if the following code is used...

  for k in t.keys():
      print(k)

and again the same error if I use:

  t['someKey']  # or t[u'somekey']

The trie file reads in without any error. I've written the file using both trie.save() and trie.write(), and when writing the file I used codecs.open() and write() to force UTF-8 encoding.

I'm not sure if this is similar to issue #10.

jottos commented 9 years ago

OK, never mind. I was taking the examples a little too literally.

I was loading a saved BytesTrie into a trie constructed with Trie(). Once I switched to constructing a BytesTrie() and loading into that, it worked fine.
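For readers hitting the same error, here is a minimal sketch of the working pattern: save a BytesTrie, then load the file into a freshly constructed BytesTrie rather than a Trie. The keys, payloads, and file name are made up for illustration. As far as I know, BytesTrie joins each key and its payload with a separator byte (b'\xff' by default), which would be consistent with the 0xff byte in the traceback above when the same file is read as a plain Trie; treat that explanation as an assumption rather than a confirmed diagnosis.

```python
import os
import tempfile

import marisa_trie

# Build a BytesTrie: text keys mapped to bytes payloads (illustrative data).
original = marisa_trie.BytesTrie([("key1", b"value1"), ("key2", b"value2")])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "trie.marisa")  # hypothetical file name
    original.save(path)

    # Load into the same class that wrote the file, not into Trie().
    loaded = marisa_trie.BytesTrie()
    loaded.load(path)

    # A BytesTrie lookup returns a list of the payloads stored under the key.
    values = loaded["key1"]

print(values)
```

The same rule should apply to RecordTrie: construct the class that produced the file (with the same record format) before calling load().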

kmike commented 9 years ago

I'm glad it is not a bug in the marisa-trie source code :) Do you have any suggestions about how to change the docs to make them more clear regarding this?

jottos commented 9 years ago

So am I :)

As for the documentation: at the end of the load/save section, I'd just call out that a trie constructed with Trie() will not correctly load a RecordTrie or a BytesTrie file, even though the load itself will not fail. You need to construct the same trie class as the one you are trying to load.

Alternatively, the load() methods could throw an exception if a trie file of the wrong type is presented.

rspeer commented 7 years ago

Part of the problem here is that the BytesTrie class should offer a static method for loading. The thought process that I think both jottos and I encountered was:

If you could call BytesTrie.load('trie.marisa') as a static method, it would be easier to not go astray.
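Until something like that exists, a small wrapper can approximate it. The sketch below is a hypothetical helper, not part of marisa-trie: load_trie is a name invented here, and it only assumes that the trie class can be constructed with no arguments and has a load(path) method, as the classes discussed in this thread do.

```python
def load_trie(trie_cls, path):
    """Hypothetical helper: build an empty trie of trie_cls and load path into it.

    Naming the class at the call site makes it harder to load a file
    into the wrong trie type by accident.
    """
    trie = trie_cls()  # assumes a no-argument constructor
    trie.load(path)    # assumes a load(path) method on the instance
    return trie
```

Usage would look like load_trie(marisa_trie.BytesTrie, 'trie.marisa'). A RecordTrie needs its record format passed to the constructor, so this sketch would have to be extended to cover that case.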