pytries / marisa-trie

Static memory-efficient Trie-like structures for Python based on marisa-trie C++ library.
https://marisa-trie.readthedocs.io/en/latest/
MIT License

Trailing \x02 byte on restore_key result #10

Closed mjwillson closed 10 years ago

mjwillson commented 10 years ago

Like so:

In [46]: marisa_trie.Trie([u'foo', u'bar']).restore_key(0)
Out[46]: u'bar\x02'

This doesn't happen if I first get the key_id for that key:

In [48]: t = marisa_trie.Trie([u'foo', u'bar'])

In [49]: t.key_id(u'bar')
Out[49]: 0

In [50]: t.restore_key(0)
Out[50]: u'bar'

If calling key_id before restore_key is part of the contract, then that should probably be documented, and ideally a violation should raise some kind of exception rather than silently return an incorrect result.

kmike commented 10 years ago

No, this is not part of the contract AFAIK; this looks like a bug.

mjwillson commented 10 years ago

Cheers. Actually, the thing with doing a key_id beforehand may be a red herring -- the bug seems to disappear with the following innocuous change too:

In [81]: marisa_trie.Trie([u'foo', u'bar']).restore_key(0)
Out[81]: u'bar\x02'

In [82]: t = marisa_trie.Trie([u'foo', u'bar']); t.restore_key(0)
Out[82]: u'bar'

I'm guessing perhaps the Trie is getting garbage-collected in the first instance, but is returning a string whose memory is backed by that freed up space?

mjwillson commented 10 years ago

It seems to be a slightly weird, intermittent bug anyway (or at least it's hard to pin down what triggers it).

mjwillson commented 10 years ago

I'm not sure it's GC-related either, as it still happens if I call gc.disable().

Sometimes it happens on all runs after the first run:

In [3]: t = marisa_trie.Trie([u'foo', u'bar']); t.restore_key(0)
Out[3]: u'bar'

In [4]: t = marisa_trie.Trie([u'foo', u'bar']); t.restore_key(0)
Out[4]: u'bar\x02'

Sometimes I'm getting this error too:

In [2]: t = marisa_trie.Trie([u'foo', u'bar']); t.restore_key(0)
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-2-663a54335fc8> in <module>()
----> 1 t = marisa_trie.Trie([u'foo', u'bar']); t.restore_key(0)

/usr/local/lib/python2.7/dist-packages/marisa_trie.so in marisa_trie.Trie.restore_key (src/marisa_trie.cpp:4794)()

/usr/local/lib/python2.7/dist-packages/marisa_trie.so in marisa_trie.Trie.restore_key (src/marisa_trie.cpp:4728)()

UnicodeDecodeError: 'utf8' codec can't decode byte 0x85 in position 3: invalid start byte

kmike commented 10 years ago

This is reproducible - a weird bug! I'll try to get to it this weekend.

sisukapalli commented 10 years ago

Hi, I had the same problem too and just now ended up at this page. I have a somewhat large trie (2G), and found that it fails when run under IPython but works fine from the command line:

The following two work fine:

(1) echo 0 | marisa-reverse-lookup -r TRIEFILE.marisa
(2) python -c "from marisa_trie import Trie; print Trie().load('TRIEFILE.marisa').restore_key(0)"

The third one (in a running instance of IPython) fails:

(3) marisa_trie.Trie().load('TRIEFILE.marisa').restore_key(0)

However, it works in a new IPython instance.

Nothing very informative, but one more data point.

kmike commented 10 years ago

Thanks for the extra info.

It is interesting that this issue can be reproduced in an IPython shell, but doesn't manifest itself in a regular Python shell. A test case for it also doesn't fail.

I tried different IPython versions; @mjwillson's example works fine in IPython 0.10 but fails in 0.11+.

kmike commented 10 years ago

Also, it works fine in IPython 1.1 under Python 3.3.

kmike commented 10 years ago

@mjwillson @sisukapalli thanks for the info! This bug should be fixed in 0.5.2. It turned out that IPython vs. plain Python was a red herring: the restore_key method was building the result incorrectly.

Maybe when code is executed in IPython the memory layout is different and there are more non-zero bytes in memory; that could be why the problem pops up only in the IPython shell. When the byte after the end of the string was zero, the restore_key method returned a proper result.
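
For readers who want to see the failure mode described above in isolation, here is a minimal sketch. It is not the marisa-trie code or the actual 0.5.2 fix; it only assumes, based on the explanation above, that the result was read as if it were NUL-terminated instead of by its known length. The helper names (make_buffer, read_until_nul, read_with_length) are hypothetical and exist only for this illustration.

```python
import ctypes

KEY = b"bar"

def make_buffer(trailing):
    """Hypothetical helper: a C char buffer holding KEY, one trailing byte, then a NUL."""
    return ctypes.create_string_buffer(KEY + trailing + b"\x00", len(KEY) + 2)

def read_until_nul(buf):
    # Assumed buggy pattern: treat the buffer as NUL-terminated,
    # so whatever follows the key is included until a zero byte is hit.
    return buf.value

def read_with_length(buf, length):
    # Safer pattern: use the known key length and ignore what follows.
    return buf.raw[:length]

clean = make_buffer(b"\x00")    # byte after the key happens to be zero
dirty = make_buffer(b"\x02")    # byte after the key is leftover garbage
invalid = make_buffer(b"\x85")  # garbage that is not valid UTF-8

print(read_until_nul(clean))                # key only: the "lucky" case
print(read_until_nul(dirty))                # key plus a trailing \x02
print(read_with_length(dirty, len(KEY)))    # key only, regardless of garbage

try:
    read_until_nul(invalid).decode("utf8")
except UnicodeDecodeError as exc:
    # Same kind of error as in the traceback earlier in this thread.
    print(exc)
```

Under that assumption, all three symptoms reported in this thread (a correct result, a trailing \x02, and the UnicodeDecodeError at position 3) come from the same read, differing only in which byte happens to follow the key in memory.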