pytries / datrie

Fast, efficiently stored Trie for Python. Uses libdatrie.
http://pypi.python.org/pypi/datrie/
GNU Lesser General Public License v2.1
530 stars 88 forks source link

trie.items(), trie.keys() len(trie) et al. may be pending while trie is empty #17

Closed Honghe closed 9 years ago

Honghe commented 10 years ago

There is no notice while pending. And even may cause Segment Fault.

superbobry commented 9 years ago

Same bug here. Calling __contains__ and len on an empty Trie causes a segfault.

Here's gdb where output:

#0  0x00007ffff6e48681 in memcpy () from /lib64/libc.so.6
#1  0x00007ffff090cdf8 in dstring_append_char (ds=0xf95d80, data=0x7fffffffd8cc) at libdatrie/datrie/dstring.c:157
#2  0x00007ffff090cf33 in trie_string_append_char (ts=<value optimized out>, tc=4 '\004') at libdatrie/datrie/trie-string.c:97
#3  0x00007ffff090a850 in da_first_separate (d=0xf39200, root=<value optimized out>, keybuff=0xf95d80) at libdatrie/datrie/darray.c:780
#4  0x00007ffff090be7b in trie_iterator_next (iter=0xf48ba0) at libdatrie/datrie/trie.c:998
#5  0x00007ffff08ffe8e in __pyx_pf_6datrie_8BaseTrie_24__len__ (__pyx_v_self=<value optimized out>) at src/datrie.c:3736
#6  __pyx_pw_6datrie_8BaseTrie_25__len__ (__pyx_v_self=<value optimized out>) at src/datrie.c:3662
#7  0x00007ffff7aea18c in builtin_len (self=<value optimized out>, v=<value optimized out>) at Python/bltinmodule.c:1317
#8  0x00007ffff7af42a2 in call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4021
#9  PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2679
#10 0x00007ffff7af4c6e in PyEval_EvalCodeEx (co=0x7ffff7eddf30, globals=<value optimized out>, locals=<value optimized out>, args=<value optimized out>, argcount=1, kws=0x7ffff7eeba98, kwcount=0,
    defs=0x7fffe3810f98, defcount=2, closure=0x0) at Python/ceval.c:3265
#11 0x00007ffff7af32aa in fast_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4129
#12 call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4054
#13 PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2679
#14 0x00007ffff7af4c6e in PyEval_EvalCodeEx (co=0x7ffff7e90130, globals=<value optimized out>, locals=<value optimized out>, args=<value optimized out>, argcount=0, kws=0x0, kwcount=0, defs=0x0,
    defcount=0, closure=0x0) at Python/ceval.c:3265
#15 0x00007ffff7af4d82 in PyEval_EvalCode (co=<value optimized out>, globals=<value optimized out>, locals=<value optimized out>) at Python/ceval.c:667
#16 0x00007ffff7b14ba0 in run_mod (fp=0x6b9400, filename=<value optimized out>, start=<value optimized out>, globals=0x7ffff7f7e168, locals=0x7ffff7f7e168, closeit=1, flags=0x7fffffffdf80)
    at Python/pythonrun.c:1371
#17 PyRun_FileExFlags (fp=0x6b9400, filename=<value optimized out>, start=<value optimized out>, globals=0x7ffff7f7e168, locals=0x7ffff7f7e168, closeit=1, flags=0x7fffffffdf80)
    at Python/pythonrun.c:1357
#18 0x00007ffff7b14d7f in PyRun_SimpleFileExFlags (fp=0x6b9400, filename=0x7fffffffe390 "/tmp/grappa.py", closeit=1, flags=0x7fffffffdf80) at Python/pythonrun.c:949
#19 0x00007ffff7b2a664 in Py_Main (argc=<value optimized out>, argv=<value optimized out>) at Modules/main.c:645
#20 0x00007ffff6dddd5d in __libc_start_main () from /lib64/libc.so.6
#21 0x0000000000400649 in _start ()

Update: turns out there's an unbounded loop in darray.c:770. Investigating further ...

superbobry commented 9 years ago

I'm not 100% sure this is correct, but it seems like max_c is 0 if the trie was just created. But when max_c is 0, the loop does a single iteration, thus c is never 0 and the exit condition fails.

diff --git a/libdatrie/datrie/darray.c b/libdatrie/datrie/darray.c
index c07712a..ea57ff3 100644
--- a/libdatrie/datrie/darray.c
+++ b/libdatrie/datrie/darray.c
@@ -774,7 +774,7 @@ da_first_separate (DArray *d, TrieIndex root, TrieString *keybuff)
                 break;
         }

-        if (c == max_c)
+        if (max_c == 0 || c == max_c)
             return TRIE_INDEX_ERROR;

         trie_string_append_char (keybuff, c);

I can submit a PR if you're okay with the "solution".


Why is max_c 0? If trie only consists of a root node, then root points at the last cell in the DA array, which is initialized as {base = DA_POOL_BEGIN, check = 0}. Substituting this into the if condition we get

da_get_check (d, base + c) == root
// => da_get_check(d, base) == root
// => 0 == 2  (see da_get_root() for the 2)
kmike commented 9 years ago

Hi @superbobry,

I think it is better to change the upstream code (http://linux.thai.net/~thep/datrie/datrie.html) if there is a bug.

superbobry commented 9 years ago

Yup, submitting the bug to the libdatrie author was my first thought, but the project page lacks a link to the bug tracker or repo. Should I email the author directly?

kmike commented 9 years ago

Yes, I've emailed him in past; we discussed some libdatrie changes needed for this wrapper and he made them in an original repo. A couple of bugs was also fixed this way.

There is SVN repo: http://linux.thai.net/svn/software/datrie/?q=svn/software/datrie - it seems our copy is a bit outdated. There is also one incompatibility between SVN version and our version - here we have one extra method (https://github.com/kmike/datrie/commit/dbe366b498fe13abdc944b5e4eeff331d9063c1a). In past I've just made sure it is kept when libdatrie is updated.

superbobry commented 9 years ago

The bug has been fixed in the SVN version. Should we wait for the next release or pull the SVN version into the repo instead?

kmike commented 9 years ago

:+1: we should pull the SVN version.