pytries / datrie

Fast, efficiently stored Trie for Python. Uses libdatrie.
http://pypi.python.org/pypi/datrie/
GNU Lesser General Public License v2.1
530 stars 88 forks source link

KeyError after insertion with large alphabet #10

Open zseder opened 11 years ago

zseder commented 11 years ago

Hello!

I have a huge alphabet, its size is 10970. I created a Trie with that alphabet and then inserted some words, but some of them gave me a KeyError after insertion and I don't know how to investigate the problem. My code is:

from collections import defaultdict
import datrie
import cPickle

a = cPickle.load(open("alphabet"))

d = datrie.Trie(a)
with open("freebase_list.head") as f:
  for l in f:
    try:
      le = l.strip().decode("utf-8", "ignore").split("\t")
      if len(le) != 3: continue
      w = le[0]
      if not w in d:
        d[w] = defaultdict(list)
      try:
        d[w][le[1]].append(le[2])
      except KeyError:
        print "KE", repr(w)
        break
    except UnicodeDecodeError:
      continue

And my two supplementary files are here: http://www.ilab.sztaki.hu/~zseder/freebase_list.head http://www.ilab.sztaki.hu/~zseder/alphabet

with these two files one can see, that at the last line of input, we got a KeyError right after insertion. I did another test with removing a character from the alphabet occuring in this line, \u968a, but then I'll have another error, with another line, so instead of going on and collecting all the crashing examples, I hope this information is enough for more debugging for you.

Thanks!

kmike commented 11 years ago

Hello!

Unfortunately 256 is a hard limit on alphabet in underlying libdatrie library. It seems that there are missing checks for that both in the wrapper and in the library. So the first step could be to add those checks.

The second step could be to remove this limitation. I think the best way to do that is to make libdatrie unicode support more flexible or just remove built-in unicode support from libdatrie (and encode data to utf8 at wrapper side - this will make trie faster and will remove the need to specify alphabet). This should require some collaboration with libdatrie author because it is not good to bundle heavily modified library version.

Currently if you need |alphabet| > 256 you have to use some hacks (e.g. artificially map your data to 256-based alphabet using encode/decode tricks) or use an another library (e.g. Trie from BioPython or https://github.com/kmike/hat-trie or https://github.com/kmike/marisa-trie or https://github.com/kmike/DAWG)

zseder commented 11 years ago

Thank you! I would suggest you as 0th step to put this hard limit info to the README.rst, that was one of my guesses of the problem, but since the readme said nothing about it, I thought it will be something else. Anyway, thanks for your help and suggestions of the other libs, I'll check them out. (I don't want to close this issue instead of you, but I think it can be closed)

Honghe commented 10 years ago

@kmike Thanks for your wrapper! May I know if you are going to expand the alphabet table? Both of length |alphabet| > 256 and the length of each letter of alphabet |letter| > 1 unicode? I think, your idea of to make libdatrie unicode support more flexible or just remove built-in unicode support from libdatrie is a good idea! Thanks again!

kmike commented 10 years ago

Hi @Honghe,

Ways to fix it:

1) Simply make |alphabet| larger, i.e. use 2 bytes instead of one - it will change memory requirements of datrie, and it will require maintaining a custom libdatrie fork. I think it is a no-starter. 2) Fix alphabet to 0-255 and encode/decode data ourselves in a wrapper. This would fix the issue, but it is suboptimal - there will be unnecessary work and so the wrapper may become slower. 3) Find a way to bypass alphabet handling and encode/decode data in a wrapper. 4) Collaborate on this issue with libdatrie C library author and find a way to make libdatrie unicode support more flexible.

(3) and (4) look like optimal solutions, but I don't have plans to implement any of this in near future - contributions are welcome.

Honghe commented 10 years ago

Thanks Kmike,

And one more question, I haven't found datrie has any API to get the children nodes of a given node, instead it only can iterate all items with a given prefix from a trie. So whether can we implement this functionality, libdatrie has the api or not?

On Thu, Feb 27, 2014 at 2:44 AM, Mikhail Korobov notifications@github.comwrote:

Hi @Honghe https://github.com/Honghe,

Ways to fix it:

1) Simply make |alphabet| larger, i.e. use 2 bytes instead of one - it will change memory requirements of datrie, and it will require maintaining a custom libdatrie fork. I think it is a no-starter. 2) Fix alphabet to 0-255 and encode/decode data ourselves in a wrapper. This would fix the issue, but it is suboptimal - there will be unnecessary work and so the wrapper may become slower. 3) Find a way to bypass alphabet handling and encode/decode data in a wrapper. 4) Collaborate on this issue with libdatrie C library author and find a way to make libdatrie unicode support more flexible.

(3) and (4) look like optimal solutions, but I don't have plans to implement any of this in near future - contributions are welcome.

— Reply to this email directly or view it on GitHubhttps://github.com/kmike/datrie/issues/10#issuecomment-36161191 .

Honghe

kmike commented 10 years ago

"Given prefix" defines a node, and items with a given prefix are paths to this node's children from the root of the trie. Do you want a method that returns suffixes instead of full keys? If so, use suffixes method. Or do you want a method that return walkable chars for a given prefix? There is such function in libdatrie (trie_state_walkable_chars), but it is not exposed in a wrapper. What is your use case? Can you please explain what you want in more details, maybe I misunderstood you?

Honghe commented 10 years ago

@kmike Just as code pasted bellow, use two methods to get the children of node a, but they all return node prefixed with node a, but really want is return only node ab and node ad, the direct children of a. Do I explain OK? Thanks!

In [61]: import datrie
In [62]: from datrie import Trie
In [63]: import numpy as np
In [64]: import string
In [65]: trie = Trie(string.ascii_lowercase)
In [66]: trie[u'a'] = 1
trie[u'abc'] = 3
In [67]: trie[u'ab'] = 2
In [68]: trie[u'abc'] = 3
trie[u'ad'] = 4
trie[u'adf'] = 5
In [69]: trie[u'ad'] = 4
In [70]: trie[u'adf'] = 5

# method one
In [71]: trie.items(u'a')
Out[71]: [(u'a', 1), (u'ab', 2), (u'abc', 3), (u'ad', 4), (u'adf', 5)]

# method two
In [72]: state = datrie.State(trie)
In [73]: state.walk(u'a')
Out[73]: True
In [74]: it = datrie.Iterator(state)
while it.next():
In [75]: while it.next():
   ....:         print it.key(), it.data()
   ....:     
 1
b 2
bc 3
d 4
df 5
kmike commented 10 years ago

If there is 'aaa' string, but not 'aa', do you want to return 'aa'? Note that in this case 'aa' path also defines a node, and this node is a direct child of a node defined by path 'a', but this node is not terminal.

Honghe commented 10 years ago

In your current implement of datrie, if I only inert a and aaa but no aa, it only return aaa. Just as you say, in trie aa path alse defines a node, and this node is a direct child of a node defined by path 'a', but this node is not terminal. And my aim is to return terminal children, Cause the terminal has value to be used. Just as code bellow:

In [6]: trie[u'a'] = 1
In [10]: trie[u'aa'] = 2
In [11]: trie[u'aaa'] = 3
In [12]: trie[u'aaaaa'] = 5

In [14]: trie.suffixes(u'a')
Out[14]: [u'', u'a', u'aa', u'aaaa']

In [15]: trie.items(u'a')
Out[15]: [(u'a', 1), (u'aa', 2), (u'aaa', 3), (u'aaaaa', 5)]

On Thu, Feb 27, 2014 at 11:37 AM, Mikhail Korobov notifications@github.comwrote:

If there is 'aaa' string, but not 'aa', do you want to return 'aa'? Note that in this case 'aa' path also defines a node, and this node is a direct child of a node defined by path 'a', but this node is not terminal.

— Reply to this email directly or view it on GitHubhttps://github.com/kmike/datrie/issues/10#issuecomment-36207089 .

Honghe

Honghe commented 10 years ago

@kmike so how can I do to get terminal children? Thanks!

kmike commented 10 years ago

The easiest way would be to just use items which has len(key)==len(prefix)+1. It will give you right items, but will do some unnecessary computations.

We may also expose trie_state_walkable_chars as a method/property of TrieState (https://github.com/kmike/datrie/blob/master/src/datrie.pyx#L759). This way you'll be able to get a list of walkable chars, walk each char and check if the resulting state is terminal. Depending on data, this could be slower than filtering items because of Python overhead, but this method will allow to avoid unnecessary walking / building of long keys.

The third option is to implement a method that will do exactly what you want as a Trie method. Could you please explain your use case? I still don't understand why do you want to find such items. Is this task general enough to deserve a method?

Honghe commented 10 years ago

Yes, also think of the first way, it is computational.

In our case, we want to count the distribution of a given char[s]. Given a string aaabbabbabd, we count the appearance number of substring, just as a appears 5 times, aa combination appears twice, ab combination appear 3 times, et al. just as bellow

#word counts
a    5
aa  2
ab  3
abb 2
abd 1
...

After have added these words to trie, if the trie has the method to get the direct terminal children, it is easy to get as bellow:

[In] : trie.terminal_children()
[Out] : ['aa': 2, 'ab': 3]

And trie.terminal_children() will not return abb, abd et al. Cause we only want to get direct terminal children.

In our field, it is often to get the node's direct terminal children and direct children (terminal and internal) with only one char longer. So if it is not easy to modify datrie, can you recommend any other trie/tree that can easily and fast handle this kind of children indices? Thanks!