pytries / datrie

Fast, efficiently stored Trie for Python. Uses libdatrie.
http://pypi.python.org/pypi/datrie/
GNU Lesser General Public License v2.1
530 stars 88 forks source link

Trie has linear insertion time #32

Open electrofink opened 8 years ago

electrofink commented 8 years ago

The trie created by datrie has linear insertion time, i.e. the more entries are already in the trie, the longer it takes. This means the insertion complexiy looks like O(n) to me, with n being the number of elements in the trie. A quick look at the literature suggests that insertion time for a trie should be O(m) instead, with m being the length of the key. This seems to be an an issue, because it basically makes the trie unusuable for any sufficient large collection of entries. For example, I'm trying to store 60 million database keys (strings with a maximum length of about 12 characters) in a trie. I'm using the following code to store the database keys in a trie to perform fast prefix operations on them (such as: "return all database keys that match a certain prefix"):

def create_trie(file):
    trie = datrie.Trie(string.ascii_uppercase + string.digits)
    i = 0
    start = datetime.datetime.now()
    with open(file) as database_keys:
        for line in database_keys:
            database_key = line.rstrip('\n')
            trie[database_key] = database_key
            i += 1
            if i % 10000 == 0:
                end = datetime.datetime.now()
                delta = end - start
                print('Processed lines: ' + str(i) + ". Time for 10000 lines: " + str(delta), flush=True)
                start = datetime.datetime.now()
    return trie

This code results in the following log file: trie_creation_log.txt As you can see, the insertion process gets slower and slower as more entries are added to the trie. Given the fact that I have 60 million database keys, it obviously is too slow for my use-case. I have seen trie implementations that don't suffer from this problem, so I wanted to make you aware of this :smiley:

kmike commented 8 years ago

Yeah, this is a problem. See also discussion at https://github.com/pytries/datrie/issues/12.

electrofink commented 8 years ago

Thanks for the hint! The data I'm trying to insert is already sorted, in lexicographic order. I didn't check the C code for libdatrie, but I assume this behaviour arises from the implementation and not from the Python wrapper. Should I file a bug against libdatrie?

kmike commented 8 years ago

@semkath yes, firing a bug there is a good idea.

kmike commented 8 years ago

See also: https://github.com/scikit-learn/scikit-learn/issues/2639#issuecomment-30085346

electrofink commented 8 years ago

@kmike So, if I understand correctly: The double array implementation of a trie results in a more succinct data structure than the pointer-based implementation. The tradeoff for less memory usage is degraded behaviour when it comes to updating the trie? In my case, I don't really need to update the trie, but of course I need to build the trie. For comparison: The same process with a (completely different) trie implementation in Java (namely this one) goes over without a hitch, the average time to insert 10000 entries into the trie is about 5ms. I'm (unfortunately) not an expert in trie implementations but at least to me it was surprising that a trie implementation has O(n) insertion complexity when I expected O(m)

kmike commented 8 years ago

Yeah, it was also surprising to me that insertions are not O(1) in libdatrie. I think we should add this to the README, at least until there is a fix in libdatrie.

If you don't need to update the trie then check https://github.com/pytries/marisa-trie or https://github.com/pytries/DAWG; they use much less memory and don't have this issue.

kmike commented 8 years ago

Patricia trie (similar to the Java one) can be found in BioPython.