Trie has linear insertion time

electrofink commented 8 years ago

The trie created by datrie has linear insertion time, i.e. the more entries are already in the trie, the longer each insertion takes. This means the insertion complexity looks like O(n) to me, with n being the number of elements in the trie. A quick look at the literature suggests that insertion time for a trie should be O(m) instead, with m being the length of the key. This seems to be an issue, because it basically makes the trie unusable for any sufficiently large collection of entries. For example, I'm trying to store 60 million database keys (strings with a maximum length of about 12 characters) in a trie. I'm using the following code to store the database keys in a trie so that I can perform fast prefix operations on them (such as "return all database keys that match a certain prefix"):

import datetime
import string

import datrie


def create_trie(file):
    # Keys consist only of uppercase ASCII letters and digits.
    trie = datrie.Trie(string.ascii_uppercase + string.digits)
    i = 0
    start = datetime.datetime.now()
    with open(file) as database_keys:
        for line in database_keys:
            database_key = line.rstrip('\n')
            trie[database_key] = database_key
            i += 1
            # Report how long each batch of 10000 insertions took.
            if i % 10000 == 0:
                end = datetime.datetime.now()
                delta = end - start
                print('Processed lines: ' + str(i) + ". Time for 10000 lines: " + str(delta), flush=True)
                start = datetime.datetime.now()
    return trie

This code results in the following log file: trie_creation_log.txt. As you can see, the insertion process gets slower and slower as more entries are added to the trie. Given that I have 60 million database keys, this is too slow for my use case. I have seen trie implementations that don't suffer from this problem, so I wanted to make you aware of this :smiley:
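
To illustrate what I mean by O(m) insertion, here is a minimal sketch of a trie built from nested Python dicts. This is not how datrie works internally, and the names `DictTrie`, `insert`, and `keys_with_prefix` are made up for this example. It only shows that an insertion visits one node per character of the key, regardless of how many keys are already stored.

```python
class DictTrie:
    """Toy trie made of nested dicts (hypothetical, for illustration only)."""

    _END = object()  # sentinel dict key marking "a stored key ends here"

    def __init__(self):
        self.root = {}

    def insert(self, key, value):
        node = self.root
        # One dict lookup per character, so the work grows with len(key),
        # not with the number of keys already in the trie.
        for ch in key:
            node = node.setdefault(ch, {})
        node[self._END] = value

    def keys_with_prefix(self, prefix):
        # Walk down to the node that represents the prefix.
        node = self.root
        for ch in prefix:
            node = node.get(ch)
            if node is None:
                return []
        # Collect every stored key in the subtree below that node.
        results = []
        stack = [(prefix, node)]
        while stack:
            current, node = stack.pop()
            for ch, child in node.items():
                if ch is self._END:
                    results.append(current)
                else:
                    stack.append((current + ch, child))
        return sorted(results)


trie = DictTrie()
for database_key in ("AB12", "AB34", "CD56"):
    trie.insert(database_key, database_key)
print(trie.keys_with_prefix("AB"))
```

A structure like this would likely use far too much memory for 60 million keys, so it is not a replacement for datrie, just a reference point for the expected insertion behaviour.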