pytries / datrie

Fast, efficiently stored Trie for Python. Uses libdatrie.
http://pypi.python.org/pypi/datrie/
GNU Lesser General Public License v2.1

Delete method __delitem__ hidden (but maybe for a good reason?) #12

Open ianozsvald opened 10 years ago

ianozsvald commented 10 years ago

datrie.pyx defines a __delitem__ magic method; I can call it explicitly to remove an item from the Trie.

I don't understand why there isn't a public (non-double-underscore) function to delete an item from the Trie, or at least a mention in the README about how to delete items.

Reviewing libdatrie, it looks as though deletion is fully supported. Perhaps I'm missing a good reason why this method is kept hidden?

Sidenote - many thanks for the superb wrapper; you've made my bioinformatics boss a happy man.

kmike commented 10 years ago

Hey Ian,

The wrapper uses a dict-like interface for that: del trie['key'] should work. You're right that it would be better to document this; it is indeed not documented. I'm glad to make your bioinformatics boss happy :)
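
For reference, a minimal sketch of the dict-like deletion (the keys here are made up for illustration):

```python
import string
import datrie

# Build a small trie keyed on lowercase ASCII (the constructor takes the
# allowed alphabet, as in the README example).
trie = datrie.Trie(string.ascii_lowercase)
trie[u'foo'] = 1
trie[u'foobar'] = 2

del trie[u'foo']  # dict-like deletion; calls __delitem__ under the hood

assert u'foo' not in trie
assert u'foobar' in trie
```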

I'm sure you already know this, but just in case: please note that random insertions and deletions can be slow because of how the trie is stored; these operations may require moving large chunks of memory. If fast random inserts/deletions are needed, it is better to use a pointer-based trie, e.g. the Trie from BioPython.
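
For example, a minimal sketch with BioPython's pointer-based trie (this assumes an older BioPython release; the Bio.trie C extension has been removed from recent versions):

```python
# Assumes an older BioPython release that still ships the Bio.trie
# C extension (removed from recent versions).
from Bio import trie

t = trie.trie()      # pointer-based trie: random inserts stay cheap
t["gattaca"] = 1     # plain dict-like assignment
t["gatt"] = 2
print(t["gattaca"])  # -> 1
```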

ianozsvald commented 10 years ago

Ha, I never thought to try del trie[u'key'] - I live and learn! Perhaps you could add a note about deletion to the README? It might save someone else from the same head-scratching.

Re. random inserts - actually I have that problem. I have 50,000 patterns of 8 to 2,000 characters each, and the trie takes 22 minutes to build.

I've tried sorting by key length plus alphabetically (and each separately), but that doesn't improve the build time; these variants all take the same time to construct the Trie.
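
For concreteness, the experiments look roughly like this (a sketch; `patterns` stands in for my real list of sequences):

```python
import time
import datrie

def timed_build(keys, alphabet=u'ACGT'):
    # Build a datrie.Trie from `keys` in the given order and report the time.
    start = time.time()
    trie = datrie.Trie(alphabet)
    for i, key in enumerate(keys):
        trie[key] = i
    print('%d keys in %.1fs' % (len(keys), time.time() - start))
    return trie

# patterns = [...]  # my sequence strings (not shown)
# All of these orderings take roughly the same time for me:
# timed_build(sorted(patterns))                             # alphabetical
# timed_build(sorted(patterns, key=len))                    # by length
# timed_build(sorted(patterns, key=lambda s: (len(s), s)))  # length + alpha
```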

I'm interested in building once and then reading it many times. I can live with a 22 min build time (I only do it once and then serialize) but I wonder if there's a way to structure or initialize the data to make it build faster?

Perhaps I could hint at the expected maximum size of the structure to ease memory pre-allocation?

kmike commented 10 years ago

You're right, I should add a note about del to the README.

Hmm, sorting all the data alphabetically before insertion should make building as fast as possible. I'm not sure why it doesn't help in your case.

If you're interested in building once and reading many times, you may find https://github.com/kmike/DAWG useful: it should be faster than datrie and uses far less memory in most cases; it also doesn't have the alphabet limitation. Not all of the datrie API is supported, most notably advanced iteration. Nothing prevents advanced iteration support from being implemented, but nobody has written it yet.

For static data, https://github.com/kmike/marisa-trie is also an option; it is much slower, but it can handle huge datasets and uses even less memory (depending on the data, of course). Even less of the datrie API is supported, though.
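
Minimal build-and-query sketches for both, assuming a list of unicode keys (the data here is a placeholder):

```python
import dawg
import marisa_trie

keys = [u'foo', u'foobar', u'bar']  # placeholder data

# DAWG: built once from an iterable of keys, then static.
# CompletionDAWG supports prefix queries via .keys(prefix).
d = dawg.CompletionDAWG(keys)
print(d.keys(u'foo'))  # all keys starting with u'foo'

# MARISA-trie: also built once, very compact; same style of prefix query.
m = marisa_trie.Trie(keys)
print(m.keys(u'foo'))
```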

kmike commented 10 years ago

I think the issue with datrie is not with memory allocation, but with copying of data, so preallocation won't help.

ianozsvald commented 10 years ago

I benchmarked DAWG and Marisa-trie last week; datrie is superior for read performance (it is 2-3x faster than both DAWG and Marisa-trie) - I'm doing a prefix search.
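
The read benchmark is essentially repeated prefix queries, roughly like this sketch (the trie contents and query prefixes here are placeholders):

```python
import datrie

# Placeholder stand-in for the real 53k-pattern trie.
trie = datrie.Trie(u'ACGT')
trie[u'GATTACA'] = 1
trie[u'GATT'] = 2

# Roughly what the read benchmark does: repeated prefix queries.
for prefix in [u'GATT', u'ACGT']:
    matches = trie.keys(prefix)  # all stored keys starting with `prefix`
    print(prefix, matches)
```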

With a smaller dataset (about 49,000 items) it builds in a comparable time (about 2 seconds) to DAWG/Marisa-Trie; it is only with a different and larger (53k) dataset that the datrie build time gets much slower (22 minutes). With the same 53k dataset, Marisa-trie takes 0.1s to build and DAWG takes 4s (i.e. they're both consistent; only datrie is inconsistent).

In all cases memory usage seems to be relatively small, so I don't worry about it.

Re. sorting - if I sort by key length and alphabet, the first 50k items are inserted into the datrie in under a second; it then progressively slows down for the remaining 3k items.

If I reverse the list (inserting the longest patterns with the highest letters first), it starts very slowly and then suddenly completes very quickly - it takes 20 minutes to do the first 4,000 longest items, then 2 seconds for the remaining 49,000 shorter items.

Looking at my current dataset, I have 1,000 items between 2,000 and 60,000 characters in length, and 52k items under 2,000 characters. Your examples talk about using dictionaries of words. Do you think long sequences (e.g. 2,000-60,000 characters) with a small alphabet (GATC) pose a particular problem for this implementation of datrie?

Breaking the build down by key length, I can see that it takes:

EDIT - for completeness I'll note that BioPython's Trie builds the same 53k index in 0.2s (like Marisa-Trie) and searches it in 0.3s (slower than datrie, DAWG, Marisa-Trie).

kmike commented 10 years ago

That's interesting, thanks for sharing.

Unlike other implementations, libdatrie stores key "tails" in a separate data structure: shared prefixes are stored in the double-array trie, but suffixes that are not shared are stored separately as raw strings.

This could explain why datrie is faster for longer keys: in many cases there is no need to walk the trie after matching some characters at the beginning; a prefix search can be as easy as advancing a pointer. I'm not sure this is exactly the reason datrie is faster for you, but it could be. The Trie from BioPython should have a similar property, by the way, because it looks like an implementation of a Patricia Trie (where transitions are labelled with strings, not with individual characters).

But when a new key is added, a 'tail' sometimes needs to be split, even if keys are sorted alphabetically, and for longer keys this requires moving larger blocks of memory. This could explain the slowdown that other implementations don't have, and why the slowdown depends on whether keys are sorted by length (splits are easier if you start trie building with shorter keys).
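
A toy illustration (nothing like libdatrie's real code) of why splitting a tail is expensive for long keys:

```python
# Toy model, not libdatrie's actual layout: pretend each unshared suffix
# ("tail") is stored as one contiguous string. Splitting it copies the rest.
tail = 'ACGT' * 15000  # one stored ~60k-char tail

def split_tail(tail, at):
    # A new key sharing only the first `at` chars pushes the shared part
    # into trie nodes and leaves the remainder as a new, copied tail.
    return tail[:at], tail[at:]  # tail[at:] copies ~60k chars

shared, rest = split_tail(tail, 4)  # one insert, one big copy
```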

Extracting the tails in a single step after trie building might improve build speed, but that would require changes to libdatrie.

Maybe @thep has some ideas?