Memory Efficient Trie Creation

joe42 commented 10 years ago

Hi,

When creating a RecordTrie, the superclass _UnpackTrie unpacks all key value pairs in memory. So if I am correct, creating a Trie is not memory efficient at all. Is there a simple way to create large Tries more efficiently?

Thx, joe

derpston commented 10 years ago

I'm also very interested in this, I make some large tries with millions of entries. Some of them need 1.5gb of memory, so any reduction here would be nice.

kmike commented 10 years ago

_UnpackTrie packs values one-by-one (note it uses a generator). Its parents also use generators. The problem is that marisa-trie C++ library can use a lot of memory for trie creation: it requires all keys to be in memory (in its KeySet data structure), and trie building itself is memory-intensive.

I think the wrapper already does the right thing, and it is marisa-trie C++ library who uses memory, not the wrapper. I don't know simple ways to improve memory efficiency of trie building. If your trie is very large you can, for example, rent a high-memory EC2 instance from amazon for a short period of time - there are instances with 244 GiB of memory available, and if you use spot instances it could be pretty cheap (less than 1$ - for the last week average price was around 0.35 USD per hour).

There may be ways to improve memory usage of trie building in marisa-trie C++ library, but I don't know it well enough to comment on that. You may ask it in issue tracker of the C++ library - https://code.google.com/p/marisa-trie/.

If you need better memory efficiency at creation time you may try a different data structure. As usual, there are tradeoffs - most likely after building they'll use more memory than marisa-trie. Tries like biopython trie, datrie or hat-trie shouldn't require extra memory during trie creation, but most likely they'll use much more memory in the end. If you can properly sort your input https://github.com/kmike/DAWG can be pretty efficient at creation time and very compact after the creation, but it has a limitation on total number of items which you may hit if you have many millions of items (several millions are usually fine). Check input_is_sorted argument for DAWG constructors; you must sort data in alphabetical order, taking in account how are values converted to bytes if you use RecordDAWG.

If Python wrapper uses much more memory than marisa-trie console tools than that's a bug - please reopen the issue in this case.

pytries / marisa-trie

Memory Efficient Trie Creation #11