s-yata / marisa-trie

MARISA: Matching Algorithm with Recursively Implemented StorAge
Other
506 stars 89 forks source link

Keyset causes bottleneck for large data sets? #18

Closed Krannich479 closed 5 years ago

Krannich479 commented 6 years ago

Hi, I am currently trying to insert every consecutive k-mer (subsequence of length k) of the human chromosome 21 into your trie structure. Since DNA has only a character alphabet of {A, C, G, T} I thought that might be a good use case for tries. With k=32 we are talking about ~55 million "words" to be inserted.

It seems that the bottleneck here is the marisa::Keyset, since this data structure obviously gets way too big for the RAM. I never even make it to the building step of marisa::trie. Is there a way to avoid building the full keyset first? Any chance to store data of that size with this library? I use pretty much the code from your README in section "Library->How to use".

s-yata commented 5 years ago

As you mentioned, marisa-trie requires much memory. It is not designed for such a large keyset.

How about using suffix tree instead of marisa-trie?