Hi,
I am trying to insert every consecutive k-mer (substring of length k) of human chromosome 21 into your trie structure. Since DNA uses an alphabet of only four characters, {A, C, G, T}, I thought this would be a good use case for tries. With k=32, that is ~55 million "words" to insert.
The bottleneck seems to be marisa::Keyset, which grows far too large for my RAM, so I never even reach the build step of marisa::Trie. Is there a way to avoid building the full keyset first? Is there any chance of storing data of that size with this library? My code is pretty much the example from your README, section "Library->How to use".
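To make the setup concrete, here is a minimal sketch of the k-mer enumeration described above. It is an illustration under assumptions, not the exact code in question: the helper names `for_each_kmer` and `collect_kmers` are hypothetical, and a `std::vector<std::string>` stands in for marisa::Keyset so the example compiles without the library (C++17).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper (not part of marisa-trie): calls `visit` once for each
// consecutive k-mer of `sequence` and returns how many k-mers were visited.
template <typename Visitor>
std::size_t for_each_kmer(std::string_view sequence, std::size_t k, Visitor visit) {
    assert(k > 0);
    if (sequence.size() < k) return 0;  // sequence too short for any k-mer
    const std::size_t count = sequence.size() - k + 1;
    for (std::size_t i = 0; i < count; ++i) {
        visit(sequence.substr(i, k));   // a view, no copy is made here
    }
    return count;
}

// Collects every k-mer into memory. The vector is only a stand-in for the
// keyset: with marisa, the lambda body would presumably be something like
// keyset.push_back(kmer.data(), kmer.size()) instead (assumption).
std::vector<std::string> collect_kmers(std::string_view sequence, std::size_t k) {
    std::vector<std::string> kmers;
    for_each_kmer(sequence, k, [&](std::string_view kmer) {
        kmers.emplace_back(kmer);       // one copy per k-mer, duplicates included
    });
    return kmers;
}
```

Every k-mer is copied into the container before anything is built, so at ~55 million keys of 32 characters the raw key bytes alone come to roughly 1.76 GB, before any per-key bookkeeping.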