Open caseybrown89 opened 8 years ago
It seems that your data has reached a limit of marisa-trie.
If you set num_tries greater than 3 (DEFAULT_NUM_TRIES), you might be able to avoid the limit.
Please note that this is an ad hoc approach even if it works.
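With the Python wrapper mentioned in this thread, the suggestion above might look like the sketch below. It assumes the `marisa-trie` package is installed and that its `Trie` constructor accepts a `num_tries` keyword. The helper name `build_trie` and the value 5 are made up for illustration, and a higher value is not guaranteed to get past the limit.

```python
def build_trie(keys, num_tries=5):
    """Build a marisa trie with more nested tries than the default of 3."""
    # Imported lazily so this sketch only needs marisa-trie when it is called.
    import marisa_trie

    # Assumption: num_tries is forwarded to the underlying C++ builder.
    return marisa_trie.Trie(keys, num_tries=num_tries)
```

Calling `build_trie(["apple", "apricot", "banana"])` should return a trie that supports the usual lookups, such as `"apple" in trie`.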
Thanks for the quick response, I'll give that a shot. Do you think it is possible for this library to scale out to support bigger data sets? My naive thought is that I could try moving things from a 32-bit limit to a 64-bit limit. Do you think that would work? Thanks again.
The library is hard-coded to use UInt32 for the length. May I suggest accepting template arguments for the stored value and size storage as an enhancement? It would be a significant undertaking...
This limitation should be removed in the future...
@s-yata Any chance this ancient issue will be addressed? I'm running into the same problem
At the risk of being an echo, I would add that as datasets grow larger, more and more people will run into this issue. Marisa Trie is really great for my work, but on my latest project I've encountered this issue.
I have encountered this issue as well (for example when trying to build a trie of around 100m elements with 100 bytes each). I have noticed that the library is capable of creating files that are larger than 4GB (2^32 bytes).
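A rough back-of-the-envelope check for the numbers above: 100 million keys of 100 bytes each is about 10^10 bytes of raw key data, which is more than an unsigned 32-bit integer can count. The sketch below only compares raw key bytes with the 32-bit maximum. The library checks the size of an internal buffer rather than the raw input, so treat this as a hint and not as a prediction of when the error fires. The function names are hypothetical.

```python
UINT32_MAX = 2**32 - 1  # 4,294,967,295, the largest unsigned 32-bit value


def raw_key_bytes(num_keys, avg_key_len):
    """Rough estimate of the total bytes across all keys."""
    return num_keys * avg_key_len


def exceeds_uint32(num_bytes):
    """True if a byte count cannot be stored in an unsigned 32-bit integer."""
    return num_bytes > UINT32_MAX


total = raw_key_bytes(100_000_000, 100)
print(total, exceeds_uint32(total))  # 10000000000 True
```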
Data has become bigger since 2016. This data structure is a real gem.
Does anyone have ideas or suggestions on how to fix this UInt32 limitation? Is that a few hours or days of work, or more? What really needs to be done? I have not done anything in C++ for a very long time (I am using the Python bindings), but I would be happy to try to help with this issue. My end goal would be to be able to create tries of 10 to 100GB using Python.
Thanks in advance for any help/pointers and congratulations to the author for an amazing library.
Hello Susumu,
We are using the Python Marisa trie wrapper (https://github.com/kmike/marisa-trie) which implements your library. The amount of data we've been placing in the trie has been increasing over time and the most recent trie generation caused the following overflow:
File "marisa_trie.pyx", line 422, in marisa_trie.BytesTrie.__init__ (src/marisa_trie.cpp:7670)
File "marisa_trie.pyx", line 127, in marisa_trie.Trie._build (src/marisa_trie.cpp:2768)
RuntimeError: lib/marisa/grimoire/trie/tail.cc:192: MARISA_SIZE_ERROR: buf.size() > MARISA_UINT32_MAX
If there's any more info you need please let me know!
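For anyone hitting the same traceback, one way to try the num_tries workaround automatically is to catch the `RuntimeError` and retry with a larger value. This is a minimal sketch and not part of either library: `build_with_retry`, the `build` callable and the candidate values are all hypothetical, and it assumes the error message contains `MARISA_SIZE_ERROR` as in the traceback above.

```python
def build_with_retry(build, keys, num_tries_options=(3, 5, 8)):
    """Call build(keys, num_tries) with increasing num_tries until one succeeds.

    `build` is a caller-supplied callable, for example (an assumption):
        lambda keys, n: marisa_trie.BytesTrie(keys, num_tries=n)
    Returns a (trie, num_tries) pair for the first value that worked.
    """
    last_error = None
    for num_tries in num_tries_options:
        try:
            return build(keys, num_tries), num_tries
        except RuntimeError as exc:
            # Only retry on the size overflow; re-raise anything else.
            if "MARISA_SIZE_ERROR" not in str(exc):
                raise
            last_error = exc
    raise RuntimeError("all num_tries values hit MARISA_SIZE_ERROR") from last_error
```

If every candidate value fails, the final exception chains the last size error so the original message is still visible.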