s-yata / marisa-trie

MARISA: Matching Algorithm with Recursively Implemented StorAge

MARISA_SIZE_ERROR: buf.size() > MARISA_UINT32_MAX #1

Open caseybrown89 opened 8 years ago

caseybrown89 commented 8 years ago

Hello Susumu,

We are using the Python Marisa trie wrapper (https://github.com/kmike/marisa-trie), which wraps your library. The amount of data we've been placing in the trie has been growing over time, and the most recent trie build caused the following overflow:

File "marisa_trie.pyx", line 422, in marisa_trie.BytesTrie.init (src/marisa_trie.cpp:7670) File "marisa_trie.pyx", line 127, in marisa_trie.Trie._build (src/marisa_trie.cpp:2768) RuntimeError: lib/marisa/grimoire/trie/tail.cc:192: MARISA_SIZE_ERROR: buf.size() > MARISA_UINT32_MAX

If there's any more info you need please let me know!

s-yata commented 8 years ago

It seems that your data has reached the size limitation of marisa-trie.

If you set num_tries to a value greater than 3 (DEFAULT_NUM_TRIES), you might be able to avoid the limitation. Please note that this is an ad hoc workaround even if it works.
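For reference, a minimal sketch of what passing a larger num_tries might look like from Python. It assumes the kmike/marisa-trie wrapper exposes num_tries as a constructor keyword argument (check the wrapper's documentation for your version); in the underlying C++ library the equivalent value is passed as part of the config flags to marisa::Trie::build().

```python
# Hedged sketch: raising num_tries above the default of 3, which the comment
# above suggests may keep the TAIL buffer below the 32-bit limit for some
# data sets. Assumes the Python wrapper accepts num_tries as a keyword.
import marisa_trie

# Placeholder data; in practice this would be the large key/value set
# that triggered the MARISA_SIZE_ERROR.
items = [(u"apple", b"1"), (u"banana", b"2"), (u"cherry", b"3")]

trie = marisa_trie.BytesTrie(items, num_tries=8)  # default is 3
print(trie.get(u"banana"))
```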

caseybrown89 commented 8 years ago

Thanks for the quick response, I'll give that a shot. Do you think it is possible for this library to scale out to support bigger data sets? My naive thought is that I could try moving things from a 32-bit limit to a 64-bit limit. Do you think that would work? Thanks again.

mikepb commented 8 years ago

The library is hard-coded to use UInt32 for the length. May I suggest accepting template arguments for the stored value and size storage as an enhancement? It would be a significant undertaking...

s-yata commented 5 years ago

This limitation should be removed in future...

dkoslicki commented 4 years ago

@s-yata Any chance this ancient issue will be addressed? I'm running into the same problem.

lacerda commented 4 years ago

At the risk of being an echo, I would add that as datasets grow larger, more and more people will run into this limitation. Marisa Trie is really great for my work, but I've hit this issue on my latest project.

erpic commented 2 years ago

I have encountered this issue as well (for example, when trying to build a trie of around 100M elements of about 100 bytes each). I have noticed that the library is capable of creating files larger than 4 GB (2^32 bytes).

Data has become bigger since 2016. This data structure is a real gem.

Does anyone have ideas/suggestions on how to remove this UInt32 limitation? Is it a few hours/days of work, or more? What actually needs to be done? I have not done anything in C++ for a very long time (I use the Python bindings), but I would be happy to try/help with this issue. My end goal is to be able to create tries of 10 to 100 GB using Python.

Thanks in advance for any help/pointers and congratulations to the author for an amazing library.
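One workaround that stays within the current 32-bit limit is to partition the key set across several independent tries, so that each trie's internal buffers stay well under 4 GB. The sketch below illustrates that idea in Python; it is not part of marisa-trie itself, and the shard count and hash function are arbitrary choices.

```python
# Workaround sketch (not part of marisa-trie): split a very large key/value
# set across several independent BytesTrie instances so that no single trie's
# internal buffers exceed the UInt32 limit. Shard assignment uses a stable
# CRC32 hash of the key; any deterministic partitioning would work.
import zlib
import marisa_trie

NUM_SHARDS = 16  # pick so each shard's raw data stays comfortably below 4 GB

def shard_of(key):
    # zlib.crc32 is deterministic across processes, unlike built-in hash().
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS

def build_sharded(items):
    """items: iterable of (unicode key, bytes value) pairs."""
    buckets = [[] for _ in range(NUM_SHARDS)]
    for key, value in items:
        buckets[shard_of(key)].append((key, value))
    return [marisa_trie.BytesTrie(bucket) for bucket in buckets]

def lookup(tries, key):
    # Returns the list of values stored for the key, or None if absent.
    return tries[shard_of(key)].get(key)

# Example usage with placeholder data.
tries = build_sharded([(u"apple", b"1"), (u"banana", b"2")])
print(lookup(tries, u"banana"))
```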