s-yata / marisa-trie

MARISA: Matching Algorithm with Recursively Implemented StorAge
Other
506 stars 89 forks source link

Could Keyset (memory bottleneck) be bypassed by ordering input? #37

Open dfuhry opened 3 years ago

dfuhry commented 3 years ago

As already noted in (closed) issue #18 , marisa::Keyset poses an input size (memory) bottleneck for trie construction. dawgdic has a DawgBuilder class which accepts input in (LC_ALL=C, i.e. memcmp) sorted order, and builds their data structure directly from that to avoid the need for an in-memory input record buffer (which is what I understand Marisa's marisa::Keyset to be). Since we have good tools for scalable sorting (even out-of-core; e.g. GNU sort), is is possible a similar MarisaBuilder could accept input in some preprocessed (e.g. sorted) order, and avoid the need for a marisa::Keyset? I think some of us would be willing to externally preprocess our input data if so.