As already noted in (closed) issue #18 , marisa::Keyset poses an input size (memory) bottleneck for trie construction.
dawgdic has a DawgBuilder class which accepts input in (LC_ALL=C, i.e. memcmp) sorted order, and builds their data structure directly from that to avoid the need for an in-memory input record buffer (which is what I understand Marisa's marisa::Keyset to be).
Since we have good tools for scalable sorting (even out-of-core; e.g. GNU sort), is is possible a similar MarisaBuilder could accept input in some preprocessed (e.g. sorted) order, and avoid the need for a marisa::Keyset? I think some of us would be willing to externally preprocess our input data if so.
As already noted in (closed) issue #18 ,
marisa::Keyset
poses an input size (memory) bottleneck for trie construction. dawgdic has a DawgBuilder class which accepts input in (LC_ALL=C, i.e. memcmp) sorted order, and builds their data structure directly from that to avoid the need for an in-memory input record buffer (which is what I understand Marisa'smarisa::Keyset
to be). Since we have good tools for scalable sorting (even out-of-core; e.g. GNU sort), is is possible a similar MarisaBuilder could accept input in some preprocessed (e.g. sorted) order, and avoid the need for amarisa::Keyset
? I think some of us would be willing to externally preprocess our input data if so.